MineDojo: Building Open-Ended
Embodied Agents with Internet-Scale Knowledge
MineDojo：利用互联网规模的知识构建开放式的具身化代理

Linxi Fan¹, Guanzhi Wang^2∗, Yunfan Jiang^3∗, Ajay Mandlekar¹, Yuncong Yang⁴,
林希凡，关志旺，江云帆，阿贾伊·曼德勒卡尔，杨云聪
Haoyi Zhu⁵, Andrew Tang⁴, De-An Huang¹, Yuke Zhu^1 6†, Anima Anandkumar^1 2†
朱浩一，唐安德鲁，黄德安，朱宇科，阿尼玛·阿南德库马
¹NVIDIA, ²Caltech, ³Stanford, ⁴Columbia, ⁵SJTU, ⁶UT Austin
NVIDIA， ² 加州理工， ³ 斯坦福， ⁴ 哥伦比亚， ⁵ 上海交通大学， ⁶ 德克萨斯大学奥斯汀
^∗Equal contribution ^†Equal advising
平等贡献平等指导
https://minedojo.org

Abstract 摘要

Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo’s data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
自主代理在 Atari 游戏和围棋等专业领域取得了巨大进展。然而，它们通常在孤立的环境中进行白板学习，目标有限且手动设计，因此无法在广泛的任务和能力范围内进行泛化。受到人类在开放世界中不断学习和适应的启发，我们提倡构建通用代理的三个要素：1）支持多种任务和目标的环境，2）大规模的多模态知识数据库，3）灵活可扩展的代理架构。我们介绍了 MineDojo，这是一个基于流行的 Minecraft 游戏构建的新框架，它具有数千个多样化的开放式任务的模拟套件，以及一个包含 Minecraft 视频、教程、维基页面和论坛讨论的互联网规模的知识库。利用 MineDojo 的数据，我们提出了一种新颖的代理学习算法，它利用大规模预训练的视频语言模型作为学习的奖励函数。我们的代理能够在自由形式语言中解决各种开放式任务，而无需任何手动设计的密集塑形奖励。我们开源了模拟套件、知识库、算法实现和预训练模型（https://minedojo.org），以促进研究朝着通用能力的具体化代理的目标发展。

1 Introduction 1介绍

Refer to caption — Figure 1: MineDojo is a novel framework for developing open-ended, generally capable agents that can learn and adapt continually to new goals. MineDojo features a benchmarking suite with thousands of diverse open-ended tasks specified in natural language prompts, and also provides an internet-scale, multimodal knowledge base of YouTube videos, Wiki pages, and Reddit posts. The database captures the collective experience and wisdom of millions of Minecraft gamers for an AI agent to learn from. Best viewed zoomed in.
图 1：MineDojo 是一个新颖的框架，用于开发开放式、通用能力的代理程序，可以不断学习和适应新的目标。MineDojo 具有一个基准测试套件，其中包含数千个多样化的开放式任务，以自然语言提示的形式进行说明，还提供了一个互联网规模的多模态知识库，包括 YouTube 视频、维基页面和 Reddit 帖子。该数据库捕捉了数百万 Minecraft 玩家的集体经验和智慧，供 AI 代理程序学习。最好放大查看。

Developing autonomous embodied agents that can attain human-level performance across a wide spectrum of tasks has been a long-standing goal for AI research. There has been impressive progress towards this goal, most notably in games [80, 85, 126] and robotics [68, 99, 146, 134, 107]. These embodied agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity. Although highly performant, they are specialist models that do not generalize beyond a narrow set of tasks. In contrast, humans inhabit an infinitely rich reality, continuously learn from and adapt to a wide variety of open-ended tasks, and are able to leverage large amount of prior knowledge from their own experiences as well as others.
开发能够在各种任务中达到人类水平表现的自主体代理一直是人工智能研究的长期目标。在这个目标上取得了令人瞩目的进展，尤其是在游戏[80, 85, 126]和机器人技术[68, 99, 146, 134, 107]方面。这些自主体代理通常在复杂度和多样性有限的孤立世界中进行白板训练。虽然表现出色，但它们是专门的模型，无法推广到更广泛的任务范围。相比之下，人类生活在一个无限丰富的现实中，不断从各种开放性任务中学习和适应，并能够利用自己和他人的大量先前知识。

We argue that three main pillars are necessary for generalist embodied agents to emerge. First, the environment in which the agent acts needs to enable an unlimited variety of open-ended goals [116, 71, 120, 117]. Natural evolution is able to nurture an ever-expanding tree of diverse life forms thanks to the infinitely varied ecological settings that the Earth supports [117, 129]. This process has not stagnated for billions of years. In contrast, today’s agent training algorithms cease to make new progress after convergence in narrow environments [80, 146]. Second, a large-scale database of prior knowledge is necessary to facilitate learning in open-ended settings. Just as humans frequently learn from the internet, agents should also be able to harvest practical knowledge encoded in large amounts of video demos [42, 77], multimedia tutorials [79], and forum discussions [127, 65, 54]. In a complex world, it would be extremely inefficient for an agent to learn everything from scratch through trial and error. Third, the agent’s architecture needs to be flexible enough to pursue any task in open-ended environments, and scalable enough to convert large-scale knowledge sources into actionable insights [19, 96]. This motivates the design of an agent that has a unified observation/action space, conditions on natural language task prompts, and adopts the Transformer pre-training paradigm [27, 91, 15] to internalize knowledge effectively.
我们认为，通用的具身代理出现需要三个主要支柱。首先，代理所行动的环境需要能够实现无限多样的开放目标[116, 71, 120, 117]。自然进化能够培育出不断扩张的多样生命形式树，这要归功于地球支持的无限多样的生态环境[117, 129]。这个过程已经持续了数十亿年。相比之下，当今的代理训练算法在狭窄环境中收敛后就停止了新的进展[80, 146]。其次，在开放环境中促进学习需要一个大规模的先前知识数据库。就像人类经常从互联网上学习一样，代理也应该能够获取大量视频演示[42, 77]、多媒体教程[79]和论坛讨论[127, 65, 54]中编码的实用知识。在一个复杂的世界中，一个代理从头开始通过试错学习所有东西将非常低效。第三，代理的架构需要足够灵活，以在开放环境中追求任何任务，并且足够可扩展，以将大规模的知识源转化为可操作的见解[19, 96]。这激发了设计一个具有统一的观察/行动空间、依赖自然语言任务提示的代理，并采用 Transformer 预训练范式[27, 91, 15]来有效内化知识。

In light of these three pillars, we introduce MineDojo, a new framework to help the community develop open-ended, generally-capable agents. It is built on the popular Minecraft game, where a player explores a procedurally generated 3D world with diverse types of terrains to roam, materials to mine, tools to craft, structures to build, and wonders to discover. Unlike most other games [80, 85, 126], Minecraft defines no specific reward to maximize and no fixed storyline to follow, making it well suited for developing open-ended environments for embodied AI research. We make the following three major contributions:
鉴于这三个支柱，我们介绍了 MineDojo，这是一个新的框架，帮助社区开发开放性、通用性强的智能体。它建立在流行的 Minecraft 游戏之上，玩家可以在一个由程序生成的三维世界中探索，其中包含各种类型的地形、可供漫游的材料、可制作的工具、可建造的结构和可发现的奇迹。与大多数其他游戏不同[80, 85, 126]，Minecraft 没有定义特定的最大化奖励和固定的故事情节，因此非常适合开发用于体验式人工智能研究的开放性环境。我们做出了以下三个重要贡献：

1. Simulation platform with thousands of diverse open-ended tasks.
具有数千个多样化开放式任务的模拟平台。

MineDojo provides convenient APIs on top of Minecraft that standardize task specification, world settings, and agent’s observation/action spaces. We introduce a benchmark suite that consists of thousands of natural language-prompted tasks, making it two orders of magnitude larger than prior Minecraft benchmarks like the MineRL Challenge [48, 62]. The suite includes long-horizon, open-ended tasks that cannot be easily evaluated through automated procedures, such as “build an epic modern house with two floors and a swimming pool”. Inspired by the Inception score [98] and FID score [55] that are commonly used to assess AI-generated image quality, we introduce a novel agent evaluation protocol using a large video-language model pre-trained on Minecraft YouTube videos. This complements human scoring [104] that is precise but more expensive. Our learned evaluation metric has good agreement with human judgment in a subset of the full task suite considered in the experiments.
MineDojo 在 Minecraft 之上提供方便的 API，标准化任务规范、世界设置和代理的观察/动作空间。我们引入了一个基准套件，包含数千个自然语言提示的任务，比之前的 Minecraft 基准测试（如 MineRL Challenge [48, 62]）大两个数量级。该套件包括长期目标、开放式任务，无法通过自动化程序轻松评估，例如“建造一个有两层和游泳池的史诗般的现代房屋”。受到 Inception 分数[98]和 FID 分数[55]的启发，这些分数通常用于评估 AI 生成的图像质量，我们引入了一种新颖的代理评估协议，使用在 Minecraft YouTube 视频上预训练的大型视频语言模型。这补充了人工评分[104]，后者精确但更昂贵。我们学习到的评估指标在实验中对于完整任务套件的一个子集与人类判断具有良好的一致性。

2. Internet-scale multimodal Minecraft knowledge base.
互联网规模的多模态 Minecraft 知识库。

Minecraft has more than 100 million active players [131], who have collectively generated an enormous wealth of data. They record tutorial videos, stream live play sessions, compile recipes, and discuss tips and tricks on forums. MineDojo features a massive collection of 730K+ YouTube videos with time-aligned transcripts, 6K+ free-form Wiki pages, and 340K+ Reddit posts with multimedia contents (Fig. 3). We hope that this enormous knowledge base can help the agent acquire diverse skills, develop complex strategies, discover interesting objectives, and learn actionable representations automatically.
Minecraft 拥有超过 1 亿活跃玩家[131]，他们共同产生了大量的数据。他们录制教程视频，直播游戏过程，整理配方，并在论坛上讨论技巧和窍门。MineDojo 拥有超过 73 万个与时间对齐的 YouTube 视频转录，超过 6 千个自由形式的维基页面，以及超过 34 万个带有多媒体内容的 Reddit 帖子（图 3）。我们希望这个庞大的知识库能帮助代理程序获得多样化的技能，发展复杂的策略，发现有趣的目标，并自动学习可操作的表征。

3. Novel algorithm for embodied agents with large-scale pre-training.
具有大规模预训练的具身化代理的新算法。

We develop a new learning algorithm for embodied agents that makes use of the internet-scale domain knowledge we have collected from the web. Using the massive volume of YouTube videos from MineDojo, we train a video-text contrastive model in the spirit of CLIP [92], which associates natural language subtitles with their time-aligned video segments. We demonstrate that this learned correlation score can be used effectively as an open-vocabulary, massively multi-task reward function for RL training. Our agent solves the majority of 12 tasks in our experiment using the learned reward model (Fig. 2). It achieves competitive performance to agents trained with meticulously engineered dense-shaping rewards, and in some cases outperforms them, with up to 73% improvement in success rates. For open-ended tasks that do not have a simple success criterion, our agents also perform well without any special modifications.
我们为具身代理开发了一种新的学习算法，利用了我们从网络上收集到的互联网规模的领域知识。利用来自 MineDojo 的大量 YouTube 视频，我们训练了一个视频-文本对比模型，类似于 CLIP [92]的精神，它将自然语言字幕与其时间对齐的视频片段相关联。我们证明了这个学习到的相关性分数可以有效地作为开放词汇量、大规模多任务奖励函数用于强化学习训练。我们的代理使用学习到的奖励模型成功解决了我们实验中的大部分 12 个任务（图 2）。它的性能与经过精心设计的密集奖励训练的代理相当，并且在某些情况下表现优于它们，成功率提高了高达 73%。对于没有简单成功标准的开放式任务，我们的代理也表现良好，无需任何特殊修改。

In summary, this paper proposes an open-ended task suite, internet-scale domain knowledge, and agent learning with recent advances on large pre-trained models [13]. We have open-sourced MineDojo’s simulator, knowledge bases, algorithm implementations, pretrained model checkpoints, and task curation tools at https://minedojo.org/. We hope that MineDojo will serve as an effective starter framework for the community to develop new algorithms and advance towards generally capable embodied agents.
总之，本文提出了一个开放式任务套件、互联网规模的领域知识以及利用大型预训练模型的代理学习[13]。我们已经在 https://minedojo.org/上开源了 MineDojo 的模拟器、知识库、算法实现、预训练模型检查点和任务策划工具。我们希望 MineDojo 能够成为社区开发新算法并向普遍能力的具体化代理迈进的有效起点框架。

2 MineDojo Simulator & Benchmark Suite
2 MineDojo 模拟器和基准测试套件

MineDojo offers a set of simulator APIs help researchers develop generally capable, open-ended agents in Minecraft. It builds upon the open-source MineRL codebase [48] and makes the following upgrades: 1) We provide unified observation and action spaces across all tasks, facilitating the development of multi-task and continually learning agents that can constantly adapt to new scenarios and novel tasks. This deviates from the MineRL Challenge design that tailors observation and action spaces to individual tasks; 2) Our simulation unlocks all three types of worlds in Minecraft, including the Overworld, the Nether, and the End, which substantially expands the possible task space, while MineRL only supports the Overworld natively; and 3) We provide convenient APIs to configure initial conditions and world settings to standardize our tasks.
MineDojo 提供一组模拟器 API，帮助研究人员在 Minecraft 中开发具有普遍能力和开放性的代理程序。它基于开源的 MineRL 代码库[48]，并进行了以下升级：1）我们提供统一的观察和行动空间，适用于所有任务，便于开发多任务和持续学习的代理程序，可以不断适应新的情境和新的任务。这与 MineRL 挑战的设计不同，后者将观察和行动空间针对个别任务进行定制；2）我们的模拟器解锁了 Minecraft 中的三种世界类型，包括主世界、地狱和末地，大大扩展了可能的任务空间，而 MineRL 只原生支持主世界；3）我们提供方便的 API 来配置初始条件和世界设置，以标准化我们的任务。

With this MineDojo simulator, we define thousands of benchmarking tasks, which are divided into two categories: 1) Programmatic tasks that can be automatically assessed based on the ground-truth simulator states; and 2) Creative tasks that do not have well-defined or easily-automated success criteria, which motivates our novel evaluation protocol using a learned model (Sec. 4). To scale up the number of Creative tasks, we mine ideas from YouTube tutorials and use OpenAI’s GPT-3 [15] service to generate substantially more task definitions. Compared to Creative tasks, Programmatic tasks are simpler to get started, but tend to have restricted scope, limited language variations, and less open-endedness in general.
通过这个 MineDojo 模拟器，我们定义了数千个基准任务，分为两类：1）可以根据真实模拟器状态自动评估的编程任务；2）没有明确定义或容易自动化的成功标准的创造性任务，这促使我们使用学习模型（第 4 节）来进行新颖的评估协议。为了扩大创造性任务的数量，我们从 YouTube 教程中挖掘思路，并使用 OpenAI 的 GPT-3 [15]服务生成更多的任务定义。与创造性任务相比，编程任务更容易入门，但往往范围有限，语言变化少，并且总体上不太开放。

2.1 Task Suite I: Programmatic Tasks
2.1 任务套件 I：编程任务

We formalize each programmatic task as a 5-tuple: $T=(G,\mathcal{G},\mathcal{I},f_{\mathcal{S}},f_{\mathcal{R}})$ . $G$ is an English description of the task goal, such as “find material and craft a gold pickaxe”. $\mathcal{G}$ is a natural language guidance that provides helpful hints, recipes, or advice to the agent. We leverage OpenAI’s GPT-3-davinci API to automatically generate detailed guidance for a subset of the tasks. For the example goal “bring a pig into Nether”, GPT-3 returns: 1) Find a pig in the overworld; 2) Right-click on the pig with a lead; 3) Right-click on the Nether Portal with the lead and pig selected; 4) The pig will be pulled through the portal! $\mathcal{I}$ is the initial conditions of the agent and the world, such as the initial inventory, spawn terrain, and weather. $f_{\mathcal{S}}$ : $s_{t}\rightarrow\{0,1\}$ is the success criterion, a deterministic function that maps the current world state $s_{t}$ to a Boolean success label. $f_{\mathcal{R}}$ : $s_{t}\rightarrow\mathbb{R}$ is an optional dense reward function. We only provide $f_{\mathcal{R}}$ for a small subset of the tasks in MineDojo due to the high costs of meticulously crafting dense rewards. For our current agent implementation (Sec. 4.1), we do not use detailed guidance. Inspired by concurrent works SayCan [3] and Socratic Models [143], one potential idea is to feed each step in the guidance to our learned reward model sequentially so that it becomes a stagewise reward function for a complex multi-stage task.
我们将每个编程任务形式化为一个五元组： $T=(G,\mathcal{G},\mathcal{I},f_{\mathcal{S}},f_{\mathcal{R}})$ 。 $G$ 是任务目标的英文描述，例如“找到材料并制作一个金镐”。 $\mathcal{G}$ 是提供有用提示、配方或建议给代理的自然语言指导。我们利用 OpenAI 的 GPT-3-davinci API 自动生成一些任务的详细指导。对于示例目标“将猪带入地狱”，GPT-3 返回：1）在主世界找到一只猪；2）用绳子右键点击猪；3）用绳子和选中的猪右键点击地狱传送门；4）猪将被拉到传送门里！ $\mathcal{I}$ 是代理和世界的初始条件，例如初始物品栏、生成地形和天气。 $f_{\mathcal{S}}$ ： $s_{t}\rightarrow\{0,1\}$ 是成功标准，一个确定性函数，将当前世界状态 $s_{t}$ 映射到布尔值的成功标签。 $f_{\mathcal{R}}$ ： $s_{t}\rightarrow\mathbb{R}$ 是一个可选的稠密奖励函数。由于精心制作稠密奖励的成本较高，我们只为 MineDojo 中的一小部分任务提供 $f_{\mathcal{R}}$ 。对于我们当前的代理实现（第 4.1 节），我们不使用详细指导。受到 SayCan [3]和 Socratic Models [143]的并行工作的启发，一个潜在的想法是将每个步骤依次输入到我们学习的奖励模型中，从而使其成为复杂多阶段任务的逐阶段奖励函数。

MineDojo provides 4 categories of programmatic tasks with 1,581 template-generated natural language goals to evaluate the agent’s different capabilities systematically and comprehensively:
MineDojo 提供了 4 个类别的编程任务，包含 1,581 个模板生成的自然语言目标，以系统和全面地评估代理的不同能力

1.

Survival: surviving for a designated number of days.

生存：在指定的天数内生存下来。
2.

Harvest: finding, obtaining, cultivating, or manufacturing hundreds of materials and objects.

收获：寻找、获取、培育或制造数百种材料和物品。
3.

Tech Tree: crafting and using a hierarchy of tools.

科技树：制作和使用工具的层次结构。
4.

Combat: fighting various monsters and creatures that require fast reflex and martial skills.

战斗：与需要快速反应和武术技巧的各种怪物和生物战斗。

Each task template has a number of variations based on the terrain, initial inventory, quantity, etc., which form a flexible spectrum of difficulty. In comparison, the NeurIPS MineRL Diamond challenge [48] is a subset of our programmatic task suite, defined by the task goal “obtain 1 diamond" in MineDojo.
每个任务模板都有许多变化，基于地形、初始库存、数量等，形成了一个灵活的难度范围。相比之下，NeurIPS MineRL Diamond 挑战是我们编程任务套件的一个子集，由任务目标“获取 1 颗钻石”在 MineDojo 中定义。

2.2 Task Suite II: Creative Tasks
2.2 任务套件 II：创意任务

We define each creative task as a 3-tuple, $T=(G,\mathcal{G},\mathcal{I})$ , which differs from programmatic tasks due to the lack of straightforward success criteria. Inspired by model-based metrics like the Inception score [98] and FID score [55] for image generation, we design a novel task evaluation metric based on a pre-trained contrastive video-language model (Sec. 4.1). In the experiments, we find that the learned metric exhibits a high level of agreement with human evaluations (see Table 2).
我们将每个创意任务定义为一个 3 元组 $T=(G,\mathcal{G},\mathcal{I})$ ，与程序化任务不同的是，由于缺乏明确的成功标准。受到图像生成的基于模型的度量指标如 Inception 分数[98]和 FID 分数[55]的启发，我们设计了一种基于预训练对比视频-语言模型的新型任务评估指标（第 4.1 节）。在实验中，我们发现学习到的指标与人类评估具有高度一致性（见表 2）。

We brainstorm and author 216 Creative tasks, such as “build a haunted house with zombie inside” and “race by riding a pig”. Nonetheless, such a manual approach is not scalable. Therefore, we develop two systematic approaches to extend the total number of task definitions to 1,560. This makes our Creative tasks 3 orders of magnitude larger than Minecraft BASALT challenge [104], which has 4 Creative tasks.
我们进行头脑风暴并创作了 216 个创意任务，例如“建造一个里面有僵尸的鬼屋”和“骑猪比赛”。然而，这种手动方法不具有可扩展性。因此，我们开发了两种系统方法，将任务定义的总数扩展到 1,560 个。这使得我们的创意任务比 Minecraft BASALT 挑战[104]的 4 个创意任务大了 3 个数量级。

Approach 1. Task Mining from YouTube Tutorial Videos.
方法 1. 从 YouTube 教程视频中挖掘任务。

We identify our YouTube dataset as a rich source of tasks, as many human players demonstrate and narrate creative missions in the tutorial playlists. To collect high-quality tasks and accompanying videos, we design a 3-stage pipeline that makes it easy to find and annotate interesting tasks (see Sec. C.2 for details). Through this pipeline, we extract 1,042 task ideas from the common wisdom of a huge number of veteran Minecraft gamers, such as “make an automated mining machine” and “grow cactus up to the sky”.
我们将 YouTube 数据集视为一个丰富的任务来源，因为许多玩家在教程播放列表中展示和叙述创造性任务。为了收集高质量的任务和相关视频，我们设计了一个三阶段的流程，使得寻找和注释有趣的任务变得容易（详见第 C.2 节）。通过这个流程，我们从大量经验丰富的 Minecraft 老玩家的共同智慧中提取了 1,042 个任务创意，例如“制作一个自动化的采矿机器”和“将仙人掌种植到天空中”。

Approach 2. Task Creation by GPT-3.
方法 2. 通过 GPT-3 创建任务。

We leverage GPT-3’s few-shot capability to generate new task ideas by seeding it with the tasks we manually author or mine from YouTube. The prompt template is: Here are some example creative tasks in Minecraft: {a few examples}. Let’s brainstorm more detailed while reasonable creative tasks in Minecraft.
我们利用 GPT-3 的少样本能力，通过向其提供我们手动编写或从 YouTube 挖掘的任务来生成新的任务想法。提示模板是：这里有一些 Minecraft 中的创意任务示例：{几个示例}。让我们在 Minecraft 中进行更详细但合理的创意任务头脑风暴。
GPT-3 contributes 302 creative tasks after de-duplication, and demonstrates a surprisingly proficient understanding of Minecraft terminology.
GPT-3 在去重后贡献了 302 个创意任务，并展示出对 Minecraft 术语的出乎意料的熟练理解。

2.3 Collection of Starter Tasks
2.3 起始任务的收集

We curate a set of 64 core tasks for future researchers to get started more easily. If their agent works well on these tasks, they can more confidently scale to the full benchmark.
我们为未来的研究人员策划了一组 64 个核心任务，以便更轻松地开始。如果他们的代理在这些任务上表现良好，他们就可以更有信心地扩展到完整的基准测试。

•

32 programmatic tasks: 16 “standard” and 16 “difficult”, spanning all 4 categories (survival, harvesting, combat, and tech tree). We rely on our Minecraft knowledge to decide the difficulty level. “Standard” tasks require fewer steps and lower resource dependencies to complete.

• 32 个程序化任务：16 个“标准”和 16 个“困难”，涵盖了所有 4 个类别（生存、收获、战斗和科技树）。我们依靠我们的 Minecraft 知识来决定难度级别。“标准”任务需要更少的步骤和更低的资源依赖来完成。
•

32 creative tasks: 16 “standard” and 16 “difficult”. Similarly, tasks labeled with “standard” are typically short-horizon tasks.

• 32 个创意任务：16 个“标准”和 16 个“困难”。同样，标有“标准”的任务通常是短期任务。

We recommend that researchers run 100 evaluation episodes for each task and report the percentage success rate. The programmatic tasks have ground-truth success, while the creative tasks need our novel evaluation protocol (Sec. 5).
我们建议研究人员对每个任务运行 100 个评估周期，并报告成功率的百分比。编程任务具有真实成功率，而创造性任务需要我们的新颖评估协议（第 5 节）。

3 Internet-scale Knowledge Base
3 互联网规模的知识库

Two commonly used approaches [112, 126, 85, 36] to train embodied agents include training agents from scratch using RL with well-tuned reward functions for each task, or using a large amount of human-demonstrations to bootstrap agent learning. However, crafting well-tuned reward functions is challenging or infeasible for our task suite (Sec. 2.2), and employing expert gamers to provide large amounts of demonstration data would also be costly and infeasible [126].
训练具身代理的两种常用方法[112, 126, 85, 36]包括使用经过调整的奖励函数从头开始使用强化学习训练代理，或者使用大量的人类示范来启动代理学习。然而，为我们的任务套件（第 2.2 节）设计出经过调整的奖励函数是具有挑战性或不可行的，而雇佣专家玩家提供大量示范数据也将是昂贵且不可行的[126]。

Instead, we turn to the open web as an ever-growing, virtually unlimited source of learning material for embodied agents. The internet provides a vast amount of domain knowledge about Minecraft, which we harvest by extensive web scraping and filtering. We collect 33 years worth of YouTube videos, 6K+ Wiki pages, and millions of Reddit comment threads. Instead of hiring a handful of human demonstrators, we capture the collective wisdom of millions of Minecraft gamers around the world. Furthermore, language is a key and pervasive component of our database that takes the form of YouTube transcripts, textual descriptions in Wiki, and Reddit discussions. Language facilitates open-vocabulary understanding, provides grounding for image and video modalities, and unlocks the power of large language models [27, 109, 15] for embodied agents. To ensure socially responsible model development, we take special measures to filter out low-quality and toxic contents [13, 51] from our databases, detailed in the Appendix (Sec. D).
相反，我们转向开放网络作为具有无限学习材料的源头，供具有实体的代理使用。互联网提供了大量关于 Minecraft 的领域知识，我们通过广泛的网络爬取和过滤来收集这些知识。我们收集了 33 年的 YouTube 视频，6K+的维基页面和数百万条 Reddit 评论线程。我们捕捉了全球数百万 Minecraft 玩家的集体智慧，而不是雇佣一小部分人类示范者。此外，语言是我们数据库的一个关键和普遍组成部分，它以 YouTube 的转录、维基的文本描述和 Reddit 的讨论形式存在。语言促进了开放词汇的理解，为图像和视频模态提供了基础，并为具有实体的代理解锁了大型语言模型的能力[27, 109, 15]。为了确保社会责任的模型开发，我们采取特殊措施从我们的数据库中过滤掉低质量和有毒的内容[13, 51]，详细内容请参见附录（第 D 节）。

YouTube Videos and Transcripts.
YouTube 视频和文字稿件。

Minecraft is among the most streamed games on YouTube [41]. Human players have demonstrated a stunning range of creative activities and sophisticated missions that take hours to complete (examples in Fig. 3). We collect 730K+ narrated Minecraft videos, which add up to 33 years of duration and 2.2B words in English transcripts. In comparison, HowTo100M [77] is a large-scale human instructional video dataset that includes 15 years of experience in total – about half of our volume. The time-aligned transcripts enable the agent to ground free-form natural language in video pixels and learn the semantics of diverse activities without laborious human labeling. We operationalize this insight in our pre-trained video-language model (Sec. 4.1).
Minecraft 是 YouTube 上最受欢迎的游戏之一[41]。人类玩家展示了令人惊叹的创造活动和复杂任务，完成这些任务需要数小时（图 3 中有示例）。我们收集了超过 730K 个解说的 Minecraft 视频，总时长达 33 年，英文转录中包含了 22 亿个单词。相比之下，HowTo100M [77]是一个大规模的人类教学视频数据集，总共包含了 15 年的经验 - 大约是我们的一半。时间对齐的转录使得智能体能够将自由形式的自然语言与视频像素进行关联，并学习各种活动的语义，而无需费力的人工标注。我们在预训练的视频语言模型中实现了这一洞察（第 4.1 节）。

Minecraft Wiki. 我的世界百科全书。

The Wiki pages cover almost every aspect of the game mechanics, and supply a rich source of unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials. We use Selenium [103] to scrape 6,735 pages that interleave text, images, tables, and diagrams. The pages are highly unstructured and do not share any common schema, as the Wiki is meant for human consumption rather than AI training. To preserve the layout information, we additionally save the screenshots of entire pages and extract 2.2M bounding boxes of the visual elements (visualization in Fig. A.4 and A.5). We do not use Wiki data in our current experiments. Since the Wiki contains detailed recipes for all crafted objects, they could be provided as input or training data for hierarchical planning methods and policy sketches [8]. Another promising future direction is to apply document understanding models such as LayoutLM [138, 137] and DocFormer [9] to learn actionable knowledge from these unstructured Wiki data.
维基页面涵盖了游戏机制的几乎所有方面，并提供了丰富的非结构化知识来源，包括多模式表格、配方、插图和逐步教程。我们使用 Selenium [103]来爬取 6,735 个页面，这些页面交错着文本、图片、表格和图表。这些页面非常不规则，没有共同的模式，因为维基是为人类消费而不是 AI 训练而设计的。为了保留布局信息，我们还保存了整个页面的截图，并提取了 2.2M 个可视元素的边界框（见图 A.4 和 A.5 中的可视化）。我们在当前的实验中没有使用维基数据。由于维基包含了所有制作物品的详细配方，它们可以作为分层规划方法和策略草图[8]的输入或训练数据。另一个有前景的未来方向是应用文档理解模型，如 LayoutLM [138, 137]和 DocFormer [9]，从这些非结构化的维基数据中学习可操作的知识。

Reddit.

We scrape 340K+ posts along with 6.6M comments under the “r/Minecraft” subreddit. These posts ask questions on how to solve certain tasks, showcase cool architectures and achievements in image/video snippets, and discuss general tips and tricks for players of all expertise levels. We do not use Reddit data for training in Sec. 5, but a potential idea is to finetune large language models [27, 91] on our Reddit corpus to generate instructions and execution plans that are better grounded in the Minecraft domain. Concurrent works [3, 56, 143] have explored similar ideas and showed excellent results on robot learning, which is encouraging for more future research in MineDojo.
我们爬取了“r/Minecraft”子论坛下的 340K+帖子和 6.6M 条评论。这些帖子提出了关于如何解决特定任务的问题，展示了在图像/视频片段中的酷炫建筑和成就，并讨论了适用于各种技能水平的玩家的一般提示和技巧。我们在第 5 节中没有使用 Reddit 数据进行训练，但一个潜在的想法是在我们的 Reddit 语料库上微调大型语言模型[27, 91]，以生成更好地与 Minecraft 领域相关的指令和执行计划。同时进行的研究[3, 56, 143]探索了类似的想法，并在机器人学习方面取得了出色的结果，这对于 MineDojo 的更多未来研究是鼓舞人心的。

4 Agent Learning with Large-scale Pre-training
4 代理学习与大规模预训练

One of the grand challenges of embodied AI is to build a single agent that can complete a wide range of open-world tasks. The MineDojo framework aims to facilitate new techniques towards this goal by providing an open-ended task suite (Sec. 2) and large-scale internet knowledge base (Sec. 3). Here we take an initial step towards this goal by developing a proof of concept that demonstrates how a single language-prompted agent can be trained in MineDojo to complete several complex Minecraft tasks. To this end, we propose a novel agent learning algorithm that takes advantage of the massive YouTube data offered by MineDojo. We note that this is only one of the numerous possible ways to use MineDojo’s internet database — the Wiki and Reddit corpus also hold great potential to drive new algorithm discoveries for the community in future works.
体验智能的一个重大挑战是构建一个能够完成各种开放世界任务的单一代理。MineDojo 框架旨在通过提供一个开放式任务套件（第 2 节）和大规模互联网知识库（第 3 节）来促进朝着这个目标的新技术。在这里，我们通过开发一个概念验证来迈出实现这个目标的初始步骤，该验证展示了一个单一的语言提示代理如何在 MineDojo 中接受训练以完成几个复杂的 Minecraft 任务。为此，我们提出了一种新颖的代理学习算法，利用了 MineDojo 提供的大量 YouTube 数据。我们注意到，这只是使用 MineDojo 互联网数据库的众多可能方式之一 - 维基和 Reddit 语料库在未来的工作中也具有推动社区进行新算法发现的巨大潜力。

In this paper, we consider a multi-task reinforcement learning (RL) setting, where an agent is tasked with completing a collection of MineDojo tasks specified by language instructions (Sec. 2). Solving these tasks often requires the agent to interact with the Minecraft world in a prolonged fashion. Agents developed in popular RL benchmarks [119, 146] often rely on meticulously crafted dense and task-specific reward functions to guide random explorations. However, these rewards are hard or even infeasible to define for our diverse and open-ended tasks in MineDojo. To address this challenge, our key insight is to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts. Therefore, we introduce MineCLIP, a contrastive video-language model that learns to correlate video snippets and natural language descriptions (Fig. 4). MineCLIP is multi-task by design, as it is trained on open-vocabulary and diverse English transcripts.
在本文中，我们考虑了一个多任务强化学习（RL）的设置，其中一个代理被指定完成一系列由语言指令（第 2 节）指定的 MineDojo 任务。解决这些任务通常需要代理与 Minecraft 世界进行长时间的交互。在流行的 RL 基准测试中开发的代理[119, 146]通常依赖于精心设计的密集和任务特定的奖励函数来引导随机探索。然而，对于我们在 MineDojo 中多样化和开放性的任务来说，这些奖励很难甚至不可行定义。为了解决这个挑战，我们的关键见解是从野外 YouTube 视频和它们的转录中学习一个密集的、以语言为条件的奖励函数。因此，我们引入了 MineCLIP，一个对比视频语言模型，它学习将视频片段和自然语言描述相关联（图 4）。MineCLIP 是多任务的设计，因为它在开放词汇和多样化的英文转录上进行训练。

During RL training, MineCLIP provides a high-quality reward signal without any domain adaptation techniques, despite the domain gap between noisy YouTube videos and clean simulator-rendered frames. MineCLIP eliminates the need to manually engineer reward functions for each and every MineDojo task. For Creative tasks that lack a simple success criterion (Sec. 2.2), MineCLIP also serves the dual purpose of an automatic evaluation metric that agrees well with human judgement on a subset of tasks we investigate (Sec. 4.2, Table 2). Because the learned reward model incurs a non-trivial computational overhead, we introduce several techniques to significantly improve RL training efficiency, making MineCLIP a practical module for open-ended agent learning in Minecraft (Sec. 4.2).
在 RL 训练过程中，尽管嘈杂的 YouTube 视频和清晰的模拟器渲染帧之间存在领域差异，MineCLIP 提供了高质量的奖励信号，而无需任何领域适应技术。MineCLIP 消除了为每个 MineDojo 任务手动设计奖励函数的需要。对于缺乏简单成功标准的创造性任务（第 2.2 节），MineCLIP 还兼具自动评估指标的双重作用，在我们调查的一部分任务中与人类判断相一致（第 4.2 节，表 2）。由于学习的奖励模型会带来非常重要的计算开销，我们引入了几种技术来显著提高 RL 训练效率，使 MineCLIP 成为 Minecraft 中开放式智能学习的实用模块（第 4.2 节）。

[Uncaptioned image] — Table 1: Our novel MineCLIP reward model is able to achieve competitive performance with manually written dense reward function for Programmatic tasks, and significantly outperforms the CLIP_OpenAI method across all Creative tasks. Entries represent percentage success rates averaged over 3 seeds, each tested for 200 episodes. Success conditions are precise in Programmatic tasks, but estimated by MineCLIP for Creative tasks.
表 1：我们的新颖的 MineCLIP 奖励模型在程序化任务中能够达到与手动编写的密集奖励函数相竞争的性能，并且在所有创造性任务中明显优于 CLIP 方法。条目表示在 3 个种子上平均测试 200 个剧集的成功率百分比。在程序化任务中，成功条件是精确的，但在创造性任务中是由 MineCLIP 估计的。

4.1 Pre-Training MineCLIP on Large-scale Videos
4.1 在大规模视频上进行预训练的 MineCLIP

Formally, the learned reward function can be defined as $\Phi_{\mathcal{R}}:(G,V)\rightarrow\mathbb{R}$ that maps a language goal $G$ and a video snippet $V$ to a scalar reward. An ideal $\Phi_{\mathcal{R}}$ should return a high reward if the behavior depicted in the video faithfully follows the language description, and a low reward otherwise. This can be achieved by optimizing the InfoNCE objective [125, 52, 20], which learns to correlate positive video and text pairs [118, 6, 78, 4, 136].
形式上，学习到的奖励函数可以定义为将语言目标和视频片段映射到标量奖励的函数。理想的奖励函数应该在视频中所展示的行为忠实地遵循语言描述时返回高奖励，否则返回低奖励。这可以通过优化 InfoNCE 目标来实现，该目标学习相关的正面视频和文本对。

Similar to the image-text CLIP model [92], MineCLIP is composed of a separate text encoder $\phi_{G}$ that embeds a language goal and a video encoder $\phi_{V}$ that embeds a moving window of 16 consecutive frames with $160\times 256$ resolution (Fig. 4). Our neural architecture has a similar design as CLIP4Clip [75], where $\phi_{G}$ reuses OpenAI CLIP’s pretrained text encoder, and $\phi_{V}$ is factorized into a frame-wise image encoder $\phi_{I}$ and a temporal aggregator $\phi_{a}$ that summarizes the sequence of 16 image features into a single video embedding. Unlike CLIP4Clip, we insert two extra layers of residual CLIP Adapter [38] after the aggregator $\phi_{a}$ to produce a better video feature, and finetune only the last two layers of the pretrained $\phi_{I}$ and $\phi_{G}$ .
与图像文本 CLIP 模型[92]类似，MineCLIP 由一个独立的文本编码器 $\phi_{G}$ 和一个视频编码器 $\phi_{V}$ 组成，该编码器嵌入了一个连续 16 帧的移动窗口，分辨率为 $160\times 256$ （图 4）。我们的神经架构与 CLIP4Clip [75]具有相似的设计，其中 $\phi_{G}$ 重用了 OpenAI CLIP 的预训练文本编码器，而 $\phi_{V}$ 则被分解为逐帧图像编码器 $\phi_{I}$ 和时间聚合器 $\phi_{a}$ ，将 16 个图像特征序列汇总为单个视频嵌入。与 CLIP4Clip 不同的是，我们在聚合器 $\phi_{a}$ 之后插入了两个额外的残差 CLIP 适配器层[38]，以生成更好的视频特征，并且只微调了预训练的最后两层 $\phi_{I}$ 和 $\phi_{G}$ 。

From the MineDojo YouTube database, we follow the procedure in VideoCLIP [136] to sample 640K pairs of 16-second video snippets and time-aligned English transcripts, after applying a keyword filter. We train two MineCLIP variants with different types of aggregator $\phi_{a}$ : (1) MineCLIP[avg] does simple average pooling, which is fast but agnostic to the temporal ordering; (2) MineCLIP[attn] encodes the sequence by two transformer layers, which is relatively slower but captures more temporal information, and thus produces a better reward signal in general. Details of data preprocessing, architecture, and hyperparameters are listed in the Appendix (Sec. E).
从 MineDojo YouTube 数据库中，我们按照 VideoCLIP [136]中的过程，通过应用关键词过滤器，采样了 640K 对 16 秒视频片段和时间对齐的英文转录。我们使用不同类型的聚合器训练了两个 MineCLIP 变体： $\phi_{a}$ ：(1) MineCLIP[avg]使用简单的平均池化，速度快但对时间顺序不敏感；(2) MineCLIP[attn]通过两个 Transformer 层对序列进行编码，相对较慢但能捕捉更多的时间信息，因此通常产生更好的奖励信号。数据预处理、架构和超参数的详细信息请参见附录（第 E 节）。

4.2 RL with MineCLIP Reward
4.2RL 与 MineCLIP 奖励

We train a language-conditioned policy network that takes as input raw pixels and predicts discrete control. The policy is trained with PPO [102] on the MineCLIP rewards. In each episode, the agent is prompted with a language goal and takes a sequence of actions to fulfill this goal. When calculating the MineCLIP rewards, we concatenate the agent’s latest 16 egocentric RGB frames in a temporal window to form a video snippet. MineCLIP handles all task prompts zero-shot without any further finetuning. In our experiments (Sec. 5), we show that MineCLIP provides effective dense rewards out of the box, despite the domain shift between in-the-wild YouTube frames and simulator frames. Besides regular video data augmentation, we do not employ any special domain adaptation methods during pre-training. Our finding is consistent with CLIP’s strong zero-shot performances on robustness benchmarks in object recognition [92].
我们训练了一个以语言为条件的策略网络，它以原始像素作为输入，并预测离散控制。该策略使用 PPO [102]在 MineCLIP 奖励上进行训练。在每个 episode 中，代理被提示一个语言目标，并采取一系列动作来实现这个目标。在计算 MineCLIP 奖励时，我们将代理的最新 16 个自我中心 RGB 帧连接在一个时间窗口中，形成一个视频片段。MineCLIP 可以在没有任何进一步微调的情况下处理所有任务提示。在我们的实验中（第 5 节），我们展示了尽管在野外 YouTube 帧和模拟器帧之间存在领域转移，MineCLIP 仍然可以提供有效的密集奖励。除了常规的视频数据增强之外，在预训练期间我们没有使用任何特殊的领域适应方法。我们的发现与 CLIP 在目标识别的鲁棒性基准测试中的强大的零样本性能一致[92]。

Compared to hard-coded reward functions in popular benchmarks [146, 119, 34], the MineCLIP model has 150M parameters and is thus much more expensive to query. We make several design choices to greatly accelerate RL training with MineCLIP in the loop:
与流行基准测试中的硬编码奖励函数相比[146, 119, 34]，MineCLIP 模型具有 1.5 亿个参数，因此查询成本更高。我们做出了几个设计选择，以极大地加速在循环中使用 MineCLIP 进行强化学习训练。

1.

The language goal $G$ is fixed for a specific task, so the text features $\phi_{G}$ can be precomputed to avoid invoking the text encoder repeatedly.

语言目标 $G$ 是针对特定任务固定的，因此可以预先计算文本特征 $\phi_{G}$ ，以避免重复调用文本编码器。
2.

Our agent’s RGB encoder reuses the pre-trained weights of $\phi_{I}$ from MineCLIP. We do not finetune $\phi_{I}$ during RL training, which saves computation and endows the agent with good visual representations from the beginning.

我们的代理的 RGB 编码器重复使用了 MineCLIP 中 $\phi_{I}$ 的预训练权重。在强化学习训练过程中，我们不对 $\phi_{I}$ 进行微调，这样可以节省计算资源，并使代理从一开始就具备良好的视觉表示能力。
3.

MineCLIP’s video encoder $\phi_{V}$ is factorized into an image encoder $\phi_{I}$ and a light-weight aggregator $\phi_{a}$ . This design choice enables efficient image feature caching. Consider two overlapping video sequences of 8 frames, V[0:8] and V[1:9]. We can cache the image features of the 7 overlapping frames V[1] to V[7] to maximize compute savings. If $\phi_{V}$ is a monolithic model like S3D [135] in VideoCLIP [136], then the video features from every sliding window must be recomputed, which would incur a much higher cost per time step.

3. MineCLIP 的视频编码器 $\phi_{V}$ 被分解为图像编码器 $\phi_{I}$ 和轻量级聚合器 $\phi_{a}$ 。这种设计选择可以实现高效的图像特征缓存。考虑两个重叠的 8 帧视频序列 V[0:8]和 V[1:9]。我们可以缓存 7 个重叠帧 V[1]到 V[7]的图像特征，以最大化计算节省。如果 $\phi_{V}$ 是像 VideoCLIP [136]中的 S3D [135]这样的整体模型，那么每个滑动窗口的视频特征都必须重新计算，这将导致每个时间步的成本更高。
4.

We leverage Self-Imitation Learning [84] to store the trajectories with high MineCLIP reward values in a buffer, and alternate between PPO and self-imitation gradient steps. It further improves sample efficiency as shown in the Appendix (Fig. A.8).

我们利用自我模仿学习[84]将具有较高 MineCLIP 奖励值的轨迹存储在缓冲区中，并在 PPO 和自我模仿梯度步骤之间交替进行。如附录所示（图 A.8），它进一步提高了样本效率。

Table 2: MineCLIP agrees well with the ground-truth human judgment on the Creative tasks we consider. Numbers are F1 scores between MineCLIP’s binary classification of tasks success and human labels (scaled to the percentage for better readability).
表 2：MineCLIP 在我们考虑的创意任务上与真实人类判断非常一致。数字是 MineCLIP 对任务成功的二分类和人类标签之间的 F1 分数（为了更好的可读性，将其缩放为百分比）。

Tasks 任务	Find Nether Portal 寻找下界传送门	Find Ocean 寻找海洋	Dig Hole 挖洞	Lay Carpet 铺地毯
Ours (Attn) 我们的（注意）	$98.7$	${\color[rgb]{0.09,0.45,0.27}\mathbf{100.0}}$	$99.4$	$97.4$
Ours (Avg) 我们的（平均）	${\color[rgb]{0.09,0.45,0.27}\mathbf{100.0}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{100.0}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{100.0}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{98.4}}$
CLIP_OpenAI 剪辑	$48.7$	$98.4$	$80.6$	$54.1$

5 Experiments 5 个实验

We evaluate our agent-learning approach (Section 4) on 8 Programmatic tasks and 4 Creative tasks from the MineDojo benchmarking suite. We select these 12 tasks due to the diversity of skills required to solve them (e.g., harvesting, combat, building, navigation) and domain-specific entities (e.g., animals, resources, monsters, terrains, and structures). We split the tasks into 3 groups and train one multi-task agent for each group: Animal-Zoo (4 Programmatic tasks on hunting or harvesting resource from animals), Mob-Combat (Programmatic, fight 4 types of hostile monsters), and Creative (4 tasks).
我们在 MineDojo 基准套件中评估了我们的代理学习方法（第 4 节）在 8 个程序化任务和 4 个创造性任务上。我们选择这 12 个任务是因为解决它们需要多样化的技能（例如收获、战斗、建造、导航）和领域特定的实体（例如动物、资源、怪物、地形和结构）。我们将任务分为 3 组，并为每组训练一个多任务代理：动物园（4 个程序化任务，猎杀或从动物身上收集资源）、怪物战斗（程序化，与 4 种敌对怪物战斗）和创造性（4 个任务）。

In the experiments, we empirically check the quality of MineCLIP against manually written reward functions, and quantify how different variants of our learned model affect the RL performance. Table 1 presents our main results, and Fig. 2 visualizes our learned agent behavior in 4 of the considered tasks. Policy networks of all methods share the same architecture and are trained by PPO + Self-Imitation (Sec. 4.2, training details in the Appendix, Sec. F). We compare the following methods:
在实验中，我们经验性地检查了 MineCLIP 与手动编写的奖励函数的质量，并量化了我们学习模型的不同变体对强化学习性能的影响。表 1 呈现了我们的主要结果，图 2 可视化了我们在 4 个考虑的任务中学习到的代理行为。所有方法的策略网络共享相同的架构，并通过 PPO + 自我模仿进行训练（第 4.2 节，训练细节见附录，第 F 节）。我们比较以下方法：

•

Ours (Attn): our agent trained with the MineCLIP[attn] reward model. For Programmatic tasks, we also add the final success condition as a binary reward. For Creative tasks, MineCLIP is the only source of reward.

我们（注意）：我们的代理人接受了基于 MineCLIP 的奖励模型的训练。对于程序化任务，我们还将最终成功条件作为二进制奖励添加。对于创造性任务，MineCLIP 是唯一的奖励来源。
•

Ours (Avg): the average-pooling variant of our method.

我们的（平均值）：我们方法的平均池化变体。
•

Manual Reward: hand-engineered dense reward using ground-truth simulator states.

手动奖励：使用真实模拟器状态进行手工设计的密集奖励。
•

Sparse-only: the final binary success as a single sparse reward. Note that neither sparse-only nor manual reward is available for Creative tasks.

只有稀疏奖励：最终的二进制成功作为单一的稀疏奖励。请注意，创造性任务不提供稀疏奖励或手动奖励。
•

CLIP_OpenAI: pre-trained OpenAI CLIP model that has not been finetuned on any MineDojo videos.

• CLIP _OpenAI ：未在任何 MineDojo 视频上进行微调的预训练 OpenAI CLIP 模型。

MineCLIP is competitive with manual reward.
MineCLIP 与手动奖励相媲美。

For Programmatic tasks (first 8 rows), RL agents guided by MineCLIP achieve competitive performance as those trained by manual reward. In three of the tasks, they even outperform the hand-engineered reward functions, which rely on privileged simulator states unavailable to MineCLIP. For a more statistically sound analysis, we conduct the Paired Student’s t-test to compare the mean success rate of each task (pairing column 3 “Ours (Attn)” and column 5 “Manual Reward” in Table 1). The test yields p-value $0.3991\gg 0.05$ , which indicates that the difference between our method and manual reward is not considered statistically significant, and therefore they are comparable with each other. Because all tasks require nontrivial exploration, our approach also dominates the Sparse-only baseline. Note that the original OpenAI CLIP model fails to achieve any success. We hypothesize that the creatures in Minecraft look dramatically different from their real-world counterparts, which causes CLIP to produce misleading signals worse than no shaping reward at all. It implies the importance of finetuning on MineDojo’s YouTube data.
对于编程任务（前 8 行），由 MineCLIP 引导的 RL 代理在性能上与手动奖励训练的代理相媲美。在其中三个任务中，它们甚至超过了依赖于 MineCLIP 无法获取的特权模拟器状态的手工设计奖励函数。为了进行更加统计上可靠的分析，我们进行了配对的学生 t 检验，比较了每个任务的平均成功率（表 1 中的第 3 列“Ours (Attn)”和第 5 列“Manual Reward”进行配对）。检验结果得到 p 值小于 0.05，这表明我们的方法与手动奖励之间的差异在统计上不具有显著性，因此它们是可比较的。由于所有任务都需要复杂的探索，我们的方法也优于仅稀疏奖励的基线。请注意，原始的 OpenAI CLIP 模型无法取得任何成功。我们假设 Minecraft 中的生物与现实世界中的对应物看起来截然不同，这导致 CLIP 产生的误导信号比没有任何奖励更糟糕。这说明了在 MineDojo 的 YouTube 数据上进行微调的重要性。

MineCLIP provides automated evaluation.
MineCLIP 提供自动评估。

For Creative tasks (last 4 rows), there are no programmatic success criteria available. We convert MineCLIP into a binary success classifier by thresholding the reward value it outputs for an episode. To test the quality of MineCLIP as an automatic evaluation metric, we ask human judges to curate a dataset of 100 successful and 100 failed trajectories for each task. We then run both MineCLIP variants and CLIP_OpenAI on the dataset and report the binary F1 score of their judgement against human ground-truth in Table 2. The results demonstrate that both MineCLIP[attn] and MineCLIP[avg] attain a very high degree of agreement with human evaluation results on this subset of the Creative task suite that we investigate. CLIP_OpenAI baseline also achieves nontrivial agreement on Find Ocean and Dig Hole tasks, likely because real-world oceans and holes have similar texture. We use the attn variant as an automated success criterion to score the 4 Creative task results in Table 1. Our proposed method consistently learns better than CLIP_OpenAI-guided agents. It shows that MineCLIP is an effective approach to solving open-ended tasks when no straightforward reward signal is available. We provide further analysis beyond these 4 tasks in the Appendix (Sec. G.4).
对于创造性任务（最后 4 行），没有可用的程序化成功标准。我们通过对 MineCLIP 输出的奖励值进行阈值处理，将其转化为二元成功分类器。为了测试 MineCLIP 作为自动评估指标的质量，我们请人类评委为每个任务策划了 100 个成功和 100 个失败的轨迹数据集。然后我们在数据集上运行 MineCLIP 的两个变体和 CLIP，并在表 2 中报告它们与人类真实结果的二元 F1 得分。结果表明，MineCLIP[attn]和 MineCLIP[avg]在我们研究的创造性任务子集上与人类评估结果达成了非常高的一致性。CLIP 基准[b1]在寻找海洋和挖洞任务上也取得了一定的一致性，可能是因为现实世界中的海洋和洞穴具有相似的纹理。我们使用 attn 变体作为自动化的成功标准，对表 1 中的 4 个创造性任务结果进行评分。我们提出的方法始终比 CLIP 引导的代理学习效果更好。这表明当没有直接的奖励信号可用时，MineCLIP 是解决开放式任务的有效方法。我们在附录（第 G.4 节）中提供了超出这 4 个任务的进一步分析。

MineCLIP shows good zero-shot generalization to significant visual distribution shift.
MineCLIP 在面对显著的视觉分布转变时展现出良好的零样本泛化能力。

We evaluate the learned policy without finetuning on a combination of unseen weather, lighting conditions, and terrains — 27 scenarios in total. For the baseline, we train agents with the original CLIP_OpenAI image encoder (not trained on our YouTube videos) by imitation learning. The robustness against visual shift can be quantitatively measured by the relative performance degradation on novel test scenarios for each task. Table 3 shows that while all methods incur performance drops, agents with MineCLIP encoder is more robust to visual changes than the baseline across all tasks. Pre-training on diverse in-the-wild YouTube videos is important to boosting zero-shot visual generalization capability, a finding consistent with literature [92, 82].
我们在未经微调的情况下评估了学习策略在未见过的天气、光照条件和地形的组合下的表现-总共 27 个场景。作为基准，我们通过模仿学习使用原始的 CLIP _OpenAI 图像编码器（未经我们的 YouTube 视频训练）来训练代理。对于每个任务，可以通过相对性能下降来定量衡量对视觉变化的鲁棒性。表 3 显示，虽然所有方法都会导致性能下降，但 MineCLIP 编码器的代理在所有任务中对视觉变化更具鲁棒性，相比基准方法。在多样化的野外 YouTube 视频上进行预训练对于提高零样本视觉泛化能力是重要的，这一发现与文献[92, 82]一致。

Learning a Single Agent for All 12 Tasks
学习一个单一代理来完成所有 12 个任务

We have also trained a single agent for all 12 tasks. To reduce the computational burden without loss of generality, the agent is trained by self-imitating from successful trajectories generated from the self-imitation learning pipeline (Section F.3). We summarize the result in Table 4. Similar to our main experiments, all numbers represent percentage success rates averaged over 3 training seeds, each tested for 200 episodes. Compared to the original agents, the new 12-multitask agent sees a performance boost in 6 tasks, degradation in 4 tasks, and roughly the same success rates in 2 tasks. This result suggests that there are both positive and negative task transfers happening. To improve the multi-task performance, more advanced algorithms [141, 133] can be employed to mitigate the negative transfer effects.
我们还为所有 12 个任务训练了一个单一的代理。为了减少计算负担而不损失普适性，该代理通过从自我模仿学习流程（第 F.3 节）生成的成功轨迹进行自我模仿训练。我们在表 4 中总结了结果。与我们的主要实验类似，所有数字表示在 3 个训练种子上平均的成功率百分比，每个种子测试 200 个回合。与原始代理相比，新的 12 任务多任务代理在 6 个任务中表现提升，在 4 个任务中表现下降，在 2 个任务中成功率大致相同。这个结果表明，存在着正向和负向的任务转移。为了提高多任务性能，可以采用更先进的算法[141, 133]来减轻负向转移效应。

We also perform Paired Student’s t-test to statistically compare the performance of the 12-multitask agent and those separately trained on each task group. We obtain a p-value of $0.3720\gg 0.05$ , which suggests that the difference between the two training settings is not statistically significant.
我们还进行了配对的学生 t 检验，以统计比较 12 个多任务代理和分别训练在每个任务组上的代理的性能。我们得到了一个 p 值为 $0.3720\gg 0.05$ ，这表明两种训练设置之间的差异在统计上不显著。

Table 5: We test the open-vocabulary generalization ability to two unseen tasks. All numbers represent percentage success rates averaged over 3 seeds, each tested for 200 episodes.
表 5：我们测试了开放词汇的泛化能力，针对两个未见过的任务。所有数字表示在 3 个种子上平均的成功率百分比，每个种子测试了 200 个情节。

Tasks 任务	Ours (zero-shot) 我们的（零-shot）	Ours (after RL finetune) 我们的（在实际生活中进行了微调）	Baseline (RL from scratch) 基线（从零开始的强化学习）
Hunt Pig	$1.3\pm 0.6$	${\color[rgb]{0.09,0.45,0.27}\mathbf{46.0\pm 15.3}}$	$0.0\pm 0.0$
Harvest Spider String 收获蜘蛛的丝线	$1.6\pm 1.3$	${\color[rgb]{0.09,0.45,0.27}\mathbf{36.5\pm 16.9}}$	$12.5\pm 12.7$

Generalize to Novel Tasks
概括到新任务

To test the ability to generalize to new open-vocabulary commands, we evaluate the agent on two novel tasks: “harvest spider string” and “hunt pig”. Table 5 shows that the agent struggles in the zero-shot setting because it has not interacted with pigs or spider strings during training, and thus does not know how to interact with them effectively. However, the performance improves substantially by finetuning with the MineCLIP reward. Here the baseline methods are trained from scratch using RL with the MineCLIP encoders and reward. Therefore, the only difference is whether the policy has been pre-trained on the 12 tasks or not. Given the same environment sampling budget (only around 5% of total samples), it significantly outperforms baselines. It suggests that the multitask agent has learned transferable knowledge on hunting and resource collection, which enables it to quickly adapt to novel tasks.
为了测试对新的开放词汇命令的泛化能力，我们在两个新任务上评估了代理人：“收集蜘蛛丝”和“猎杀猪”。表 5 显示，在零样本设置下，代理人遇到了困难，因为在训练过程中它没有与猪或蜘蛛丝进行交互，因此不知道如何有效地与它们交互。然而，通过使用 MineCLIP 奖励进行微调，性能显著提高。在这里，基准方法是使用 RL 从头开始训练，使用 MineCLIP 编码器和奖励。因此，唯一的区别是策略是否在 12 个任务上进行了预训练。在相同的环境采样预算下（仅约占总样本的 5%），它明显优于基准方法。这表明多任务代理人已经学到了关于狩猎和资源收集的可转移知识，使其能够快速适应新任务。

6 Related work 相关工作

Open-ended Environments for Decision-making Agents.
开放式环境用于决策代理。

There are many environments developed with the goal of open-ended agent learning. Prior works include maze-style worlds [121, 129, 61], purely text-based game [69], grid worlds [21, 16], browser/GUI-based environments [108, 124], and indoor simulators for robotics [1, 107, 114, 34, 110, 99, 89]. Minecraft offers an exciting alternative for open-ended agent learning. It is a 3D visual world with procedurally generated landscapes and extremely flexible game mechanics that support an enormous variety of activities. Prior methods in open-ended agent learning [30, 57, 130, 63, 26] do not make use of external knowledge, but our approach leverages internet-scale database to learn open-vocabulary reward models, thanks to Minecraft’s abundance of gameplay data online.
有许多环境旨在实现开放式智能体学习的目标。之前的研究包括迷宫式世界[121, 129, 61]，纯文本游戏[69]，网格世界[21, 16]，基于浏览器/GUI 的环境[108, 124]，以及用于机器人的室内模拟器[1, 107, 114, 34, 110, 99, 89]。Minecraft 提供了一个令人兴奋的选择，用于开放式智能体学习。它是一个具有程序生成的景观和极其灵活的游戏机制的 3D 视觉世界，支持各种各样的活动。之前的开放式智能体学习方法[30, 57, 130, 63, 26]没有利用外部知识，但我们的方法利用互联网规模的数据库来学习开放词汇奖励模型，得益于 Minecraft 在线上丰富的游戏数据。

Minecraft for AI Research.
Minecraft 用于 AI 研究。

The Malmo platform [60] is the first comprehensive release of a Gym-style agent API [14] for Minecraft. Based on Malmo, MineRL [48] provides a codebase and human play trajectories for the annual Diamond Challenge at NeurIPS [47, 49, 62]. MineDojo’s simulator builds upon the pioneering work of MineRL, but greatly expands the API and benchmarking task suite. Other Minecraft benchmarks exist with different focuses. For example, CraftAssist [44] and IGLU [66] study agents with interactive dialogues. BASALT [104] applies human evaluation to 4 open-ended tasks. EvoCraft [45] is designed for structure building, and Crafter [50] optimizes for fast experimentation. Unlike prior works, MineDojo’s core mission is to facilitate the development of generally capable embodied agents using internet-scale knowledge. We include a feature comparison table of different Minecraft platforms for AI research in Table A.1.
Malmo 平台[60]是 Minecraft 的第一个基于 Gym 风格的代理 API[14]的全面发布。基于 Malmo，MineRL[48]提供了一个代码库和人类游戏轨迹，用于 NeurIPS[47, 49, 62]的年度 Diamond Challenge。MineDojo 的模拟器建立在 MineRL 的开创性工作基础上，但大大扩展了 API 和基准测试任务套件。其他具有不同重点的 Minecraft 基准测试存在。例如，CraftAssist[44]和 IGLU[66]研究具有交互对话的代理。BASALT[104]将人类评估应用于 4 个开放式任务。EvoCraft[45]专为结构建筑而设计，而 Crafter[50]则优化了快速实验。与以往的工作不同，MineDojo 的核心任务是利用互联网规模的知识促进通用能力的具体化身代理的开发。我们在附录 A.1 中包含了不同 Minecraft 平台用于 AI 研究的功能比较表。

Internet-scale Multimodal Knowledge Bases.
互联网规模的多模态知识库。

Big dataset such as Common Crawl [24], the Pile [37], LAION [100], YouTube-8M [2] and HowTo100M [77] have been fueling the success of large pre-trained language models [27, 91, 15] and multimodal models [118, 6, 78, 145, 7, 4, 136]. While generally useful for learning representations, these datasets are not specifically targeted at embodied agents. To provide agent-centric training data, RoboNet [25] collects video frames from 7 robot platforms, and Ego4D [43] recruits volunteers to record egocentric videos of household activities. In comparison, MineDojo’s knowledge base is constructed without human curation efforts, much larger in volume, more diverse in data modalities, and comprehensively covers all aspects of the Minecraft environment.
大型数据集，如 Common Crawl [24]、Pile [37]、LAION [100]、YouTube-8M [2]和 HowTo100M [77]，推动了大型预训练语言模型 [27, 91, 15]和多模态模型 [118, 6, 78, 145, 7, 4, 136]的成功。虽然这些数据集通常对于学习表示很有用，但它们并不专门针对具体的实体代理。为了提供以代理为中心的训练数据，RoboNet [25]从 7 个机器人平台收集视频帧，而 Ego4D [43]则招募志愿者记录家庭活动的自我中心视频。相比之下，MineDojo 的知识库是在没有人工策划的情况下构建的，体积更大，数据形式更多样，并全面涵盖了 Minecraft 环境的所有方面。

Embodied Agents with Large-scale Pre-training.
具有大规模预训练的具身化代理。

Inspired by the success in NLP, embodied agent research [29, 11, 94, 23] has seen a surge in adoption of the large-scale pre-training paradigm. The recent advances can be roughly divided into 4 categories. 1) Novel agent architecture: Decision Transformer [19, 58, 144] applies the powerful self-attention models to sequential decision making. GATO [95] and Unified-IO [74] learn a single model to solve various decision-making tasks with different control interfaces. VIMA [59] unifies a wide range of robot manipulation tasks with multimodal prompting. 2) Pre-training for better representations: R3M [82] trains a general-purpose visual encoder for robot perception on Ego4D videos [43]. CLIPort [111] leverages the pre-trained CLIP model [92] to enable free-form language instructions for robot manipulation. 3) Pre-training for better policies: AlphaStar [126] achieves champion-level performance on StarCraft by imitating from numerous human demos. SayCan [3] leverages large language models (LMs) to ground value functions in the physical world. [72] and [96] directly reuse pre-trained LMs as policy backbone. VPT [10] is a concurrent work that learns an inverse dynamics model from human contractors to pseudo-label YouTube videos for behavior cloning. VPT is complementary to our approach, and can be finetuned to solve language-conditioned open-ended tasks with our learned reward model. 4) Data-driven reward functions: Concept2Robot [105] and DVD [18] learn a binary classifier to score behaviors from in-the-wild videos [42]. LOReL [81] crowd-sources humans labels to train language-conditioned reward function for offline RL. AVID [113] and XIRL [142] extract reward signals via cycle consistency. MineDojo’s task benchmark and internet knowledge base are generally useful for developing new algorithms in all the above categories. In Sec. 4, we also propose an open-vocabulary, multi-task reward model using MineDojo YouTube videos.
受自然语言处理成功的启发，具身代理研究[29, 11, 94, 23]在大规模预训练范式的采用方面出现了激增。最近的进展可以大致分为 4 个类别。1）新颖的代理架构：决策 Transformer[19, 58, 144]将强大的自注意力模型应用于顺序决策。GATO[95]和 Unified-IO[74]学习一个单一模型来解决具有不同控制接口的各种决策任务。VIMA[59]通过多模态提示统一了各种机器人操作任务。2）用于更好表示的预训练：R3M[82]在 Ego4D 视频[43]上训练了一个通用的视觉编码器，用于机器人感知。CLIPort[111]利用预训练的 CLIP 模型[92]实现了自由形式的语言指令，用于机器人操作。3）用于更好策略的预训练：AlphaStar[126]通过模仿大量人类演示，在星际争霸中实现了冠军级表现。SayCan[3]利用大型语言模型（LMs）将价值函数与物理世界联系起来。[72]和[96]直接重用预训练的 LMs 作为策略骨干。 VPT [10]是一项并发工作，它从人类承包商那里学习逆动力学模型，为行为克隆的 YouTube 视频生成伪标签。VPT 与我们的方法互补，可以通过我们学到的奖励模型进行微调，以解决语言条件下的开放式任务。4）数据驱动的奖励函数：Concept2Robot [105]和 DVD [18]学习二元分类器，从野外视频[42]中评分行为。LOReL [81]通过众包人类标签来训练语言条件下的奖励函数，用于离线强化学习。AVID [113]和 XIRL [142]通过循环一致性提取奖励信号。MineDojo 的任务基准和互联网知识库通常可用于开发上述所有类别的新算法。在第 4 节中，我们还提出了一个使用 MineDojo YouTube 视频的开放词汇、多任务奖励模型。

7 Conclusion 结论

In this work, we introduce the MineDojo framework for developing generally capable embodied agents. MineDojo features a benchmarking suite of thousands of Programmatic and Creative tasks, and an internet-scale multimodal knowledge base of videos, wiki, and forum discussions. As an example of the novel research possibilities enabled by MineDojo, we propose MineCLIP as an effective language-conditioned reward function trained with in-the-wild YouTube videos. MineCLIP achieves strong performance empirically and agrees well with human evaluation results, making it a good automatic metric for Creative tasks. We look forward to seeing how MineDojo empowers the community to make progress on the important challenge of open-ended agent learning.
在这项工作中，我们介绍了用于开发通用能力体现的 MineDojo 框架。MineDojo 具有一个包含数千个编程和创意任务的基准套件，以及一个互联网规模的多模态知识库，包括视频、维基和论坛讨论。作为 MineDojo 所带来的新颖研究可能性的一个例子，我们提出了 MineCLIP 作为一种有效的语言条件奖励函数，通过野外 YouTube 视频进行训练。MineCLIP 在实证上取得了强大的性能，并与人类评估结果相吻合，使其成为创意任务的良好自动度量标准。我们期待看到 MineDojo 如何赋予社区在开放式代理学习的重要挑战上取得进展。

8 Acknowledgement 感谢

We are extremely grateful to Anssi Kanervisto, Shikun Liu, Zhiding Yu, Chaowei Xiao, Weili Nie, Jean Kossaifi, Jonathan Raiman, Neel Kant, Saad Godil, Jaakko Haapasalo, Bryan Catanzaro, John Spitzer, Zhiyuan “Jerry” Lin, Yingqi Zheng, Chen Tessler, Dieter Fox, Oli Wright, Jeff Clune, Jack Parker-Holder, and many other colleagues and friends for their helpful feedback and insightful discussions. We also thank the anonymous reviewers for offering us highly constructive advice and kind encouragement during the review and rebuttal period. NVIDIA provides the necessary computing resource and infrastructure for this project. Guanzhi Wang is supported by the Kortschak fellowship in Computing and Mathematical Sciences at Caltech.
我们非常感谢 Anssi Kanervisto、Shikun Liu、Zhiding Yu、Chaowei Xiao、Weili Nie、Jean Kossaifi、Jonathan Raiman、Neel Kant、Saad Godil、Jaakko Haapasalo、Bryan Catanzaro、John Spitzer、Zhiyuan “Jerry” Lin、Yingqi Zheng、Chen Tessler、Dieter Fox、Oli Wright、Jeff Clune、Jack Parker-Holder 以及其他许多同事和朋友对我们的有益反馈和深入讨论表示感谢。我们还感谢匿名审稿人在审稿和反驳期间给予我们高度建设性的建议和友善的鼓励。NVIDIA 为这个项目提供了必要的计算资源和基础设施。Guanzhi Wang 得到了加州理工学院计算和数学科学领域的 Kortschak 奖学金的支持。

References 参考文献

Abramson et al. [2020] 阿布拉姆森等人[2020] Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, Petko Georgiev, Aurelia Guy, Tim Harley, Felix Hill, Alden Hung, Zachary Kenton, Jessica Landon, Timothy Lillicrap, Kory Mathewson, Soňa Mokrá, Alistair Muldal, Adam Santoro, Nikolay Savinov, Vikrant Varma, Greg Wayne, Duncan Williams, Nathaniel Wong, Chen Yan, and Rui Zhu. Imitating interactive intelligence. arXiv preprint arXiv: Arxiv-2012.05672, 2020.
乔什·阿布拉莫森，阿伦·阿胡贾，伊恩·巴尔，亚瑟·布鲁西，费德里科·卡内瓦莱，玛丽·卡辛，拉奇塔·查帕里亚，斯蒂芬·克拉克，博格丹·达莫克，安德鲁·杜兹尼克，佩特科·乔治耶夫，奥雷利亚·盖伊，蒂姆·哈利，费利克斯·希尔，奥尔登·洪，扎卡里·肯顿，杰西卡·兰登，蒂莫西·利利克拉普，科里·马修森，索娜·莫克拉，阿利斯泰尔·穆尔达尔，亚当·桑托罗，尼古拉·萨维诺夫，维克兰特·瓦尔玛，格雷格·韦恩，邓肯·威廉姆斯，纳撒尼尔·黄，陈燕和朱瑞。模仿交互智能。arXiv 预印本 arXiv: Arxiv-2012.05672，2020 年。
Abu-El-Haija et al. [2016] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv: Arxiv-1609.08675, 2016.
Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, and Mengyuan Yan. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv: Arxiv-2204.01691, 2022.
Akbari et al. [2021] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv: Arxiv-2104.11178, 2021.
Al-Rfou et al. [2016] Rami Al-Rfou, Marc Pickett, Javier Snaider, Yun hsuan Sung, Brian Strope, and Ray Kurzweil. Conversational contextual cues: The case of personalization and history for response ranking. arXiv preprint arXiv: Arxiv-1606.00372, 2016.
Alayrac et al. [2020] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelovic, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/0060ef47b12160b9198302ebdb144dcf-Abstract.html.
Amrani et al. [2021] Elad Amrani, Rami Ben-Ari, Daniel Rotman, and Alex M. Bronstein. Noise estimation using density estimation for self-supervised multimodal learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 6644–6652. AAAI Press, 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16822.
Andreas et al. [2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 166–175. PMLR, 2017. URL http://proceedings.mlr.press/v70/andreas17a.html.
Appalaraju et al. [2021] Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. Docformer: End-to-end transformer for document understanding. arXiv preprint arXiv: Arxiv-2106.11539, 2021.
Baker et al. [2022] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. arXiv preprint arXiv: Arxiv-2206.11795, 2022.
Batra et al. [2020a] Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, and Hao Su. Rearrangement: A challenge for embodied ai. arXiv preprint arXiv: Arxiv-2011.01975, 2020a.
Batra et al. [2020b] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv: Arxiv-2006.13171, 2020b.
Bommasani et al. [2021] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models. arXiv preprint arXiv: Arxiv-2108.07258, 2021.
Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv: Arxiv-1606.01540, 2016.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Cao et al. [2020] Tianshi Cao, Jingkang Wang, Yining Zhang, and Sivabalan Manivasagam. Babyai++: Towards grounded-language learning beyond memorization. arXiv preprint arXiv: Arxiv-2004.07200, 2020.
Carreira and Zisserman [2017] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4724–4733. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.502. URL https://doi.org/10.1109/CVPR.2017.502.
Chen et al. [2021a] Annie S. Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from in-the-wild human videos. In Dylan A. Shell, Marc Toussaint, and M. Ani Hsieh, editors, Robotics: Science and Systems XVII, Virtual Event, July 12-16, 2021, 2021a. doi: 10.15607/RSS.2021.XVII.012. URL https://doi.org/10.15607/RSS.2021.XVII.012.
Chen et al. [2021b] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 15084–15097, 2021b. URL https://proceedings.neurips.cc/paper/2021/hash/7f489f642a0ddb10272b5c31057f0663-Abstract.html.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 2020. URL http://proceedings.mlr.press/v119/chen20j.html.
Chevalier-Boisvert et al. [2019] Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. Babyai: A platform to study the sample efficiency of grounded language learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJeXCo0cYX.
Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. arXiv preprint arXiv: Arxiv-2204.02311, 2022.
Collins et al. [2021] Jack Collins, Shelvin Chand, Anthony Vanderkop, and David Howard. A review of physics simulators for robotic applications. IEEE Access, 9:51416–51431, 2021.
[24] Common Crawl. Common crawl. https://commoncrawl.org/, 2012. Accessed: 2022-06-06.
Dasari et al. [2019] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, 3rd Annual Conference on Robot Learning, CoRL 2019, Osaka, Japan, October 30 - November 1, 2019, Proceedings, volume 100 of Proceedings of Machine Learning Research, pages 885–897. PMLR, 2019. URL http://proceedings.mlr.press/v100/dasari20a.html.
Dennis et al. [2020] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre M. Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/985e9a46e10005356bbaf194249f6856-Abstract.html.
Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: Arxiv-1810.04805, 2018.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv: Arxiv-2010.11929, 2020.
Duan et al. [2022] Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied AI: from simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell., 6(2):230–244, 2022. doi: 10.1109/TETCI.2022.3141105. URL https://doi.org/10.1109/TETCI.2022.3141105.
Ecoffet et al. [2019] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv: Arxiv-1901.10995, 2019.
Edwards et al. [2018] Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, and Charles L. Isbell. Imitating latent policies from observation. arXiv preprint arXiv: Arxiv-1805.07914, 2018.
Falcon and The PyTorch Lightning team [2019] William Falcon and The PyTorch Lightning team. PyTorch Lightning. Github, 3 2019. doi: 10.5281/zenodo.3828935. URL https://github.com/PyTorchLightning/pytorch-lightning.
Fan* et al. [2020] Linxi Fan*, Shyamal Buch*, Guanzhi Wang, Ryan Cao, Yuke Zhu, Juan Carlos Niebles, and Li Fei-Fei. Rubiksnet: Learnable 3d-shift for efficient video action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
Fan et al. [2021] Linxi Fan, Guanzhi Wang, De-An Huang, Zhiding Yu, Li Fei-Fei, Yuke Zhu, and Anima Anandkumar. Secant: Self-expert cloning for zero-shot generalization of visual policies. arXiv preprint arXiv: Arxiv-2106.09678, 2021.
Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv: Arxiv-2004.07219, 2020.
Fuchs et al. [2021] Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, and Peter Dürr. Super-human performance in gran turismo sport using deep reinforcement learning. IEEE Robotics Autom. Lett., 6(3):4257–4264, 2021. doi: 10.1109/LRA.2021.3064284. URL https://doi.org/10.1109/LRA.2021.3064284.
Gao et al. [2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Gao et al. [2021] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv: Arxiv-2110.04544, 2021.
Gebru et al. [2021] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Commun. ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://doi.org/10.1145/3458723.
Gehman et al. [2020] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv: Arxiv-2009.11462, 2020.
Gerblick [2021] Jordan Gerblick. Minecraft, the most-watched game on youtube, passes 1 trillion views, Dec 2021. URL https://www.gamesradar.com/minecraft-the-most-watched-game-on-youtube-passes-1-trillion-views/.
Goyal et al. [2017] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.
Grauman et al. [2021] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4d: Around the world in 3,000 hours of egocentric video. arXiv preprint arXiv: Arxiv-2110.07058, 2021.
Gray et al. [2019] Jonathan Gray, Kavya Srinet, Yacine Jernite, Haonan Yu, Zhuoyuan Chen, Demi Guo, Siddharth Goyal, C. Lawrence Zitnick, and Arthur Szlam. Craftassist: A framework for dialogue-enabled interactive agents. arXiv preprint arXiv: Arxiv-1907.08584, 2019.
Grbic et al. [2021] Djordje Grbic, Rasmus Berg Palm, Elias Najarro, Claire Glanois, and Sebastian Risi. EvoCraft: A New Challenge for Open-Endedness, pages 325–340. Springer International Publishing, 2021. doi: 10.1007/978-3-030-72699-7_21. URL http://link.springer.com/content/pdf/10.1007/978-3-030-72699-7_21.
Gupta et al. [2022] Agrim Gupta, Linxi Fan, Surya Ganguli, and Li Fei-Fei. Metamorph: Learning universal controllers with transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Opmqtk_GvYL.
Guss et al. [2019a] William H. Guss, Cayden Codel, Katja Hofmann, Brandon Houghton, Noboru Kuno, Stephanie Milani, Sharada Mohanty, Diego Perez Liebana, Ruslan Salakhutdinov, Nicholay Topin, Manuela Veloso, and Phillip Wang. The minerl 2019 competition on sample efficient reinforcement learning using human priors. arXiv preprint arXiv: Arxiv-1904.10079, 2019a.
Guss et al. [2019b] William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations. arXiv preprint arXiv: Arxiv-1907.13440, 2019b.
Guss et al. [2021] William H. Guss, Mario Ynocente Castro, Sam Devlin, Brandon Houghton, Noboru Sean Kuno, Crissman Loomis, Stephanie Milani, Sharada Mohanty, Keisuke Nakata, Ruslan Salakhutdinov, John Schulman, Shinya Shiroshita, Nicholay Topin, Avinash Ummadisingu, and Oriol Vinyals. The minerl 2020 competition on sample efficient reinforcement learning using human priors. arXiv preprint arXiv: Arxiv-2101.11071, 2021.
Hafner [2021] Danijar Hafner. Benchmarking the spectrum of agent capabilities. arXiv preprint arXiv: Arxiv-2109.06780, 2021.
Hanu and Unitary team [2020] Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9726–9735. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00975. URL https://doi.org/10.1109/CVPR42600.2020.00975.
He et al. [2021] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv: Arxiv-2111.06377, 2021.
Henderson et al. [2019] Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. A repository of conversational datasets. arXiv preprint arXiv: Arxiv-1904.06472, 2019.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6626–6637, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html.
Huang et al. [2022] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv: Arxiv-2207.05608, 2022.
Huizinga and Clune [2022] Joost Huizinga and Jeff Clune. Evolving multimodal robot behavior via many stepping stones with the combinatorial multiobjective evolutionary algorithm. Evolutionary computation, 30(2):131–164, 2022.
Janner et al. [2021] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 1273–1286, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/099fe6b0b444c23836c4a5d07346082b-Abstract.html.
Jiang et al. [2022] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. arXiv preprint arXiv: Arxiv-2210.03094, 2022.
Johnson et al. [2016] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. IJCAI, 2016. URL https://dl.acm.org/doi/10.5555/3061053.3061259.
Juliani et al. [2019] Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 2684–2691. ijcai.org, 2019. doi: 10.24963/ijcai.2019/373. URL https://doi.org/10.24963/ijcai.2019/373.
Kanervisto et al. [2022] Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang, Weijun Hong, Zhongyue Huang, Haicheng Chen, Guangjun Zeng, Yue Lin, Vincent Micheli, Eloi Alonso, François Fleuret, Alexander Nikulin, Yury Belousov, Oleg Svidchenko, and Aleksei Shpilman. Minerl diamond 2021 competition: Overview, results, and lessons learned. arXiv preprint arXiv: Arxiv-2202.10583, 2022.
Kanitscheider et al. [2021] Ingmar Kanitscheider, Joost Huizinga, David Farhi, William Hebgen Guss, Brandon Houghton, Raul Sampedro, Peter Zhokhov, Bowen Baker, Adrien Ecoffet, Jie Tang, Oleg Klimov, and Jeff Clune. Multi-task curriculum learning in a complex, visual, hard-exploration domain: Minecraft. arXiv preprint arXiv: Arxiv-2106.14876, 2021.
Kay et al. [2017] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv: Arxiv-1705.06950, 2017.
Kim et al. [2019] Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. Abstractive summarization of reddit posts with multi-level memory networks. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2519–2531. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1260. URL https://doi.org/10.18653/v1/n19-1260.
Kiseleva et al. [2021] Julia Kiseleva, Ziming Li, Mohammad Aliannejadi, Shrestha Mohanty, Maartje ter Hoeve, Mikhail Burtsev, Alexey Skrynnik, Artem Zholus, Aleksandr Panov, Kavya Srinet, Arthur Szlam, Yuxuan Sun, Katja Hofmann, Michel Galley, and Ahmed Awadallah. Neurips 2021 competition iglu: Interactive grounded language understanding in a collaborative environment. arXiv preprint arXiv: Arxiv-2110.06536, 2021.
Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv: Arxiv-2205.11916, 2022.
Kolve et al. [2017] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv: Arxiv-1712.05474, 2017.
Küttler et al. [2020] Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The nethack learning environment. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 7671–7684. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/569ff987c643b4bedf504efda8f786c2-Paper.pdf.
[70] Label Studio. Label studio. https://labelstud.io/, 2020. Accessed: 2022-06-06.
Langdon [2005] WB Langdon. Pfeiffer–a distributed open-ended evolutionary system. In AISB, volume 5, pages 7–13. Citeseer, 2005.
Li et al. [2022] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. arXiv preprint arXiv: Arxiv-2202.01771, 2022.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
Lu et al. [2022] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv: Arxiv-2206.08916, 2022.
Luo et al. [2021] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv: Arxiv-2104.08860, 2021.
Micikevicius et al. [2018] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=r1gs9JgRZ.
Miech et al. [2019] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. arXiv preprint arXiv: Arxiv-1906.03327, 2019.
Miech et al. [2020] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9876–9886. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00990. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Miech_End-to-End_Learning_of_Visual_Representations_From_Uncurated_Instructional_Videos_CVPR_2020_paper.html.
[79] Minecraft Wiki. Minecraft wiki. hhttps://minecraft.fandom.com/wiki/Minecraft_Wiki, 2016. Accessed: 2022-06-06.
Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv: Arxiv-1312.5602, 2013.
Nair et al. [2021] Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, and Chelsea Finn. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Aleksandra Faust, David Hsu, and Gerhard Neumann, editors, Conference on Robot Learning, 8-11 November 2021, London, UK, volume 164 of Proceedings of Machine Learning Research, pages 1303–1315. PMLR, 2021. URL https://proceedings.mlr.press/v164/nair22a.html.
Nair et al. [2022] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv: Arxiv-2203.12601, 2022.
Nikishin et al. [2022] Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. arXiv preprint arXiv: Arxiv-2205.07802, 2022.
Oh et al. [2018] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In International Conference on Machine Learning, pages 3878–3887. PMLR, 2018.
OpenAI et al. [2019] OpenAI, :, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv: Arxiv-1912.06680, 2019.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv: Arxiv-1912.01703, 2019.
Perez-Liebana et al. [2019] Diego Perez-Liebana, Katja Hofmann, Sharada Prasanna Mohanty, Noburu Kuno, Andre Kramer, Sam Devlin, Raluca D. Gaina, and Daniel Ionita. The multi-agent reinforcement learning in malmÖ (marlÖ) competition. arXiv preprint arXiv: Arxiv-1901.08129, 2019.
[88] PRAW: The Python Reddit API Wrapper. Praw: The python reddit api wrapper. https://github.com/praw-dev/praw, 2010. Accessed: 2022-06-06.
Puig et al. [2018] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8494–8502. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00886. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Puig_VirtualHome_Simulating_Household_CVPR_2018_paper.html.
Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv: Arxiv-2204.06125, 2022.
Ravichandar et al. [2020] Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems, 3:297–330, 2020.
Reed et al. [2022] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. arXiv preprint arXiv: Arxiv-2205.06175, 2022.
Reid et al. [2022] Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can wikipedia help offline reinforcement learning? arXiv preprint arXiv: Arxiv-2201.12122, 2022.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv: Arxiv-2205.11487, 2022.
Salimans et al. [2016] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html.
Savva et al. [2019] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1506.02438.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv: Arxiv-1707.06347, 2017.
[103] Selenium WebDriver. Selenium webdriver. https://www.selenium.dev/, 2011. Accessed: 2022-06-06.
Shah et al. [2021] Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, and Anca Dragan. The minerl basalt competition on learning from human feedback. arXiv preprint arXiv: Arxiv-2107.01969, 2021.
Shao et al. [2021] Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021.
Shazeer [2020] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv: Arxiv-2002.05202, 2020.
Shen et al. [2020] Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Claudia Pérez-D’Arpino, Shyamal Buch, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Josiah Wong, Li Fei-Fei, and Silvio Savarese. igibson 1.0: a simulation environment for interactive tasks in large realistic scenes. arXiv preprint arXiv: Arxiv-2012.02924, 2020.
Shi et al. [2017] Tianlin Tim Shi, Andrej Karpathy, Linxi Jim Fan, Jonathan Hernandez, and Percy Liang. World of bits: an open-domain platform for web-based agents. ICML, 2017. URL https://dl.acm.org/doi/10.5555/3305890.3306005.
Shoeybi et al. [2019] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
Shridhar et al. [2020] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 10737–10746. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.01075. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Shridhar_ALFRED_A_Benchmark_for_Interpreting_Grounded_Instructions_for_Everyday_Tasks_CVPR_2020_paper.html.
Shridhar et al. [2021] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. arXiv preprint arXiv: Arxiv-2109.12098, 2021.
Silver et al. [2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv: Arxiv-1712.01815, 2017.
Smith et al. [2019] Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, and Sergey Levine. Avid: Learning multi-stage tasks via pixel-level translation of human videos. arXiv preprint arXiv: Arxiv-1912.04443, 2019.
Srivastava et al. [2021] Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, Kent Elliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, C. Karen Liu, Silvio Savarese, Hyowon Gweon, Jiajun Wu, and Li Fei-Fei. BEHAVIOR: benchmark for everyday household activities in virtual, interactive, and ecological environments. In Aleksandra Faust, David Hsu, and Gerhard Neumann, editors, Conference on Robot Learning, 8-11 November 2021, London, UK, volume 164 of Proceedings of Machine Learning Research, pages 477–490. PMLR, 2021. URL https://proceedings.mlr.press/v164/srivastava22a.html.
Stadie et al. [2017] Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv: Arxiv-1703.01703, 2017.
Standish [2003] Russell K Standish. Open-ended artificial evolution. International Journal of Computational Intelligence and Applications, 3(02):167–175, 2003.
Stanley et al. [2017] Kenneth O Stanley, Joel Lehman, and Lisa Soros. Open-endedness: The last grand challenge you’ve never heard of. O’Reilly Online,, 2017.
Sun et al. [2019] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv: Arxiv-1906.05743, 2019.
Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite. arXiv preprint arXiv: Arxiv-1801.00690, 2018.
Taylor et al. [2016] Tim Taylor, Mark Bedau, Alastair Channon, David Ackley, Wolfgang Banzhaf, Guillaume Beslon, Emily Dolson, Tom Froese, Simon Hickinbotham, Takashi Ikegami, et al. Open-ended evolution: Perspectives from the oee workshop in york. Artificial life, 22(3):408–423, 2016.
Team et al. [2021] Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard, and Wojciech Marian Czarnecki. Open-ended learning leads to generally capable agents. arXiv preprint arXiv: Arxiv-2107.12808, 2021.
Torabi et al. [2018] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv: Arxiv-1805.01954, 2018.
Torabi et al. [2019] Faraz Torabi, Garrett Warnell, and Peter Stone. Recent advances in imitation learning from observation. arXiv preprint arXiv: Arxiv-1905.13566, 2019.
Toyama et al. [2021] Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: A reinforcement learning platform for android. arXiv preprint arXiv: Arxiv-2105.13231, 2021.
van den Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv: Arxiv-1807.03748, 2018.
Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. Alphastar: Mastering the real-time strategy game starcraft ii. DeepMind blog, 2, 2019.
Vølske et al. [2017] Michael Vølske, Martin Potthast, Shahbaz Syed, and Benno Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, Copenhagen, Denmark, sep 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. URL https://aclanthology.org/W17-4508.
Wang [2022] Phil Wang. x-transformers. Github, 2022. URL https://github.com/lucidrains/x-transformers.
Wang et al. [2019] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv: Arxiv-1901.01753, 2019.
Wang et al. [2020] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, and Kenneth O. Stanley. Enhanced poet: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. arXiv preprint arXiv: Arxiv-2003.08536, 2020.
Wikipedia contributors [2022] Wikipedia contributors. Minecraft — Wikipedia, the free encyclopedia, 2022. URL https://en.wikipedia.org/w/index.php?title=Minecraft&oldid=1092238294. [Online; accessed 9-June-2022].
Wolf et al. [2019] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv: Arxiv-1910.03771, 2019.
Wu et al. [2020] Sen Wu, Hongyang R. Zhang, and Christopher Ré. Understanding and improving information transfer in multi-task learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SylzhkBtDB.
Xia et al. [2019] Fei Xia, William B. Shen, Chengshu Li, Priya Kasimbeg, Micael Tchapmi, Alexander Toshev, Li Fei-Fei, Roberto Martín-Martín, and Silvio Savarese. Interactive gibson benchmark (igibson 0.5): A benchmark for interactive navigation in cluttered environments. arXiv preprint arXiv: Arxiv-1910.14442, 2019.
Xie et al. [2018] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European conference on computer vision (ECCV), pages 305–321, 2018.
Xu et al. [2021] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv: Arxiv-2109.14084, 2021.
Xu et al. [2020] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv: Arxiv-2012.14740, 2020.
Xu et al. [2019] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. Layoutlm: Pre-training of text and layout for document image understanding. arXiv preprint arXiv: Arxiv-1912.13318, 2019.
Yang et al. [2018] Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia, jul 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-3022. URL https://aclanthology.org/W18-3022.
[140] YouTube Data API. Youtube data api. https://developers.google.com/youtube/v3/, 2012. Accessed: 2022-06-06.
Yu et al. [2020] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv: Arxiv-2001.06782, 2020.
Zakka et al. [2021] Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeannette Bohg, and Debidatta Dwibedi. Xirl: Cross-embodiment inverse reinforcement learning. arXiv preprint arXiv: Arxiv-2106.03911, 2021.
Zeng et al. [2022] Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv: Arxiv-2204.00598, 2022.
Zheng et al. [2022] Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. arXiv preprint arXiv: Arxiv-2202.05607, 2022.
Zhu and Yang [2020] Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. arXiv preprint arXiv: Arxiv-2011.07231, 2020.
Zhu et al. [2020] Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Martín-Martín. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv: Arxiv-2009.12293, 2020.

Appendix A Minecraft Framework Comparison

Table A.1: Comparison table of different Minecraft platforms for AI research.

Environment	Simulator		Task Suite		Knowledge Base
Environment	Features	Real Minecraft	Number of Tasks	Language- grounded	Features	Data Scale
MineDojo	Unified observation and action space; unlocks all three types of world (the Overworld, the Nether, and the End)	✓	$3,000+$	✓	Automatically scraped from the Internet; multimodal data (videos, images, texts, tables and diagrams)	740K YouTube videos; 7K Wiki pages; 350K Reddit posts
MineRL v0.4 [48]	Built on top of Malmo; actively maintained	✓	11		Annotated state-action pairs of human demonstrations	60M frames of recorded human player data
MineRL v1.0 (VPT) [10]	Mouse and keyboard control	✓	5		Labeled contractor data; unlabeled videos scraped from the Internet	2K hours of contractor data; 270K hours of unlabeled videos
MarLÖ [87]	Cooperative and competitive multiagent tasks; parameterizable environments	✓	14
Malmo [60]	First comprehensive release of a Gym-style agent API for Minecraft	✓	N/A
CraftAssist [44]	Bot assistant; dialogue interactions	✓	N/A	✓	Interactive dialogues; crowd-sourced house building dataset	800K dialogue-action dictionary pairs; 2.6K houses with atomic building actions
IGLU [66]	Interactive dialogues with humans; aimed at building structures described by natural language		157	✓
EvoCraft [45]	Aimed at generating creative artifacts; allows for direction manipulation of blocks		N/A
Crafter [50]	2D clone of Minecraft; fast experimentation		22		Human experts dataset	100 episodes

Appendix B MineDojo Simulator

We design unified observation and action spaces across all tasks to facilitate the development of multi-tasking and continually learning agents that can adapt to novel tasks and scenarios. The codebase is open sourced at github.com/MineDojo/MineDojo.

B.1 Observation Space

Our observation space contains multiple modalities. The agent perceives the world mainly through the RGB screen. To provide the same information as human players receive, we also supplement the agent with observations about its inventory, location, health, surrounding blocks, etc. The full observation space is shown below. We refer readers to see our code documentation for technical details such as data type for each observable item.

We also support a LIDAR sensor that returns the groundtruth type of the blocks that the agent sees, however this is considered privileged information and does not go into the benchmarking specification. However, it is still useful for hand engineering the dense reward function, which we use in our experiments (Sec. 5). Amounts and directions of LIDAR rays can be arbitrarily configured at the cost of a lower simulation throughput.

Modality	Shape	Description
RGB	(3, H, W)	Ego-centric RGB frames.
Equipment	(6,)	Names, quantities, variants, and durabilities of equipped items.
Inventory	(36,)	Names, quantities, variants, and durabilities of inventory items.
Voxel	(3, 3, 3)	Names, variants, and properties of $3\times 3\times 3$ surrounding blocks.
Life statistics	(1,)	Agent’s health, oxygen, food saturation, etc.
GPS	(3,)	GPS location of the agent.
Compass	(2,)	Yaw and pitch of the agent.
Nearby tools	(2,)	Indicate if crafting table and furnace are nearby the agent.
Damage source	(1,)	Information about the damage on the agent.
Lidar	(Num rays,)	Ground-truth lidar observation.

B.2 Action Space

We design a compound action space. At each step the agent chooses one movement action (forward, backward, camera actions, etc.) and one optional functional action as listed in the table below. Some functional actions such as craft take one argument, while others like attack does not take any argument. This compound action space can be modelled in an autoregressive manner [126]. We refer readers to our code documentation for example usages of our action space.

Name	Description	Argument
no_op	Do nothing.	$\varnothing$
use	Use the item held in the main hand.	$\varnothing$
drop	Drop the item held in the main hand.	$\varnothing$
attack	Attack with barehand or tool held in the main hand.	$\varnothing$
craft	Execute a crafting recipe to obtain new items.	Index of recipe
equip	Equip an inventory item.	Slot index of the item
place	Place an inventory item on the ground.	Slot index of the item
destroy	Destroy an inventory item.	Slot index of the item

B.3 Customizing the Environment

Environments in MineCLIP simulator can be easily and flexibly customized. Through our simulator API, users can control terrain, weather, day-night condition (different lighting), the spawn rate and range of specified entities and materials, etc. We support a wide range of terrains, such as desert, jungle, taiga, and iced plain, and special in-game structures, such as ocean monument, desert temple, and End city. Please visit our website for video demonstrations.

Appendix C MineDojo Task Suite

In this section, we explain how we collect the Programmatic (Sec. 2.1) and Creative tasks (Sec. 2.2).

⬇

1survival_sword_food:

2 category: survival

3 prompt: survive as long as possible given a sword and some food

5harvest_wool_with_shears_and_sheep:

6 category: harvest

7 prompt: harvest wool from a sheep with shears and a sheep nearby

9techtree_from_barehand_to_wooden_sword:

10 category: tech-tree

11 prompt: find material and craft a wooden sword

13combat_zombie_pigman_nether_diamond_armors_diamond_sword_shield:

14 category: combat

15 prompt: combat a zombie pigman in nether with a diamond sword,

16 shield, and a full suite of diamond armors

Figure A.1: Example specifications.

C.1 Programmatic Tasks

Programmatic tasks are constructed by filling manually written templates for four categories of tasks, namely “Survival”, “Harvest”, “Tech Tree”, and “Combat”. The task specifications are included in our codebase. Please refer to Fig. A.1 for a few samples. We briefly explain each task category:

Survival.

This task group tests the ability to stay alive in the game. It is nontrivial to survive in Minecraft, because the agent grows hungry as time passes and the health bar drops gradually. Hostile mobs like zombie and skeleton spawn at night, which are very dangerous if the agent does not have the appropriate armor to protect itself or weapons to fight back. We provide two tasks with different levels of difficulty for Survival. One is to start from scratch without any assistance. The other is to start with initial weapons and food.

Harvest.

This task group tests the agent’s ability to collect useful resources such as minerals (iron, diamond, obsidian), food (beef, pumpkin, carrots, milk), and other useful items (wool, oak wood, coal). We construct these tasks by enumerating the Cartesian product between target items to collect, initial inventory, and world conditions (terrain, weather, lighting, etc.) so that they cover a spectrum of difficulty. For instance, if the task is to harvest wool, then it is relatively easy if the agent has a shear in its initial inventory with a sheep nearby, but more difficult if the agent has to craft the shear from raw material and explore extensively to find a sheep. We filter out combinations that are impossible (such as farming certain plants in the desert) from the Cartesian product.

Tech Tree.

Minecraft includes several levels of tools and armors with different properties and difficulties to unlock. To progress to a higher level of tools and armors, the agent needs to develop systematic and compositional skills to navigate the technology tree (e.g. wood $\rightarrow$ stone $\rightarrow$ iron $\rightarrow$ diamond). In this task group, the agent is asked to craft and use a hierarchy of tools starting from a less advanced level. For example, some task asks the agent to craft a wooden sword from bare hand. Another task may ask the agent to craft a gold helmet. An agent that can successfully complete these tasks should have the ability to transfer similar exploration strategies to different tech levels.

Combat.

We test agent’s reflex and martial skills to fight against various monsters and creatures. Similar to how we develop the Harvest task group, we generate these tasks by enumerating the Cartesian product between the target entity to combat with, initial inventory, and world conditions to cover a spectrum of difficulty.

C.2 Creative Tasks

We construct Creative tasks using three approaches: 1) manual brainstorming, 2) mining from YouTube tutorial videos, and 3) generate by querying GPT-3 API. We elaborate the second and third approaches below.

Task Mining from YouTube Tutorial Videos.

Our YouTube dataset serves the dual purpose of a rich task source, as many human players demonstrate and narrate creative missions in the tutorial playlists. To collect high-quality tasks and accompanying videos, we design a 3-stage pipeline that makes it easy to find and annotate interesting tasks.

Stage 1:

We search for YouTube playlists with the key phrases, “Minecraft Tutorial” and “Minecraft Guide”. Then we apply heuristic rules (see Sec. D.1) to filter out low-quality videos;
Stage 2:

We only show the title of the video to a human annotator through a command-line interface, who makes a binary decision to accept or reject it as a potential task. This step is typically very fast, taking a few seconds on average;
Stage 3:

For the accepted tasks in stage 2, we design a labeling UI using Label Studio [70] that displays the full video and YouTube description. A human annotator can choose to reject the video, adjust the timestamps, select the title, or refine the description to be the task goal (Fig. A.2). Through this pipeline, we extract 1,042 task ideas from the common wisdom of a huge number of veteran Minecraft gamers. Some examples are “make an automated mining machine” and “grow cactus up to the sky”.

C.3 GPT-3 Guidance

We leverage OpenAI’s GPT-3-davinci API to automatically generate detailed guidance for a subset of the tasks. Inspired by [67], we adopt the following template to prompt GPT-3: How to {task goal} in Minecraft? Let’s think step by step. Here are some examples:

The guidance for the task “find material and craft a gold pickaxe” is 1) Find a place with a lot of trees; 2) Cut down the trees and gather the wood; 3) Find a place with a lot of stone; 4) Mine the stone and gather the cobblestone; 5) Find a place with a lot of iron; 6) Mine the iron and gather the iron ingots; 7) Find a place with a lot of gold; 8) Mine the gold and gather the gold ingots; 9) Craft a gold pickaxe.

The guidance for the task “sail on boat with a sheep” is 1) Find a boat; 2) Place the sheep in the boat; 3) Right-click on the boat with an empty hand to get in; 4) Use the WASD keys to move the boat. The sheep should stay in the boat.

C.4 Playthrough: Defeat the Ender Dragon

Our benchmarking suite includes a special task called “Playthrough”. The agent is initialized bare-handed in a freshly created world and aims to defeat the Ender dragon, which is considered the final boss of Minecraft. This task holds a unique position in our benchmark because killing the dragon means “beating the game” in the traditional sense of the phrase, and is considered the most significant achievement for a new player. This boss is optional and plenty of people choose to skip it without affecting their open-ended game experience.

“Playthrough” is technically a programmatic task, because we can check the simulator state for the defeat of the Ender dragon. However, we decide to create its own category due to the uniqueness as well as the sheer difficulty of the task. The mission requires lots of preparation, exploration, agility, and trial-and-error, which may take a regular human player many days to complete. It would be extremely long horizon (hundreds of thousands of steps) and difficult for an agent to tackle. We consider this one of the moonshot goals in MineDojo.

Appendix D Internet-Scale Database

We upload our databases to zenodo.org, which is an open repository platform operated by CERN. The data DOIs, URLs, and licenses are listed below. In this section, we describe our database properties and data collection process in details.

Database	DOI	License
YouTube	10.5281/zenodo.6641142	Creative Commons Attribution 4.0 International (CC BY 4.0)
Wiki	10.5281/zenodo.6640448	Creative Commons Attribution Non Commercial Share Alike 3.0 Unported
Reddit	10.5281/zenodo.6641114	Creative Commons Attribution 4.0 International (CC BY 4.0)

D.1 YouTube Videos and Transcripts

Minecraft is among the most streamed games on YouTube [41]. Human players have demonstrated a stunning range of creative activities and sophisticated missions that take hours to complete. We collect 33 years worth of video and 2.2B words in the accompanying English transcripts. The distribution of video duration is shown in Fig. A.3. The time-aligned transcripts enable the agent to ground free-form natural language in video pixels and learn the semantics of diverse activities without laborious human labeling.

We use the official YouTube Data API [140] to collect our database, following the procedure below:

a)

Search for channels that contain Minecraft videos using a list of keywords, e.g., “Minecraft”, “Minecraft Guide”, “Minecraft Walkthrough”, “Minecraft Beginner”. We do not directly search for videos at this step because there is a limit of total results returned by the API;
b)

Search for all the video IDs uploaded by each channel that we obtain at the previous step. There are many false positives at this step because some channels (like gaming news channel) may cover a range of topics other than Minecraft;
c)

To remove the false positives, we rely on the video category chosen by the user when the video was uploaded and filter out all the videos that do not belong to the Minecraft category;
d)

To curate a language-grounded dataset, we favor videos that have English transcripts, which can be manually uploaded by the user, automatically transcribed from audio, or automatically translated from another language by the YouTube engine. For each video, we filter it out if 1) the view count is less than 100; or 2) the aspect ratio is less 1; or 3) the duration is less than 1 minute long; or 4) marked as age-restricted.
e)

To further clean the dataset and remove potentially harmful contents, we employ the Detoxify [51] tool to process each video title and description. Detoxify is trained on Wikipedia comments to predict multiple types of toxicity like threats, obscenity, insults, and identity-based hate speech. We delete a video if the toxicity probability in any category is above 0.5.

We release all the video IDs along with metadata such as video titles, view counts, like counts, duration, and FPS. In line with prior practice [64], we do not release the actual MP4 files and transcripts due to legal concerns.

D.2 Minecraft Wiki

The Wiki pages cover almost every aspect of the game mechanics, and supply a rich source of unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials (example screenshots in Fig. A.4 and Fig. A.5). We use Selenium [103] to scrape 6,735 pages that interleave text, images, tables, and diagrams. We elaborate the details of each web element scraped by Selenium:

a)

Screenshot. Using Selenium’s built-in function, we take a full screenshot of the rendered Wiki page in order to preserve the human-readable visual formatting. We also record the bounding boxes of each salient web element on the page.
b)

Text. We hand-select several HTML tags that likely contain meaningful text data, such as p, h1, h2, ul, dl.
c)

Images and Animations. We download the raw source file of each image element (JPG, PNG, GIF, etc.), as well as the corresponding caption if available. There are also animation effects enabled by JavaScript on the Wiki. We save all image frames in the animation.
d)

Sprites. Sprite elements are micro-sized image icons that are typically embedded in text to create multimodal tutorials and explanations. We save all the sprites and locate their bounding boxes within the text too.
e)

Tables. We save the text content and bounding box of each cell that a table element contains. We store the header cells separately as they carry the semantic meaning of each column. A table can be easily reconstructed with the stored text strings and bounding boxes.

D.3 Reddit

There are more than 1M subreddits (i.e., Reddit topics) where people can discuss a wide range of themes and subjects. Prior works use Reddit data for conversational response selection [5, 139, 54] and abstractive summarization [127, 65]. The r/Minecraft subreddit contains free-form discussions of game strategies and images/videos showcases of Minecraft builds and creations (examples in Fig. A.7). The distribution of post types is shown in Fig. A.6.

To scrape the Reddit contents, we use PRAW [88], a Python wrapper on top of the official Reddit API. Our procedure is as follows:

a)

Obtain the ID and metadata (e.g. post title, number of comments, content, score) of every post in the “r/Minecraft” subreddit since it was created. For quality control, we only consider posts with scores (upvotes) $\geq 5$ and not marked as NSFW.
b)

Determine each post’s type. There are 4 native post types - text, image/video, link, and poll. We group text and poll posts together as text posts, and store their body text. For image/video and link posts, we store the source file URLs on external media hosting sites like Imgur and Gfycat. Based on the URL of each link post, we classify it as an image post, a video post or a general link post.
c)

Scrape the comments and store the parent ID of each comment so that we can reconstruct the threaded discussion.
d)

Similarly to our YouTube database (Sec. D.1), we run Detoxify [51] on the scraped Reddit contents to filter out potentially toxic and harmful posts.

We release all post IDs and their corresponding metadata. We also provide a Python function based on PRAW for researchers to download the post contents after obtaining a license key for the official Reddit API.

Appendix E MineCLIP Algorithm Details

We implement all our neural networks in PyTorch v1.11 [86]. Training MineCLIP uses the PyTorch-Lightning framework [32], pre-trained models hosted on HuggingFace [132], and the x-transformers library for Transformer variants [128].

E.1 Video-Text Pair Extraction

Similar to VideoCLIP [136], we sample 640K pairs of 16-second video snippets and time-aligned English transcripts by the following procedure:

1)

Collect a list of keywords corresponding to the supported entities, blocks, and items in Minecraft;
2)

Perform string matching over our YouTube video transcripts to obtain 640K text segments;
3)

For each matched transcript segment, randomly grow it to 16 $\sim$ 77 tokens (limited by CLIP’s context length);
4)

Randomly sample a timestamp within the start and end time of the matched transcript as the center for the video clip;
5)

Randomly grow the video clip from the center timestamp to 8 $\sim$ 16 seconds.

E.2 Architecture

MineCLIP architecture is composed of three parts:

Frame-wise image encoder $\phi_{I}$

We use the ViT-B/16 architecture [28] to compute a 512-D embedding for each RGB frame. We initialize the weights from OpenAI CLIP’s public checkpoint [92] and only finetune the last two layers during training. The input resolution is $160\times 256$ , which is different from CLIP’s default $224\times 224$ resolution. We adapt the positional embeddings via bicubic interpolation, which does not introduce any new learnable parameters.

Temporal aggregator $\phi_{a}$

Given a sequence of frame-wise RGB features, a temporal aggregator network summarizes the sequence into one video embedding. After the aggregator, we insert two extra layers of residual CLIP Adapter [38]. The residual weight is initialized such that it is very close to an identity function at the beginning of training. We consider two variants of $\phi_{a}$ :

1.

Average pooling (MineCLIP[avg]): a simple, parameter-free operator. It is fast to execute but loses the temporal information, because average pooling is permutation-invariant.
2.

Self-Attention (MineCLIP[attn]): a 2-layer transformer encoder with 512 embedding size, 8 attention heads, and Gated Linear Unit variant with Swish activation [106, 22]. The transformer sequence encoder is relatively slower, but captures more temporal information and achieves better performance in our experiments (Table 1).

Table A.2: Training hyperparameters for MineCLIP.

Hyperparameter	Value
LR schedule	Cosine with warmup [73]
Warmup steps	500
Peak LR	1.5e-4
Final LR	1e-5
Weight decay	0.2
Layerwise LR decay	0.65
Pre-trained layers LR multiplier	$0.5\times$
Batch size per GPU	64
Parallel GPUs	8
Video resolution	$160\times 256$
Number of frames	16
Image encoder	ViT-B/16 [28]

Text encoder $\phi_{G}$

We use a 12-layer 512-wide GPT model with 8 attention heads [90, 91]. The input string is converted to lower-case byte pair encoding with a 49,152 vocabulary size, and capped at 77 tokens. We exactly follow the text encoder settings in CLIP and initialize the weights from their public checkpoint. Only the last two layers of $\phi_{G}$ is finetuned during training.

E.3 Training

We train MineCLIP on the 640K video-text pairs for 2 epochs. We sample 16 RGB frames from each video uniformly, and apply temporally-consistent random resized crop [17, 33] as data augmentation. We use Cosine learning rate annealing with 500 gradient steps of warming up [73]. We apply a lower learning rate ( $\times 0.5$ ) on the pre-trained weights and layer-wise learning rate decay for better finetuning [53]. Training is performed on 1 node of $8\times$ V100 GPUs with FP16 mixed precision [76] via the PyTorch native amp module. All hyperparameters are listed in Table A.2.

Appendix F Policy Learning Details

Input: policy

\pi_{\theta}

, value function

VF(\cdot)

, SI buffer threshold

\Delta

, SI frequency

\omega

1 Initialize empty SI buffers for all tasks

\mathbf{D}_{SI}\leftarrow\{\emptyset,\forall T\in\text{training tasks}\}

;

2 Initialize a counter for simulator steps

counter\leftarrow 0

;

3 while not done do

4 Collect set of trajectories for all tasks

\{\tau_{T},\forall T\in\text{training tasks}\}

by running policy

\pi_{\theta}

in (parallel) environments;

5 forall $\mathcal{D}_{SI,T}$ do

6 if $\tau_{T}$ is successful then

\mathcal{D}_{SI,T}\leftarrow\mathcal{D}_{SI,T}\cup\tau_{T}

8 else if $\tau_{T}$ ’s episode return $\geq\mu_{\text{return}}(\mathcal{D}_{SI,T})+\Delta\times\sigma_{\text{return}}(\mathcal{D}_{SI,T})$ then

\mathcal{D}_{SI,T}\leftarrow\mathcal{D}_{SI,T}\cup\tau_{T}

11 end forall

12 Increase

counter

accordingly;

13 Update

\pi_{\theta}

following Equation 2;

14 Fit

VF(\cdot)

by regression on mean-squared error;

15 if $\mathbbm{1}(counter\bmod\omega=0)$ then

16 Determine the number of trajectories to sample from each buffer

\#_{\text{sample}}=\min(\{|\mathcal{D}_{SI,T}|,\forall T\in\text{training tasks}\})

;

17 Sample

\#_{\text{sample}}

trajectories from each buffer in a prioritized manner to construct

\mathcal{D}_{SI}

;

18 Update

\pi_{\theta}

\mathcal{D}_{SI}

with supervised objective;

21 end while

Algorithm 1 PPO-SI Interleaved Training

In this section, we elaborate how a trained MineCLIP can be adapted as a reward function with two different formulations. We then discuss the algorithm for policy learning. Finally, we demonstrate how we combine self imitation learning and on-policy learning to further improve sample efficiency.

F.1 Adapt MineCLIP as Reward Function

We investigate two ways to convert MineCLIP output to scalar reward, dubbed Direct and Delta. The ablation results for Animal-Zoo task group are presented in Table A.3.

Direct.

For a task $T$ with the goal description $G$ , MineCLIP outputs the probability $P_{G}$ that the observation video semantically corresponds to $G$ , against a set of negative goal descriptions $\mathcal{G}^{-}$ . Note that we omit timestep subscript for simplicity. As an example, for the task “shear sheep”, $G$ is “shear a sheep” and $\mathcal{G}^{-}$ may include negative prompts like “milk a cow”, “hunt a sheep”, “hunt a cow”, etc. To compute the Direct reward, we further process the raw probability using the formula $r=\max(P_{G}-\frac{1}{N_{T}},0)$ where $N_{T}$ is the number of prompts passed to MineCLIP. $\frac{1}{N_{T}}$ is the baseline probability of randomly guessing which text string corresponds to the video. We threshold $r$ at zero to avoid highly uncertain probability estimates below the random baseline. We call the variant without the post-processing Direct-Naive: $r=P_{G}$ as the reward signal for every time step.

Delta.

The Direct formulation yields strong performance when the task is concerned with moving creatures, e.g. farm animals and monsters that run around constantly. However, we discover that Direct is suboptimal if the task deals with static objects, e.g., “find a nether portal”. Simply using the raw probability from MineCLIP as reward can cause the learned agent to stare at the object of interest but fail to move closer and interact. Therefore, we propose to use an alternative formulation, Delta, to remedy this issue. Concretely, the reward value at timestep $t$ becomes $r_{t}=P_{G,t}-P_{G,t-1}$ . We empirically validate that this formulation provides better shaped reward for the task group with static entities.

F.2 Policy Network Architecture

Our policy architecture consists of three parts: an input feature encoder, a policy head, and a value function. To handle multimodal observations (Sec. B.1), the feature extractor contains several modality-specific components:

•

RGB frame: we use the frozen frame-wise image encoder $\phi_{I}$ in MineCLIP to optimize for compute efficiency and provide the agent with good visual representations from the beginning (Sec. 4.2).
•

Task goal: $\phi_{G}$ computes the text embedding of the natural language task goal.
•

Yaw and Pitch: compute $\sin(\cdot)$ and $\cos(\cdot)$ features respectively, then pass through an MLP.
•

GPS: normalize and featurize via MLP.
•

Voxel: to process the $3\times 3\times 3$ surrounding voxels, we embed discrete block names to dense vectors, flatten them, and pass through an MLP.
•

Past action: our agent is conditioned on its immediate past action, which is embedded and featurized by MLP.

Features from all modalities are concatenated, passed through another fusion MLP, and finally fed into the policy head and value function head. We use an MLP to model the policy head that maps from the input feature vectors to the action probability distribution. We use another MLP to estimate the value function, conditioned on the same input features.

F.3 RL Training

PPO.

We use the popular PPO algorithm [102] (Proximal Policy Optimization) as our RL training backbone. PPO is an on-policy method that optimizes for a surrogate objective while ensuring that the deviation from the previous policy is relatively small. PPO updates the policy network by

\underset{\theta}{\text{maximize}}\;\mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}L(s,a,\theta_{\text{old}},\theta),

(1)

where

L(s,a,\theta_{\text{old}},\theta)=\min\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}A^{\pi_{\theta_{\text{old}}}}(s,a),\text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\text{old}}}(a|s)},1-\epsilon,1+\epsilon\right)A^{\pi_{\theta_{\text{old}}}}(s,a)\right).

(2)

$A$ is an estimator of the advantage function (GAE [101] in our case) and $\epsilon$ is a hyperparameter that controls the deviation between the new policy and the old one.

Self Imitation Learning.

We apply self-imitation learning [84] (SI) to further improve sample efficiency because computing the reward with MineCLIP in the loop makes the training more expensive. Self-imitation learning is essentially supervised learning on a buffer $\mathcal{D}_{SI}$ of good trajectories generated by the agent’s past self. In our case, the trajectories are generated by the behavior policy during PPO rollouts, and only added to $\mathcal{D}_{SI}$ if it is a successful trial or if the episodic return exceeds a certain threshold. Self imitation optimizes $\pi_{\theta}$ for the objective $\mathcal{J}_{SI}=\mathbb{E}_{s,a\sim\mathcal{D}_{SI}}\log\pi_{\theta}(a|s)$ with respect to $\theta$ .

We alternate between the PPO phase and the SI phase. A pseudocode of our interleaved training procedure is given in Algorithm 1. We use a prioritized strategy to sample trajectories from the buffer $\mathcal{D}_{SI}$ . Specifically, we assign equal probability to all successful trajectories. Unsuccessful trajectories can still be sampled but with lower probabilities proportional to their episodic returns.

In Fig. A.8, we demonstrate that adding self-imitation dramatically improves the stability, performance, and sample efficiency of RL training in MineDojo.

Appendix G Experiment Details

G.1 Task Details

We experiment with three task groups with four tasks per group. We train one multi-task agent for each group. In this section, we describe each task goals, initial setup, and the manual dense-shaping reward function.

Animal Zoo:

4 Programmatic tasks on hunting or harvesting resource from animals. We spawn various animal types (pig, sheep, and cow) in the same environment to serve as distractors. It is considered a failure if the agent does not take action on the correct animal specified by the prompt.

•

Milk Cow: find and approach a cow, then obtain milk from it with an empty bucket. The prompt is milk a cow. We initialize the agent with an empty bucket to collect milk. We also spawn sheep, cow, and pig nearby the agent. The manual dense reward shaping is a navigation reward based on geodesic distance obtained from privileged LIDAR. The combined reward passed to PPO can be formulated as $r_{t}=\lambda_{\text{nav}}\max(d_{\text{min},t-1}-d_{\text{min},t},0)+\lambda_{\text{success}}\mathbbm{1}(\text{milk collected})$ , where $\lambda_{\text{nav}}=10$ and $\lambda_{\text{success}}=200$ . $d_{\text{min},t}=\min(d_{\text{min}},d_{t})$ where $d_{\text{min}}$ denotes the minimal distance to the cow that the agent has achieved so far in the episode history.
•

Hunt Cow: find and approach a cow, then hunt with a sword. The cow will run away so the agent needs to chase after it. The prompt is hunt a cow. We initialize the agent with a diamond sword. The manual dense reward shaping consists of two parts, a valid attack reward and a navigation reward based on geodesic distance obtained from privileged LIDAR. Mathematically, the reward is $r_{t}=\lambda_{\text{attack}}\mathbbm{1}(\text{valid attack})+\lambda_{\text{nav}}\max(d_{\text{min},t-1}-d_{\text{min},t},0)+\lambda_{\text{success}}\mathbbm{1}(\text{cow hunted})$ , where $\lambda_{\text{attack}}=5$ , $\lambda_{\text{nav}}=1$ , and $\lambda_{\text{success}}=200$ . We additionally reset $d_{\text{min}}$ every time the agent hits the cow to encourage the chasing behavior.
•

Shear Sheep: find and approach a sheep, then collect wool from the sheep with a shear. The prompt is shear a sheep. We initialize the agent with a shear. The manual dense reward shaping is a navigation reward based on geodesic distance obtained from the privileged LIDAR sensor, similar to “Milk Cow”.
•

Hunt Sheep: find and approach a sheep, then hunt with a sword. The sheep will run away so the agent needs to chase after it. An episode will terminate once any entity is hunted. The prompt is hunt a sheep. We initialize the agent with a diamond sword. The manual dense reward shaping consists of two parts, a valid attack reward and a navigation reward based on geodesic distance obtained from the privileged LIDAR sensor, similar to “Hunt Cow”.

Mob Combat:

fight 4 different types of hostile monsters: Spider, Zombie, Zombie Pigman (a creature in the Nether world), and Enderman (a creature in the End world). The prompt template is "Combat {monster}". For all tasks within this group, we initialize the agent with a diamond sword, a shield, and a full suite of diamond armors. The agent is spawned in the Nether for Zombie Pigman task, and in the End for Enderman. The manual dense-shaping reward can be expressed as $r_{t}=\lambda_{\text{attack}}\mathbbm{1}(\text{valid attack})+\lambda_{\text{success}}\mathbbm{1}(\texttt{\{monster\}}\text{ hunted})$ where $\lambda_{\text{attack}}=5$ and $\lambda_{\text{success}}=200$ .

Creative:

4 tasks that do not have manual dense reward shaping or code-defined success criterion.

•

Find Nether Portal: find and move close to a Nether Portal, then enter the Nether world through the portal. The prompt is find a nether portal.
•

Find Ocean: find and move close to an ocean. The prompt is find an ocean.
•

Dig Hole: dig holes in the ground. The prompt is dig a hole. We initialize the agent with an iron shovel.
•

Lay Carpet: lay down carpets to cover the wooden floor inside a house. The prompt is put carpets on the floor. We initialize the agent with a number of carpets in its inventory.

Note that we categorize “Find Nether Portal” and “Find Ocean” as Creative tasks even though they seem similar to object navigation [12]. While finding terrains and other structures is semantically well defined, it is not easy to define a function to evaluate success automatically because the simulator does not have the exact location information of these structures given a randomly generated world. In principle, we can make a sweep by querying each chunk of voxels in the world to recognize the terrains, but that would be prohibitively expensive. Therefore, we opt to use MineCLIP as the reward signal and treat these tasks as Creative.

Table A.4: Hyperparameters in RL experiments. “{state} MLP” refers to MLPs to process observations of compass, GPS, and voxel blocks. “Embed Dim” denotes the same dimension size used to embed all discrete observations into dense vectors.

NN Architecture		Training
NN Architecture		Hyperparameter	Animal-Zoo	Mob-Combat	Creative
RGB Feature Size	512	Learning Rate	$10^{-4}$	$10^{-4}$	$10^{-4}$
Task Prompt Feature Size	512	Cosine Decay Minimal LR	$5\times 10^{-6}$	$5\times 10^{-6}$	$5\times 10^{-6}$
{state} MLP Hidden Size	128	$\gamma$	$0.99$	$0.99$	$0.99$
{state} MLP Output Size	128	Entropy Weight (Stage 1)	$5\times 10^{-3}$	$5\times 10^{-3}$	$5\times 10^{-3}$
{state} MLP Hidden Depth	2	Entropy Weight (Stage 2)	$10^{-2}$	N/A	$10^{-2}$
Embed Dim	8	PPO Optimizer	Adam	Adam	Adam
Num Feature Fusion Layers	1	SI Learning Rate	$10^{-4}$	$10^{-4}$	$10^{-4}$
Feature Fusion Output Size	512	SI Cosine Decay Minimal LR	$10^{-6}$	$10^{-6}$	$10^{-6}$
Prev Action Conditioning	True	SI Epoch	10	10	10
Policy Head Hidden Size	256	SI Frequency (Env Steps)	100K	100K	100K
Policy Head Hidden Depth	3	SI Optimizer	Adam	Adam	Adam
VF Hidden Size	256	SI Buffer Threshold	$2\sigma$	$2\sigma$	$0.5\sigma$
VF Hidden Depth	3	PPO Buffer Size	100K	100K	100K
		Frame Stack	1	1	1
		VF Loss Weight	0.5	0.5	0.5
		GAE $\lambda$	0.95	0.95	0.95
		Gradient Clip	10	10	10
		PPO $\epsilon$	0.2	0.2	0.2
		Action Smooth Weight	$10^{-7}$	$10^{-7}$	$10^{-7}$
		Action Smooth Window Size	3	3	3
		MineCLIP Reward Formulation	Direct	Direct	Delta

G.2 Observation and Action Space

We use a subset of the full observation and action space listed in Sec. B.1 and B.2, because the tasks in our current experiments do not involve actions like crafting or inventory management. Our observation space consists of RGB frame, compass, GPS, and Voxels.

Our action space is a trimmed version of the full action space. It consists of movement control, camera control, “use” action, and “attack” action, which add up to 89 discrete choices. Concretely, it includes 81 actions for discrete camera control ( $9\times 9$ resulted from the Cartesian product between yaw and pitch, each ranges from $-60$ degree to $60$ degree with a discrete interval of $15$ degree). It also includes 6 movement actions (forward, forward + jump, jump, back, move left, and move right) and 2 functional actions of “use” and “attack”. Note that the “no-op” action is merged into the 81 camera actions.

G.3 RL Training

All hyperparameters used in our RL experiment are listed in Table A.4. We visualize the learned behaviors of 4 tasks in Figure 2. Demos of more tasks can be found on our website https://minedojo.org.

Action smoothing.

Due to the stochastic nature of PPO, we observe a lot of action jittering in the agent’s behavior during training. This leads to two negative effects that degrade the learning performance: 1) exploration difficulty due to inconsistent action sequence. For example, the agent may be required to take multiple consecutive attack actions in order to complete certain tasks; and 2) rapidly switching different movement and camera motions result in videos that are highly non-smooth and disorienting. This causes a domain gap from the training data of MineCLIP, which are typically smooth human gameplay videos. Therefore, the reward signal quality deteriorates significantly.

To remedy the issue, we impose an action smoothing loss to be jointly optimized with the PPO objective (Eq. 2) during training. Concretely, consider a sliding window $\mathcal{W}$ with window size $|\mathcal{W}|$ that contains $|\mathcal{W}|$ consecutive action distributions $\mathcal{W}=\{\pi_{t-|\mathcal{W}|+1},\pi_{t-|\mathcal{W}|+2},\ldots,\pi_{t}\}$ , the action smoothing loss is defined as

\mathcal{L}_{\text{smooth}}=\frac{1}{|\mathcal{W}|}\sum_{i=1}^{|\mathcal{W}|-1}KL(\pi_{t}\|\pi_{t-|\mathcal{W}|+i}),

(3)

where $KL(\cdot)$ denotes Kullback–Leibler divergence.

Multi-stage training for multi-task RL.

Due to hardware limitations, we are not able to run a large number of parallel simulators for all tasks in a task group. Therefore, we adopt a multi-stage strategy to split the tasks and train them sequentially with a single policy network. For the task groups Animal-Zoo and Creative, we split the four tasks into two stages of two parallel training tasks each. We carry over the self-imitation buffers when switching to the next stage. We also follow the recommended practice in [83] and reset the policy head at the beginning of stage 2 to encourage exploration and reduce overfitting. We adopt a similar replay buffer balancing strategy as [46] to prevent any task from dominating the training.

G.4 Evaluation

In this section, we elaborate on our human and automatic evaluation procedure for Creative tasks. We first ask the human annotators to manually label 100 successful and 100 failure trajectories. This produces a combined dataset of 200 trajectories with groundtruth binary labels to evaluate the learned reward functions. On this dataset, we run MineCLIP to produce step-wise rewards and compute a score that averages over each trajectory. We then apply K-means clustering with $K=2$ to all scores and determine a decision boundary $\delta$ from the mean of the two centroids. A trajectory with a score greater than $\delta$ is classified as successful, and vice versa for failure. In this way, we essentially convert MineCLIP to a binary classifier. The quality of MineCLIP can be measured by the F1 score of its binary classification output against the human labels. We demonstrate that MineCLIP has high agreements with humans (Table 2), and thus qualifies as an effective automatic evaluation metric for Creative tasks in the absence of human judges.

To further investigate MineCLIP’s evaluation on more complex Creative tasks, we annotate 50 YouTube video segments each for 5 more tasks that are much more semantically complex: “build a farm”, “ build a fence”, “build a house”, “ride a minecart”, and “build a swimming pool”. We then run MineCLIP evaluation on these videos against a negative set. As shown in Table A.5, though not perfect, MineCLIP generally has a positive agreement with human judgment. We note that the current MineCLIP is a proof-of-concept step in leveraging internet data for automated evaluation, and further scaling on more training data and parameters may lead to more improvements. Meanwhile, human judgment remains a useful and important alternative [93, 97].

Table A.5: MineCLIP’s evaluation on more complex Creative tasks. Numbers represent F1 scores between MineCLIP’s evaluation on tasks success and human labels. Scaled to percentage for better readability.

Tasks	Build a Farm	Build a Fence	Build a House	Ride a Minecart	Build a Swimming Pool
Ours (Attn)	${\color[rgb]{0.09,0.45,0.27}\mathbf{78.7}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{91.4}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{63.7}}$	$95.9$	$85.0$
Ours (Avg)	$73.4$	$83.1$	$37.4$	${\color[rgb]{0.09,0.45,0.27}\mathbf{96.9}}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{94.7}}$
CLIP_OpenAI	$62.5$	$24.5$	$52.9$	$70.0$	$71.7$

Appendix H Limitations and Potential Societal Impact

Unlike human demonstrations [126] or offline RL datasets [35], our YouTube dataset contains only the video screen observations but not the actual control actions. This allows us to scale up the dataset tremendously, but at the same time poses a challenge to imitation learning algorithms that require observation-action pairs to learn. Our proposed algorithm, MineCLIP, side-steps this problem by learning a reward model, but we believe that directly inferring the human expert policy from YouTube is another important direction complementary to our approach. There are promising techniques that can potentially overcome this limitation, such as the Learning-from-Observation (LfO) family of algorithms [123, 122, 115, 31].

Our database is scraped from the internet, which inevitably contains offensive YouTube videos or toxic Reddit posts. While we have made our best effort to filter out these harmful contents (Sec. D.1), there can still be undesirable biases and toxicity that elude our automatic filters. Furthermore, we advocate the use of large pre-trained language models in our main paper, and MineCLIP is finetuned from the pre-trained weights of OpenAI CLIP [92]. These foundation models are known to contain harmful stereotypes and generate hateful commentary [15, 13, 40]. We ask the researchers who will use our code and database to exercise their best judgment during new model development to avoid any negative social impact.

Appendix I Datasheet

We present a Datasheet [39] for documentation and responsible usage of our internet knowledge databases.

I.1 Motivation

For what purpose was the dataset created?

We create this internet-scale multimodal knowledge base to facilitate research towards open-ended, generally capable embodied agents.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This knowledge base was created by Linxi Fan (Nvidia), Guanzhi Wang (Caltech), Yunfan Jiang (Stanford), Ajay Mandlekar (Nvidia), Yuncong Yang (Columbia), Haoyi Zhu (SJTU), Andrew Tang (Columbia), De-An Huang (Nvidia), Yuke Zhu (Nvidia and UT Austin), and Anima Anandkumar (Nvidia and Caltech).

I.2 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes, the dataset is publicly available on the internet.

How will the dataset will be distributed (e.g., tarball on website, API, GitHub)?

All datasets can be downloaded from https://zenodo.org/. Please refer to this table of URL, DOI, and licensing:

Database	DOI	License
YouTube	10.5281/zenodo.6641142	Creative Commons Attribution 4.0 International (CC BY 4.0)
Wiki	10.5281/zenodo.6640448	Creative Commons Attribution Non Commercial Share Alike 3.0 Unported
Reddit	10.5281/zenodo.6641114	Creative Commons Attribution 4.0 International (CC BY 4.0)

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

No.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

No.

I.3 Maintenance

Who will be supporting/hosting/maintaining the dataset?

The authors will be supporting, hosting, and maintaining the dataset.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Please contact Linxi Fan (linxif@nvidia.com), Guanzhi Wang (guanzhi@caltech.edu), and Yunfan Jiang (yunfanj@cs.stanford.edu).

Is there an erratum?

No. We will make announcements if there is any.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Yes. New updates will be posted on https://minedojo.org.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?

N/A.

Will older versions of the dataset continue to be supported/hosted/maintained?

Yes, old versions will be permanently accessible on zenodo.org.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

Yes, please refer to https://minedojo.org.

I.4 Composition

What do the instances that comprise the dataset represent?

For YouTube videos, our data is in JSON format with video URLs and metadata. We do not provide the raw MP4 files for legal concerns. For Wiki, we provide the text, images, tables, and diagrams embedded on the web pages. For Reddit, our data is in JSON format with post IDs and metadata, similar to YouTube. Users can reconstruct the Reddit dataset by running our script after obtaining an official Reddit API license key.

How many instances are there in total (of each type, if appropriate)?

There are more than 730K YouTube videos with 2.2B words of transcripts, 6,735 Wiki pages with 2.2M bounding boxes of visual elements, and more than 340K Reddit posts with 6.6M comments.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

We provide all instances in our Zenodo data repositories.

Is there a label or target associated with each instance?

No.

Is any information missing from individual instances?

No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

We provide metadata for each YouTube video link and Reddit post ID.

Are there recommended data splits (e.g., training, development/validation, testing)?

No. The entire database is intended for pre-training.

Are there any errors, sources of noise, or redundancies in the dataset?

Please refer to Sec. D

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

We follow prior works [64] and only release the video URLs of YouTube videos due to legal concerns. Researchers need to acquire the MP4 and transcript files separately. Similarly, we only release the post IDs for the Minecraft Reddit database, but we also provide a script that can reconstruct the full Reddit dataset given a free official license key.

Does the dataset contain data that might be considered confidential?

No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

We have made our best efforts to detoxify the contents via an automated procedure. Please refer to Sec. D.

I.5 Collection Process

The collection procedure, preprocessing, and cleaning are explained in details in Sec. D.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

All data collection, curation, and filtering are done by MineDojo coauthors.

Over what timeframe was the data collected?

The data was collected between Dec. 2021 and May 2022.

I.6 Uses

Has the dataset been used for any tasks already?

Yes, we have used the MineDojo YouTube database for agent pre-training. Please refer to Sec. 5 and Sec. G for algorithmic and training details.

What (other) tasks could the dataset be used for?

Our knowledge base is primarily intended to facilitate research in open-ended, generally capable embodied agents. However, it can also be broadly applicable to research in video understanding, document understanding, language modeling, multimodal learning, and so on.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

No.

Are there tasks for which the dataset should not be used?

We strongly oppose any research that intentionally generates harmful or toxic contents using our YouTube, Wiki, and Reddit data.

Group 组	Tasks 任务	Ours (Attn) 我们的（注意）	Ours (Avg) 我们的（平均）	Manual Reward 手动奖励	Sparse-only 稀疏	CLIP_OpenAI 剪辑
	Milk Cow 奶牛	${\color[rgb]{0.09,0.45,0.27}\mathbf{64.5\pm 37.1}}$	$6.5\pm 3.5$	$62.8\pm 40.1$	$0.0\pm 0.0$	$0.0\pm 0.0$
	Hunt Cow 猎牛	${\color[rgb]{0.09,0.45,0.27}\mathbf{83.5\pm 7.1\hphantom{0}}}$	$0.0\pm 0.0$	$48.3\pm 35.9$	$0.3\pm 0.4$	$0.0\pm 0.0$
	Shear Sheep 剪羊毛	$12.1\pm 9.1\hphantom{0}$	$0.6\pm 0.2$	${\color[rgb]{0.09,0.45,0.27}\mathbf{52.3\pm 33.2}}$	$0.0\pm 0.0$	$0.0\pm 0.0$
	Hunt Sheep 猎羊	$8.1\pm 4.1$	$0.0\pm 0.0$	${\color[rgb]{0.09,0.45,0.27}\mathbf{41.9\pm 33.0}}$	$0.3\pm 0.4$	$0.0\pm 0.0$
	Combat Spider 战斗蜘蛛	$80.5\pm 13.0$	$60.1\pm 42.5$	${\color[rgb]{0.09,0.45,0.27}\mathbf{87.5\pm 4.6\hphantom{0}}}$	$47.8\pm 33.8$	$0.0\pm 0.0$
	Combat Zombie 战斗僵尸	$47.3\pm 10.6$	${\color[rgb]{0.09,0.45,0.27}\mathbf{72.3\pm 6.4\hphantom{0}}}$	$49.8\pm 26.9$	$\hphantom{0}8.8\pm 12.4$	$0.0\pm 0.0$
	Combat Pigman 战斗猪人	$1.6\pm 2.3$	$0.0\pm 0.0$	${\color[rgb]{0.09,0.45,0.27}\mathbf{13.6\pm 9.8\hphantom{0}}}$	$0.0\pm 0.0$	$0.0\pm 0.0$
	Combat Enderman 战斗末影人	$0.0\pm 0.0$	$0.0\pm 0.0$	$0.3\pm 0.2$	$0.0\pm 0.0$	$0.0\pm 0.0$
	Find Nether Portal 寻找下界传送门	$37.4\pm 40.8$	${\color[rgb]{0.09,0.45,0.27}\mathbf{89.8\pm 5.7\hphantom{0}}}$	N/A	N/A	$26.3\pm 32.6$
	Find Ocean 寻找海洋	$33.4\pm 45.6$	${\color[rgb]{0.09,0.45,0.27}\mathbf{54.3\pm 40.7}}$	N/A	N/A	$\hphantom{0}9.9\pm 14.1$
	Dig Hole 挖洞	${\color[rgb]{0.09,0.45,0.27}\mathbf{91.6\pm 5.9\hphantom{0}}}$	$88.1\pm 13.3$	N/A	N/A	$0.0\pm 0.0$
	Lay Carpet 铺地毯	$97.6\pm 1.9\hphantom{0}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{98.8\pm 1.0\hphantom{0}}}$	N/A	N/A	$0.0\pm 0.0$

Tasks 任务	Ours (Attn), train 我们的（注意），列车	Ours (Attn), unseen test 我们（注意），看不见的测试	CLIP_OpenAI, train 剪辑，训练	CLIP_OpenAI, unseen test CLIP，未见过的测试
Milk Cow 奶牛	$64.5\pm 37.1$	${\color[rgb]{0.09,0.45,0.27}\mathbf{64.8\pm 31.3(+\hphantom{0}0.8\%)}}$	$90.0\pm 0.4$	$29.2\pm 3.7\hphantom{0}(-67.6\%)$
Hunt Cow 猎牛	$83.5\pm 7.1\hphantom{0}$	${\color[rgb]{0.09,0.45,0.27}\mathbf{55.9\pm 7.2\hphantom{0}(-32.9\%)}}$	$72.7\pm 3.5$	$16.7\pm 1.6\hphantom{0}(-77.0\%)$
Combat Spider 战斗蜘蛛	$80.5\pm 13.0$	${\color[rgb]{0.09,0.45,0.27}\mathbf{62.1\pm 29.7(-22.9\%)}}$	$79.5\pm 2.5$	$54.2\pm 9.6\hphantom{0}(-31.8\%)$
Combat Zombie 战斗僵尸	$47.3\pm 10.6$	${\color[rgb]{0.09,0.45,0.27}\mathbf{39.9\pm 25.3(-15.4\%)}}$	$50.2\pm 7.5$	$30.8\pm 14.4(-38.6\%)$

Group 组	Tasks 任务	Single Agent on All Tasks 单一代理人处理所有任务	Original 原始的	Performance Change 性能变化
	Milk Cow 奶牛	${\color[rgb]{0.09,0.45,0.27}\mathbf{91.5\pm 0.7}}$	$64.5\pm 37.1$	$\uparrow$
	Hunt Cow 猎牛	$46.8\pm 3.7$	${\color[rgb]{0.09,0.45,0.27}\mathbf{83.5\pm 7.1\hphantom{0}}}$	$\downarrow$
	Shear Sheep 剪羊毛	${\color[rgb]{0.09,0.45,0.27}\mathbf{73.5\pm 0.8}}$	$12.1\pm 9.1\hphantom{0}$	$\uparrow$
	Hunt Sheep 猎羊	${\color[rgb]{0.09,0.45,0.27}\mathbf{27.0\pm 1.0}}$	$8.1\pm 4.1$	$\uparrow$
	Combat Spider 战斗蜘蛛	$72.1\pm 1.3$	${\color[rgb]{0.09,0.45,0.27}\mathbf{80.5\pm 13.0}}$	$\downarrow$
	Combat Zombie 战斗僵尸	$27.1\pm 2.7$	${\color[rgb]{0.09,0.45,0.27}\mathbf{47.3\pm 10.6}}$	$\downarrow$
	Combat Pigman 战斗猪人	${\color[rgb]{0.09,0.45,0.27}\mathbf{6.5\pm 1.2}}$	$1.6\pm 2.3$	$\uparrow$
	Combat Enderman 战斗末影人	$0.0\pm 0.0$	$0.0\pm 0.0$	$=$
	Find Nether Portal 寻找下界传送门	${\color[rgb]{0.09,0.45,0.27}\mathbf{99.1\pm 0.4}}$	$37.4\pm 40.8$	$\uparrow$
	Find Ocean 寻找海洋	${\color[rgb]{0.09,0.45,0.27}\mathbf{95.1\pm 1.5}}$	$33.4\pm 45.6$	$\uparrow$
	Dig Hole 挖洞	$85.8\pm 1.2$	${\color[rgb]{0.09,0.45,0.27}\mathbf{91.6\pm 5.9\hphantom{0}}}$	$\downarrow$
	Lay Carpet 铺地毯	$96.5\pm 0.8$	${\color[rgb]{0.09,0.45,0.27}\mathbf{97.6\pm 1.9}}\hphantom{0}$	$=$

Group	Tasks	Direct	Direct-Naive	Delta
	Milk Cow	${\color[rgb]{0.09,0.45,0.27}\mathbf{64.5\pm 37.1}}$	$8.6\pm 1.2$	$7.6\pm 5.2$
	Hunt Cow	${\color[rgb]{0.09,0.45,0.27}\mathbf{83.5\pm 7.1\hphantom{0}}}$	$0.0\pm 0.0$	$0.0\pm 0.0$
	Shear Sheep	${\color[rgb]{0.09,0.45,0.27}\mathbf{12.1\pm 9.1\hphantom{0}}}$	$0.8\pm 0.6$	$1.8\pm 1.5$
	Hunt Sheep	${\color[rgb]{0.09,0.45,0.27}\mathbf{8.1\pm 4.1}}$	$0.1\pm 0.2$	$0.0\pm 0.0$

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale KnowledgeMineDojo：利用互联网规模的知识构建开放式的具身化代理