We are living through an AI revolution: the past decade witnessed practically useful AI assistants, AI systems that can generate photorealistic images and videos, and even models that can predict the structure of proteins. But in spite of all these advances, human intelligence dramatically outpaces AI when it comes to the physical world. To paraphrase Moravec’s paradox, winning a game of chess or discovering a new drug represent “easy” problems for AI to solve, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. To build AI systems that have the kind of physically situated versatility that people possess, we need a new approach — we need to make AI systems embodied so that they can acquire physical intelligence.
Over the past eight months, we’ve developed a general-purpose robot foundation model that we call π0 (pi-zero). We believe this is a first step toward our long-term goal of developing artificial physical intelligence, so that users can simply ask robots to perform any task they want, just like they can ask large language models (LLMs) and chatbot assistants. Like LLMs, our model is trained on broad and diverse data and can follow various text instructions. Unlike LLMs, it spans images, text, and actions and acquires physical intelligence by training on embodied experience from robots, learning to directly output low-level motor commands via a novel architecture. It can control a variety of different robots, and can either be prompted to carry out the desired task, or fine-tuned to specialize it to challenging application scenarios. An extended article on our work can be found here.
Today’s robots are narrow specialists. Industrial robots are programmed for repetitive motions in choreographed settings, repeatedly making the same weld in the same spot on an assembly line or dropping the same item into the same box. Even such simple behaviors require extensive manual engineering, and more complex behaviors in messy real-world environments such as homes are simply infeasible. AI could change that, allowing robots to learn and follow user instructions, so that programming a new behavior is as simple as telling the robot what you want done, and the robot can itself figure out how to adapt its behavior to its environment. But this requires data. Language models and other foundation models mine data from the web, utilizing a significant fraction of all available documents. There is no such treasure trove of robot data, so to enable a robot to learn a new skill, large amounts of data need to be collected with that particular robot and for that particular application.
If we could train a single generalist robot policy that can perform a wide range of different skills and control a wide range of different robots, we would overcome this challenge: such a model would need only a little bit of data from each robot and each application. Just as a person can learn a new skill quickly by drawing on a lifetime’s worth of experience, such a generalist robot policy could be specialized to new tasks with only modest amounts of data. This would not be the first time that a generalist model beat a specialist at the specialist’s own task: language models have superseded more specialized language processing systems precisely because they can better solve those downstream specialist tasks by drawing on their diverse and general purpose pretraining. In the same way that LLMs provide a foundation model for language, these generalist robot policies will provide a robot foundation model for physical intelligence.
To get there, we will need to solve major technical challenges. Our first step is π0, a prototype model that combines large-scale multi-task and multi-robot data collection with a new network architecture to enable the most capable and dexterous generalist robot policy to date. While we believe this is only a small early step toward developing truly general-purpose robot models, we think it represents an exciting step that provides a glimpse of what is to come.
π0 uses Internet-scale vision-language pre-training, open-source robot manipulation datasets, and our own datasets consisting of dexterous tasks from 8 distinct robots. The model can then perform a wide variety of tasks, via either direct prompting or fine-tuning.
Our first prototype generalist robot policy is trained on the largest robot interaction dataset to date. The full training mixture includes both open-source data and a large and diverse dataset of dexterous tasks that we collected across 8 distinct robots.
Our dataset contains diverse tasks, with each task exhibiting a wide variety of motion primitives, many different objects, and various scenes.
The tasks in this dataset exercise different dimensions of robot dexterity while covering the range of real tasks that these robots might be asked to perform, from bussing dishes to packing items into envelopes, folding clothing, routing cables, assembling boxes, plugging in power plugs, packing food into to-go boxes, and picking up and throwing out trash. Our goal in selecting these tasks is not to solve any particular application, but to start to provide our model with a general understanding of physical interactions — an initial foundation for physical intelligence.
Beyond training on many different robots, π0 inherits semantic knowledge and visual understanding from Internet-scale pretraining by starting from a pre-trained vision-language model (VLM). VLMs are trained to model text and images on the web — widely used VLMs include GPT-4V and Gemini. We use a smaller 3 billion parameter VLM as a starting point, and adapt it for real-time dexterous robot control.
VLMs effectively transfer semantic knowledge from the web, but they are trained to output only discrete language tokens. Dexterous robot manipulation requires π0 to output motor commands at a high frequency, up to 50 times per second. To provide this level of dexterity, we developed a novel method to augment pre-trained VLMs with continuous action outputs via flow matching, a variant of diffusion models. Starting from diverse robot data and a VLM pre-trained on Internet-scale data, we train our vision-language-action flow matching model, which we can then post-train on high-quality robot data to solve a range of downstream tasks.
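To make this concrete, below is a minimal sketch of a generic conditional flow-matching (rectified-flow) training objective in Python/NumPy. Everything here is a stand-in rather than π0's actual implementation: the linear `velocity_net` takes the place of the VLM-based transformer, the 50×14 action chunk and 512-dimensional `obs` features are assumed dimensions, and π0's exact interpolation and parameterization may differ.

```python
import numpy as np

# Assumed dimensions for illustration: a 50-step chunk of 14-DoF actions,
# conditioned on a 512-d feature vector standing in for the VLM's output.
HORIZON, ACTION_DIM, OBS_DIM = 50, 14, 512
IN_DIM = HORIZON * ACTION_DIM + 1 + OBS_DIM

def velocity_net(noisy_actions, t, obs, params):
    """Stand-in for the learned vector field v_theta(a_t, t, obs).

    The real model is a VLM-based transformer; a single linear map is used
    here only so the sketch runs end to end.
    """
    x = np.concatenate([noisy_actions.ravel(), [t], obs])
    return (params @ x).reshape(HORIZON, ACTION_DIM)

def flow_matching_loss(params, actions, obs, rng):
    """One Monte Carlo sample of a conditional flow-matching loss.

    Interpolate on a straight line between Gaussian noise (t=0) and the
    expert action chunk (t=1); the network is trained to predict the
    constant velocity that carries noise to data along that line.
    """
    t = rng.uniform()                       # random time in [0, 1]
    noise = rng.standard_normal(actions.shape)
    a_t = t * actions + (1.0 - t) * noise   # point on the noise-to-data path
    v_target = actions - noise              # velocity of that path
    v_pred = velocity_net(a_t, t, obs, params)
    return np.mean((v_pred - v_target) ** 2)

# Hypothetical usage with random placeholder data:
rng = np.random.default_rng(0)
params = 0.01 * rng.standard_normal((HORIZON * ACTION_DIM, IN_DIM))
actions = rng.standard_normal((HORIZON, ACTION_DIM))  # expert action chunk
obs = rng.standard_normal(OBS_DIM)                    # stand-in VLM features
print(flow_matching_loss(params, actions, obs, rng))
```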
Our vision-language-action model uses a novel flow matching formulation, which augments a vision-language model pre-trained on Internet-scale data with continuous outputs. This enables high-frequency dexterous control, making it particularly well-suited for fine-tuning for complex robot manipulation tasks, such as folding laundry or assembling boxes.
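At inference time, actions are generated by integrating the learned velocity field from noise toward data. Continuing the sketch above (same hypothetical `velocity_net`, `params`, and dimensions), a few Euler steps yield a chunk of future actions; emitting a 50-step chunk per model call is one plausible way to sustain a 50 Hz command rate, though the chunking and step count here are assumptions of this sketch.

```python
def sample_action_chunk(params, obs, rng, n_steps=10):
    """Generate an action chunk by Euler integration of the learned ODE.

    Start from pure Gaussian noise at t=0 and step toward t=1; the result
    can then be streamed to the robot at the control frequency before the
    policy is queried again.
    """
    a = rng.standard_normal((HORIZON, ACTION_DIM))
    dt = 1.0 / n_steps
    for i in range(n_steps):
        a = a + dt * velocity_net(a, i * dt, obs, params)
    return a

chunk = sample_action_chunk(params, obs, rng)
print(chunk.shape)  # (50, 14): one command per 20 ms at 50 Hz
```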
More complex and dexterous tasks may require the model to be fine-tuned to specialize it to downstream challenges. Fine-tuning the model with high-quality data for a challenging task, such as folding laundry, is analogous to the post-training process employed by LLM designers. Pre-training teaches the model about the physical world, while fine-tuning forces it to perform a particular task well. Let’s take a look at some of these tasks.
After post-training, the robot can unload the dryer, bring the clothes over to the table, and fold the clothes into a stack. The video is uncut, from a single policy operating fully autonomously.
Laundry. We fine-tuned π0 to fold laundry, using either a mobile robot or a fixed pair of arms. The goal is to get the clothing into a neat stack. This task is exceptionally difficult for robots (...and some humans): while a single t-shirt laid flat on the table can sometimes be folded just by repeating a pre-scripted set of motions, a pile of tangled laundry can be crumpled in many different ways, so it is not enough to simply move the arms through the same motion. To our knowledge, no prior robot system has been demonstrated to perform this task at this level of complexity.
Notably, by training on diverse data, we find that the robot is able to recover when someone tries to intervene in a variety of different ways.
Table bussing. We also fine-tuned the model to bus a table. This requires the robot to pick up dishes and trash on the table, putting any dishes, cutlery, or cups into a bussing bin, and putting trash into the trash bin. This task requires the robot to handle a dizzying variety of items. One of the exciting consequences of training π0 on large and diverse datasets was the range of emergent strategies that the robot employed: instead of simply grasping each item in turn, the model could stack multiple dishes to put them into the bin together, or shake off trash from a plate into the garbage before placing the plate into the bussing bin.
Assembling a box. Here, the robot has to take a flattened cardboard box and build it, folding the sides and then tucking in the flaps. This is very difficult, because each fold and tuck might fail in unexpected ways, so the robot needs to watch its progress and adjust as it goes. It also needs to brace the box with both arms, even using the table, so that the partially folded box doesn’t come apart.
We compared π0 on our tasks to other robot foundation models that have been proposed in the academic literature: OpenVLA, a 7B parameter VLA model that uses discretized actions, and Octo, a 93M parameter model that uses diffusion outputs. These tasks are very difficult compared to those typically used in academic experiments. For example, the tasks in the OpenVLA evaluation typically consist of single-stage behaviors (e.g., “put eggplant into pot”), whereas our simplest bussing task consists of sorting multiple objects into either a garbage bin or a bussing bin, and our more complex tasks might require multiple stages, manipulation of deformable objects, and the ability to deploy one of many possible strategies given the current configuration of the environment. These tasks are evaluated according to a scoring rubric that assigns a score of 1.0 for fully successful completion, with partial credit for partially correct execution (e.g., bussing half the objects leads to a score of 0.5). The average scores across 5 evaluation tasks are shown below, comparing the full π0 pre-trained model; π0-small, a 470M parameter model that does not use VLM pre-training; OpenVLA; and Octo. Although OpenVLA and Octo can attain non-zero performance on the easiest of these tasks (“Bussing Easy”), π0 is by far the best-performing model across all of the tasks. The small version, π0-small, attains the second-best performance, but using our full-size architecture with VLM pre-training yields more than a 2x improvement in performance.
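As a concrete reading of this rubric, here is a small hypothetical scoring helper; the stage counts and example numbers are illustrative only, not our actual evaluation data.

```python
def task_score(completed_stages: int, total_stages: int) -> float:
    """Rubric score in [0, 1]: 1.0 for full success, partial credit otherwise.

    For example, bussing 5 of 10 objects scores 0.5.
    """
    return completed_stages / total_stages

def average_score(per_task_scores: list[float]) -> float:
    """Mean score across the evaluation tasks (5 tasks in the comparison)."""
    return sum(per_task_scores) / len(per_task_scores)

# Illustrative numbers only:
print(average_score([task_score(10, 10), task_score(5, 10),
                     task_score(8, 10), task_score(3, 10),
                     task_score(9, 10)]))  # -> 0.7
```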
We include detailed videos from our rigorous empirical evaluation below, with examples of successful and failed episodes for both our direct prompting (out-of-box) experiments and the fine-tuning evaluation. Complete results from all experiments can be found in the full article.
Our mission at Physical Intelligence is to develop foundation models that can control any robot to perform any task. Our experiments so far show that such models can control a variety of robots and perform tasks that no prior robot learning system has done successfully, such as folding laundry from a hamper or assembling a cardboard box. But generalist robot policies are still in their infancy, and we have a long way to go. The frontiers of robot foundation model research include long-horizon reasoning and planning, autonomous self-improvement, robustness, and safety. We expect that the coming year will see major advances along all of these directions, but the initial results paint a promising picture for the future of robot foundation models: highly capable generalist policies that inherit semantic understanding from Internet-scale pretraining, incorporate data from many different tasks and robot platforms, and enable unprecedented dexterity and physical capability.
We also think that succeeding at this will require not only new technologies and more data, but a collective effort involving the entire robotics community. We already have collaborations underway with a number of companies and robotics labs, both to refine hardware designs for teleoperation and autonomy, and to incorporate data from our partners into our pre-trained models so that we can provide access to models adapted to their specific platforms.
If you are interested in collaborating, please reach out. We are particularly excited to work with companies scaling up data collection with robots deployed for real-world applications, who are looking to collaborate on autonomy.
We are also hiring! If you'd be interested in joining us, please get in touch.
For researchers interested in our work, collaborations, or other queries, please write to research@physicalintelligence.company.