License: CC BY 4.0
arXiv:2406.09246v3 [cs.RO] 05 Sep 2024

OpenVLA:
An Open-Source Vision-Language-Action Model

Moo Jin Kim∗,1    Karl Pertsch∗,1,2    Siddharth Karamcheti∗,1,3
Ted Xiao4    Ashwin Balakrishna3    Suraj Nair3    Rafael Rafailov1    Ethan Foster1    Grace Lam
Pannag Sanketi4    Quan Vuong5,†    Thomas Kollar3    Benjamin Burchfiel3    Russ Tedrake3,6    Dorsa Sadigh1
Sergey Levine2    Percy Liang1    Chelsea Finn1
https://openvla.github.io
Abstract
∗: denotes equal contribution

Correspondence to: moojink@stanford.edu, pertsch@berkeley.edu, skaramcheti@stanford.edu

1Stanford University, 2UC Berkeley, 3Toyota Research Institute, 4Google DeepMind, 5Physical Intelligence, 6MIT, †Work done in part while at Google DeepMind

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

1 Introduction

A key weakness of learned policies for robotic manipulation is their inability to generalize beyond their training data: while existing policies trained for individual skills or language instructions have the capacity to extrapolate behaviors to new initial conditions such as object positions or lighting [2, 3], they lack robustness to scene distractors or novel objects [4, 5] and struggle to execute unseen task instructions [6, 7]. Yet beyond robotics, existing foundation models for vision and language such as CLIP [8], SigLIP [9], and Llama 2 [10] are capable of these types of generalization and more, stemming from the priors captured by their Internet-scale pretraining datasets. While reproducing this scale of pretraining for robotics is still an open challenge — even the largest robot manipulation datasets [1, 11] only have 100K to 1M examples – this imbalance suggests an opportunity: using existing foundation models for vision and language as a core building block for training robotic policies that can generalize to objects, scenes, and tasks beyond their training data.

Towards this goal, existing work has explored integrating pretrained language and vision-language models for robotic representation learning [12, 13, 14] and as a component in modular systems for task planning and execution [15, 16]. More recently, they have been used for directly learning vision-language-action models [VLAs; 7, 1, 17, 18] for control. VLAs provide a direct instantiation of using pretrained vision-and-language foundation models for robotics, directly fine-tuning visually-conditioned language models (VLMs) such as PaLI [19, 20] to generate robot control actions. By building off of strong foundation models trained on Internet-scale data, VLAs such as RT-2 [7] demonstrate impressive robustness results, as well as an ability to generalize to novel objects and tasks, setting a new standard for generalist robot policies. Yet, there are two key reasons preventing the widespread use of existing VLAs: 1) current models [7, 1, 17, 18] are closed, with limited visibility into model architecture, training procedures, and data mixture, and 2) existing works do not provide best practices for deploying and adapting VLAs to new robots, environments, and tasks — especially on commodity hardware (e.g., consumer-grade GPUs). We argue that to develop a rich foundation for future research and development, robotics needs open-source, generalist VLAs that support effective fine-tuning and adaptation, akin to the existing ecosystem around open-source language models [21, 22, 23, 24].

To this end, we introduce OpenVLA, a 7B-parameter open-source VLA that establishes a new state of the art for generalist robot manipulation policies.¹
¹OpenVLA uses multiple pretrained model components: SigLIP [9] and DinoV2 [25] vision encoders and a Llama 2 [10] language model backbone. For all three models, weights are open, but not their training data or code. We release training data, code, and model weights for reproducing OpenVLA on top of these components.

OpenVLA consists of a pretrained visually-conditioned language model backbone that captures visual features at multiple granularities, fine-tuned on a large, diverse dataset of 970k robot manipulation trajectories from the Open-X Embodiment [1] dataset — a dataset that spans a wide range of robot embodiments, tasks, and scenes. As a product of increased data diversity and new model components, OpenVLA outperforms the 55B-parameter RT-2-X model [7, 1], the prior state-of-the-art VLA, by 16.5% absolute success rate across 29 evaluation tasks on the WidowX and Google Robot embodiments. We additionally investigate efficient fine-tuning strategies for VLAs, a new contribution not explored in prior work, across 7 diverse manipulation tasks spanning behaviors from object pick-and-place to cleaning a table. We find that fine-tuned OpenVLA policies clearly outperform fine-tuned pretrained policies such as Octo [5]. Compared to from-scratch imitation learning with diffusion policies [3], fine-tuned OpenVLA shows substantial improvement on tasks involving grounding language to behavior in multi-task settings with multiple objects. Following these results, we are the first to demonstrate the effectiveness of compute-efficient fine-tuning methods leveraging low-rank adaptation [LoRA; 26] and model quantization [27] to facilitate adapting OpenVLA models on consumer-grade GPUs instead of large server nodes without compromising performance. As a final contribution, we open-source all models, deployment and fine-tuning notebooks, and the OpenVLA codebase for training VLAs at scale, with the hope that these resources enable future work exploring and adapting VLAs for robotics.

2 Related Work

Visually-Conditioned Language Models

Visually-conditioned language models (VLMs), which are trained on Internet-scale data to generate natural language from input image(s) and language prompts, have been adopted for myriad applications from visual question answering [28, 29, 30, 31] to object localization [32, 33]. One of the key advances fueling recent VLMs is a class of model architectures that bridge features from pretrained vision encoders [8, 9, 25] with pretrained language models [10, 23, 34, 35, 36], directly building on advances in both computer vision and natural language modelling to create powerful multimodal models. While early work explored various architectures for cross-attending between vision and language features [37, 38, 39, 40, 41], new open-source VLMs [42, 43, 20, 44] have converged on a simpler “patch-as-token” approach, in which patch features from pretrained visual transformers are treated as tokens, and are then projected into the input space of a language model. This simplicity makes it easy to repurpose existing tools for training language models at scale for VLM training. We employ these tools in our work to scale VLA training, and specifically use VLMs from Karamcheti et al. [44] as our pretrained backbone, as they are trained from multi-resolution visual features, fusing low-level spatial information from DINOv2 [25] with higher-level semantics from SigLIP [9] to aid in visual generalization.

Generalist Robot Policies

A recent trend in robotics is to train multi-task “generalist” robot policies [45, 46, 47, 6, 2, 48, 49] on large, diverse robot datasets [50, 51, 45, 52, 53, 54, 55, 6, 2, 56, 49, 11, 1], spanning many different robot embodiments [57, 53, 58, 59, 60, 61, 62, 63, 64, 65, 1, 5, 66]. Notably, Octo [5] trains a generalist policy that can control multiple robots out-of-the-box and allows for flexible fine-tuning to new robot setups. A key difference between these approaches and OpenVLA is the model architecture. Prior works like Octo typically compose pretrained components such as language embeddings or visual encoders with additional model components initialized from scratch [6, 2, 5], learning to “stitch” them together during the course of policy training. Unlike these works, OpenVLA adopts a more end-to-end approach, directly fine-tuning VLMs to generate robot actions by treating them as tokens in the language model vocabulary. Our experimental evaluation shows that this simple yet scalable pipeline substantially boosts performance and generalization ability over prior generalist policies.

Vision-Language-Action Models

A number of works have explored the use of VLMs for robotics, e.g., for visual state representations [12, 13], object detection [67], high-level planning [16], and for providing a feedback signal [68, 69, 70, 71]. Others integrate VLMs directly into end-to-end visuomotor manipulation policies [14, 15], but incorporate significant structure into the policy architecture or require calibrated cameras, which limits their applicability. A number of recent works have explored similar recipes to ours and directly fine-tuned large pretrained VLMs for predicting robot actions [7, 1, 72, 73, 17, 18, 74]. Such models are often referred to as vision-language-action models (VLAs), since they fuse robot control actions directly into VLM backbones. This has three key benefits: (1) it performs alignment of pretrained vision and language components on a large, Internet-scale vision-language dataset, (2) the use of a generic architecture, not custom-made for robot control, allows us to leverage the scalable infrastructure underlying modern VLM training [75, 76, 77] and scale to training billion-parameter policies with minimal code modifications, and (3) it provides a direct pathway for robotics to benefit from the rapid improvements in VLMs. Existing works on VLAs either focus on training and evaluating in single robot or simulated setups [72, 73, 78, 74] and thus lack generality, or are closed and do not support efficient fine-tuning to new robot setups [7, 1, 17, 18].

Most closely related, RT-2-X [1] trains a 55B-parameter VLA policy on the Open X-Embodiment dataset and demonstrates state-of-the-art generalist manipulation policy performance. However, our work differs from RT-2-X in multiple important aspects: (1) by combining a strong open VLM backbone with a richer robot pretraining dataset, OpenVLA outperforms RT-2-X in our experiments while being an order of magnitude smaller; (2) we thoroughly investigate fine-tuning of OpenVLA models to new target setups, while RT-2-X does not investigate the fine-tuning setting; (3) we are the first to demonstrate the effectiveness of modern parameter-efficient fine-tuning and quantization approaches for VLAs; and (4) OpenVLA is the first generalist VLA that is open-source and thus supports future research on VLA training, data mixtures, objectives, and inference.

3 The OpenVLA Model

We introduce the OpenVLA model, a 7B-parameter vision-language-action model (VLA) trained on 970k robot demonstrations from the Open X-Embodiment dataset [1]. There are many, largely unexplored, questions around best practices for developing VLA models, e.g., which model backbones, datasets, and hyperparameters to use for training. Below, we detail our approach for developing OpenVLA and summarize our key learnings. Concretely, we first provide a brief overview of modern VLMs, which form the backbone of OpenVLA (Section 3.1); then describe our basic training recipe and dataset (Section 3.2 and Section 3.3); discuss key design decisions (Section 3.4); and provide details of the infrastructure used for training and inference (Section 3.5).

Figure 1: OpenVLA model architecture. Given an image observation and a language instruction, the model predicts 7-dimensional robot control actions. The architecture consists of three key components: (1) a vision encoder that concatenates Dino V2 [25] and SigLIP [79] features, (2) a projector that maps visual features to the language embedding space, and (3) the LLM backbone, a Llama 2 7B-parameter large language model [10].

3.1 Preliminaries: Vision-Language Models

The architecture of most recent VLMs [42, 43, 20, 44] consists of three main parts (see Fig. 1): (1) a visual encoder that maps image inputs to a number of “image patch embeddings”, (2) a projector that takes the output embeddings of the visual encoder and maps them into the input space of a language model, and (3) a large language model (LLM) backbone. During VLM training, the model is trained end-to-end with a next text token prediction objective on paired or interleaved vision and language data curated from various Internet sources.

In this work, we build on the Prismatic-7B VLM [44]. Prismatic follows the same standard architecture described above, with a 600M-parameter visual encoder, a small 2-layer MLP projector, and a 7B-parameter Llama 2 language model backbone [10]. Notably, Prismatic uses a two-part visual encoder, consisting of pretrained SigLIP [79] and DinoV2 [25] models. Input image patches are passed separately through both encoders and the resulting feature vectors are concatenated channel-wise. In contrast to the more commonly used vision encoders such as CLIP- [80] or SigLIP-only encoders, the addition of DinoV2 features has been shown to be helpful for improved spatial reasoning [44], which can be particularly helpful for robot control.
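To make the fused encoder concrete, below is a minimal PyTorch sketch of the channel-wise fusion and projection described above. The feature dimensions, module layout, and dummy inputs are illustrative assumptions, not the actual Prismatic implementation.

```python
import torch
import torch.nn as nn

class FusedVisionProjector(nn.Module):
    """Sketch: concatenate SigLIP and DinoV2 patch features channel-wise, then
    project them into the LLM embedding space with a small MLP (assumed dims)."""

    def __init__(self, siglip_dim=1152, dino_dim=1024, llm_dim=4096):
        super().__init__()
        self.projector = nn.Sequential(          # 2-layer MLP projector
            nn.Linear(siglip_dim + dino_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, siglip_patches, dino_patches):
        # Both inputs: (batch, num_patches, dim), produced by the two ViT encoders.
        fused = torch.cat([siglip_patches, dino_patches], dim=-1)  # channel-wise concat
        return self.projector(fused)             # (batch, num_patches, llm_dim)

# Dummy patch features for a 224px image with patch size 14 -> 16x16 = 256 patches.
tokens = FusedVisionProjector()(torch.randn(1, 256, 1152), torch.randn(1, 256, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```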

SigLIP, DinoV2, and Llama 2 do not release details about their training data, which likely consists of trillions of tokens of Internet-sourced image-text, image-only, and text-only data respectively. The Prismatic VLM is fine-tuned on top of these components using the LLaVA 1.5 data mixture [43], which contains a total of approximately 1M image-text and text-only data samples from open-source datasets [81, 82, 29, 83, 42].

3.2 OpenVLA Training Procedure

To train OpenVLA, we fine-tune a pretrained Prismatic-7B VLM backbone for robot action prediction (see Fig. 1). We formulate the action prediction problem as a “vision-language” task, where an input observation image and a natural language task instruction are mapped to a string of predicted robot actions [7]. To enable the VLM’s language model backbone to predict robot actions, we represent the actions in the output space of the LLM by mapping continuous robot actions to discrete tokens used by the language model’s tokenizer. Following Brohan et al. [7], we discretize each dimension of the robot actions separately into one of 256 bins. For each action dimension, we set the bin width to uniformly divide the interval between the 1st and 99th quantile of the actions in the training data. Using quantiles instead of the min-max bounds Brohan et al. [7] used allows us to ignore outlier actions in the data that could otherwise drastically expand the discretization interval and reduce the effective granularity of our action discretization.

Using this discretization, we obtain N discrete integers in [0 … 255] for an N-dimensional robot action. Unfortunately, the tokenizer used by OpenVLA’s language backbone, the Llama tokenizer [10], only reserves 100 “special tokens” for tokens newly introduced during fine-tuning, which is too few for the 256 tokens of our action discretization. Instead, we again opt for simplicity and follow Brohan et al. [7]’s approach by simply overwriting the 256 least used tokens in the Llama tokenizer’s vocabulary (which correspond to the last 256 tokens) with our action tokens. Once the actions are processed into a sequence of tokens, OpenVLA is trained with a standard next-token prediction objective, evaluating the cross-entropy loss on the predicted action tokens only. We discuss key design decisions for implementing this training procedure in Section 3.4. Next, we describe the robot dataset we use for OpenVLA training.
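The discretization and token-overwriting scheme above can be sketched in a few lines of NumPy. The constants and function names below are illustrative assumptions; the released codebase contains the exact implementation.

```python
import numpy as np

VOCAB_SIZE = 32000   # size of the Llama-2 tokenizer vocabulary
N_BINS = 256         # discrete bins per action dimension

def fit_bins(train_actions):
    """Per-dimension bin edges, spread uniformly between the 1st and 99th
    quantile of the training actions (train_actions: [num_samples, action_dim])."""
    lo = np.quantile(train_actions, 0.01, axis=0)
    hi = np.quantile(train_actions, 0.99, axis=0)
    return np.linspace(lo, hi, N_BINS + 1, axis=-1)   # [action_dim, N_BINS + 1]

def action_to_tokens(action, bin_edges):
    """Map one continuous action ([action_dim]) to token ids by reusing the
    256 least-used (i.e., last) entries of the LLM vocabulary."""
    bins = np.stack([
        np.clip(np.digitize(a, edges[1:-1]), 0, N_BINS - 1)
        for a, edges in zip(action, bin_edges)
    ])
    return VOCAB_SIZE - N_BINS + bins   # token ids in [31744, 31999]

# Example: 7-D actions with dummy training data; the training loss is then a
# standard next-token cross-entropy evaluated on these action-token positions only.
edges = fit_bins(np.random.uniform(-1.0, 1.0, size=(10_000, 7)))
print(action_to_tokens(np.zeros(7), edges))
```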

3.3 Training Data

The goal in constructing the OpenVLA training dataset is to capture a large diversity of robot embodiments, scenes, and tasks. This enables the final model to control various robots out of the box and admits efficient fine-tuning to new robot setups. We leverage the Open X-Embodiment dataset [1] (OpenX) as a base to curate our training dataset. The full OpenX dataset, at the time of writing, consists of more than 70 individual robot datasets, with more than 2M robot trajectories, that were pooled into a coherent and easy-to-use data format in a large community effort. To make training on this data practical, we apply multiple steps of data curation to the raw dataset.

The goals of this curation are to ensure (1) a coherent input and output space across all training datasets, and (2) a balanced mix of embodiments, tasks, and scenes in the final training mixture.²
²Octo [5] demonstrated training across datasets with heterogeneous sensory inputs. While very promising, we leave an investigation of VLA training across heterogeneous sensor modalities and action spaces to future work.

To address (1), we follow [1, 5] and restrict our training dataset to contain only manipulation datasets that have at least one 3rd-person camera and use single-arm end-effector control. For (2), we leverage the data mixture weights of Octo [5] for all datasets that pass the first round of filtering. Octo heuristically down-weights or removes less diverse datasets and up-weights datasets with larger task and scene diversity; see Octo Model Team et al. [5] for details.

We also experimented with incorporating a few additional datasets into our training mixture that were added to the OpenX dataset since the release of Octo, including the DROID dataset [11], although at a conservative mixture weight of 10%. In practice, we found that the action token accuracy on DROID remained low throughout training, suggesting a larger mixture weight or model may be required to fit its diversity in the future. To not jeopardize the quality of the final model, we removed DROID from the data mixture for the final third of training. We provide a complete overview of the used datasets and mixture weights in Appendix A.
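In effect, the curation above reduces to filtering the raw OpenX datasets and then sampling from the survivors according to mixture weights. The sketch below illustrates such weighted sampling; the dataset names and weights are made-up placeholders rather than the actual mixture, which is listed in Appendix A.

```python
import numpy as np

# Hypothetical post-filtering mixture: dataset name -> sampling weight.
MIXTURE = {
    "bridge_v2": 0.25,
    "rt1_fractal": 0.20,
    "droid_subset": 0.10,   # conservatively down-weighted, as discussed above
    "other_openx": 0.45,
}

def sample_dataset(rng: np.random.Generator) -> str:
    """Choose which dataset the next training trajectory is drawn from."""
    names = list(MIXTURE)
    probs = np.array([MIXTURE[n] for n in names], dtype=np.float64)
    return rng.choice(names, p=probs / probs.sum())   # normalize defensively

rng = np.random.default_rng(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_dataset(rng)] += 1
print(counts)   # empirical counts roughly track the mixture weights
```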

3.4 OpenVLA Design Decisions

When developing the OpenVLA model, we explored various design decisions in smaller-scale experiments before starting the final model training run. Concretely, we trained and evaluated OpenVLA models on BridgeData V2 [6] for our initial experiments, instead of training on the full OpenX mixture, to increase iteration speed and reduce computational cost. We summarize key learnings from these explorations below.

VLM Backbone. Initially, we experimented with multiple VLM backbones. Apart from Prismatic [44], we tested fine-tuning IDEFICS-1 [84] and LLaVA [85] for robot action prediction. We found that LLaVA and IDEFICS-1 performed comparably on tasks with only one object in the scene, but LLaVA demonstrated stronger language grounding in tasks that involved multiple objects in the scene and required the policy to manipulate the correct object, i.e., the object specified in the language instruction. Concretely, LLaVA improved upon IDEFICS-1 by 35% in absolute success rate, averaged across five language grounding tasks in a BridgeData V2 sink environment. The fine-tuned Prismatic VLM policy achieved further improvements, outperforming the LLaVA policy by roughly 10% in absolute success rate across both simple single-object tasks and multi-object, language grounding tasks. We attribute this performance delta to improved spatial reasoning capabilities afforded by the fused SigLIP-DinoV2 backbones (see Section 3.1). In addition to the performance enhancements, Prismatic also provides a modular and easy-to-use codebase, so we ultimately chose it to be the backbone for the OpenVLA model.

Image Resolution. The resolution of input images has significant impact on the computational requirements of VLA training, since higher-resolution images result in more image patch tokens and thus longer context lengths that quadratically increase training compute. We compared VLAs with 224×224 px and 384×384 px inputs, but found no performance difference in our evaluations, while the latter takes 3x longer to train. We thus opt for a resolution of 224×224 px for the final OpenVLA model. Note that on many VLM benchmarks, increased resolution does improve performance [44, 86, 87], but we did not see this trend (yet) for VLAs.
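To make the context-length argument concrete, the following sketch computes the patch-token counts, assuming the ViT patch size of 14 used by the SigLIP and DinoV2 backbones (an assumption for illustration).

```python
# Patch-token count for square inputs, assuming a ViT patch size of 14.
def num_patch_tokens(resolution: int, patch_size: int = 14) -> int:
    return (resolution // patch_size) ** 2

for res in (224, 384):
    print(res, num_patch_tokens(res))   # 224 -> 256 tokens, 384 -> 729 tokens

# Self-attention cost grows roughly quadratically with sequence length, so the
# ~2.8x increase in patch tokens makes each training step substantially slower.
```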

Fine-Tuning Vision Encoder. Prior work on VLMs found that freezing vision encoders during VLM training typically leads to higher performance [44]. Intuitively, a frozen vision encoder may better preserve the robust features learned from its Internet-scale pretraining. However, we found fine-tuning the vision encoder during VLA training to be crucial for good VLA performance. We hypothesize that the pretrained vision backbone may not capture sufficient fine-grained spatial details about important parts of the scene to enable precise robotic control.

Training Epochs. Typical LLM or VLM training runs complete at most one or two epochs through their training dataset. In contrast, we found it important for VLA training to iterate through the training dataset significantly more times, with real robot performance continually increasing until training action token accuracy surpasses 95%. Our final training run completes 27 epochs through its training dataset.

Learning Rate. We swept the learning rate across multiple orders of magnitude for VLA training, and achieved the best results using a fixed learning rate of 2e-5 (the same learning rate used during VLM pretraining [44]). We did not find learning rate warmup to provide benefits.

3.5 Infrastructure for Training and Inference

The final OpenVLA model is trained on a cluster of 64 A100 GPUs for 14 days, or a total of 21,500 A100-hours, using a batch size of 2048. During inference, OpenVLA requires 15GB of GPU memory when loaded in bfloat16 precision (i.e., without quantization) and runs at approximately 6Hz on one NVIDIA RTX 4090 GPU (without compilation, speculative decoding, or other inference speed-up tricks). We can further reduce the memory footprint of OpenVLA during inference via quantization, without compromising performance in real-world robotics tasks, as shown in Section 5.4. We report inference speed on various consumer- and server-grade GPUs in Fig. 5. For convenience, we implement a remote VLA inference server to allow real-time remote streaming of action predictions to the robot – removing the requirement of having access to a powerful local compute device to control the robot. We release this remote inference solution as part of our open-source code release (Section 4).
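As an illustration of the remote inference setup, here is a hypothetical client-side sketch: the robot workstation sends the current camera frame and instruction to the inference server and receives a 7-D action in return. The URL, route, and payload schema are assumptions for illustration and do not reflect the exact API of the released server.

```python
import base64
import numpy as np
import requests

# Hypothetical endpoint of the remote VLA inference server (assumed, not the real API).
SERVER_URL = "http://vla-server.local:8000/act"

def query_vla(image_rgb: np.ndarray, instruction: str) -> np.ndarray:
    """Send one observation + instruction, get back a 7-D robot action."""
    payload = {
        "instruction": instruction,
        "image_shape": list(image_rgb.shape),                       # e.g., [224, 224, 3]
        "image_b64": base64.b64encode(image_rgb.astype(np.uint8).tobytes()).decode(),
    }
    response = requests.post(SERVER_URL, json=payload, timeout=5.0)
    response.raise_for_status()
    return np.asarray(response.json()["action"], dtype=np.float32)  # 7-D action

# Typical control loop on the robot side (camera/robot objects are placeholders):
#   while not done:
#       action = query_vla(camera.read(), "put the eggplant in the pot")
#       robot.step(action)
```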

4 The OpenVLA Codebase

Along with our model, we release the OpenVLA codebase, a modular PyTorch codebase for training VLA models (see https://openvla.github.io). It scales from fine-tuning VLAs on individual GPUs to training billion-parameter VLAs on multi-node GPU clusters, and supports modern techniques for large transformer model training such as automatic mixed precision (AMP, PyTorch [75]), FlashAttention [76], and fully sharded data parallelism (FSDP, Zhao et al. [77]). Out of the box, the OpenVLA codebase has full support for training on the Open X dataset, integrates with HuggingFace’s [21] AutoModel class, and supports LoRA fine-tuning [26] and quantized model inference [88, 27].
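As a usage example of the HuggingFace integration, the sketch below loads an OpenVLA checkpoint and queries it for a single action. The checkpoint id, prompt format, and the predict_action/unnorm_key arguments shown here are assumptions to be checked against the released README.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"   # assumed checkpoint id; see the project page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("observation.png")   # third-person RGB observation
prompt = "In: What action should the robot take to pick up the carrot?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action decodes the generated action tokens back into a continuous 7-D
# end-effector action, un-normalized with dataset-specific statistics (unnorm_key).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```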

5 Experiments

The goal of our experimental evaluations is to test OpenVLA’s ability to serve as a powerful multi-robot control policy out of the box, as well as be a good initialization for fine-tuning to new robot tasks. Concretely, we aim to answer the following questions:

  1. How does OpenVLA compare to prior generalist robot policies, when evaluating on multiple robots and various types of generalization?

  2. Can OpenVLA be effectively fine-tuned on a new robot setup and task, and how does it compare to state-of-the-art data-efficient imitation learning approaches?

  3. Can we use parameter-efficient fine-tuning and quantization to reduce the computational requirements for training and inference of OpenVLA models and make them more accessible? What are the performance-compute trade-offs?

5.1 Direct Evaluations on Multiple Robot Platforms

Figure 2: BridgeData V2 WidowX robot evaluation tasks and results. We evaluate OpenVLA and prior state-of-the-art generalist robot policies on a comprehensive suite of tasks covering several axes of generalization, as well as tasks that specifically assess language conditioning ability. OpenVLA achieves highest overall performance and even outperforms closed-source model RT-2-X in all categories except for semantic generalization. Average success rates ± StdErr are computed across 170 total rollouts per approach. See Table 4 for detailed results.

Robot Setups and Tasks. We evaluate OpenVLA’s performance “out-of-the-box” on two robot embodiments: the WidowX robot from the BridgeData V2 evaluations [6] (see the teaser figure, left) and the mobile manipulation robot from the RT-1 and RT-2 evaluations [2, 7] (“Google robot”; see the teaser figure, middle). Both platforms have been extensively used in prior works for evaluating generalist robot policies [2, 7, 1, 5]. We define a comprehensive set of evaluation tasks in each environment that covers various axes of generalization, such as visual (unseen backgrounds, distractor objects, colors/appearances of objects); motion (unseen object positions/orientations); physical (unseen object sizes/shapes); and semantic (unseen target objects, instructions, and concepts from the Internet) generalization. We also assess language conditioning ability in scenes with multiple objects, testing whether the policy can manipulate the correct target object, as specified in the user’s prompt. See the bottom row of Fig. 2 and Fig. 3 for example task images in the BridgeData V2 and Google robot evaluations, respectively. Overall, we evaluated each method in 170 rollouts (17 tasks with 10 trials each) for BridgeData V2 experiments and 60 rollouts (12 tasks with 5 trials each) for Google robot experiments. A detailed breakdown of all tasks and how they differ from the training data is in Appendix B. All evaluations in this and the following sections are conducted as A/B evaluations, using the same tasks with the same sets of initial robot and object states, to ensure fair comparison.

Comparisons. We compare OpenVLA’s performance to three prior generalist manipulation policies: RT-1-X [1], RT-2-X [1], and Octo [5]. RT-1-X (35M parameters) and Octo (93M parameters) are transformer policies trained from scratch on subsets of the OpenX dataset; Octo is the state-of-the-art model among open-source manipulation policies. RT-2-X (55B parameters) is a state-of-the-art, closed-source VLA that leverages Internet-pretrained vision and language backbones.

The results are summarized in Fig. 2 for BridgeData V2 evaluations and Fig. 3 for Google robot evaluations (per-task breakdown in Appendix, Table 4 and Table 6). We find that both RT-1-X and Octo struggle on the tested tasks, often failing to manipulate the correct object, especially when distractors are present, and in some cases causing the robot to wave its arm around aimlessly. Note that our evaluations test even larger degrees of generalization than the evaluations performed in those prior works to challenge the Internet-pretrained VLA models. Thus, lower performance of models without Internet pretraining is expected. RT-2-X clearly outperforms both RT-1-X and Octo, demonstrating the benefits of large, pretrained VLMs for robotics.

Figure 3: Google robot evaluation results. We evaluate generalist robot policies on in-distribution and out-of-distribution (OOD) tasks on the mobile manipulator used in RT-1 and RT-2 evaluations [2, 7]. We find that OpenVLA and RT-2-X attain comparable performance and significantly outperform RT-1-X and Octo overall. Average success rates ± StdErr are computed across 60 total rollouts per approach. See Table 6 for detailed results.

Notably, OpenVLA performs comparably to RT-2-X on Google robot evaluations and significantly outperforms RT-2-X on BridgeData V2 evaluations despite being an order of magnitude smaller (7B vs. 55B parameters). Qualitatively, we find that both RT-2-X and OpenVLA exhibit markedly more robust behaviors than the other tested models, such as approaching the correct object when distractor objects are present, properly orienting the robot’s end-effector to align with the orientation of the target object, and even recovering from mistakes such as insecurely grasping objects (see https://openvla.github.io for qualitative rollout examples). RT-2-X achieves higher performance in semantic generalization tasks, as shown in Fig. 2, which is expected given that it uses larger-scale Internet pretraining data and is co-fine-tuned with both robot action data and Internet pretraining data to better preserve the pretraining knowledge, rather than being fine-tuned solely on robot data, like OpenVLA. However, OpenVLA performs comparably or better in all other task categories in both BridgeData V2 and Google robot evaluations. The performance difference can be attributed to a combination of factors: we curated a much larger training dataset for OpenVLA with 970k trajectories (vs. 350k for RT-2-X); we performed more careful cleaning of the training dataset and, e.g., filtered out all-zero actions in the Bridge dataset (see Appendix C for a detailed discussion); and OpenVLA uses a fused vision encoder that combines pretrained semantic and spatial features. See Appendix D for ablation analyses of these components.

5.2 Data-Efficient Adaptation to New Robot Setups

While prior works mainly focused on directly evaluating VLAs “out-of-the-box” [16, 7, 1], effective fine-tuning of VLA models to new tasks and robot setups is largely unexplored, yet is key for their widespread adoption. In this section, we investigate OpenVLA’s ability to be quickly adapted to a new real-world robot setup. (See Appendix E for fine-tuning experiments in simulation.)

Robot setups and tasks. We test a simple fine-tuning recipe for the OpenVLA model: full fine-tuning of all model parameters, using small datasets with 10–150 demonstrations of a target task (see Fig. 4; we explore parameter-efficient fine-tuning approaches in Section 5.3). We test OpenVLA in two setups: Franka-Tabletop, a stationary, table-mounted Franka Emika Panda 7-DoF robot arm; and Franka-DROID, the Franka robot arm setup from the recently released DROID dataset [11], mounted on a movable standing desk. The setups use 5Hz and 15 Hz non-blocking controllers, respectively. We choose Franka robot arms as the target embodiment for our fine-tuning experiments since they are widely used in the robot learning community and thus a likely “target” of OpenVLA fine-tuning. We test on setups with different control frequencies to test OpenVLA’s applicability to a range of use cases.

Figure 4: Adapting to new robot setups. We evaluate the state-of-the-art Diffusion Policy trained from scratch on seven Franka Emika Panda tasks (10–150 demonstrations each), as well as generalist robot policies Octo and OpenVLA fine-tuned on the same data. Diffusion Policy exhibits strong performance on narrow single-instruction tasks, while Octo and OpenVLA perform better on diverse fine-tuning tasks involving multiple instructions and distractor objects. Overall, OpenVLA achieves highest aggregate performance across both setups, suggesting that it is an effective default for learning a policy on a downstream task. Average success rates ± StdErr are computed across 129 rollouts per approach (99 for Franka-Tabletop tasks and 30 for Franka-DROID tasks). See Table 7 for detailed results.

Comparisons. We compare to Diffusion Policy [3], a state-of-the-art data-efficient imitation learning approach, trained from scratch. We also compare to Diffusion Policy (matched), a version of Diffusion Policy that matches the input and output specifications of OpenVLA.³
³The full Diffusion Policy uses a two-step observation history with both images and proprioceptive state, and performs receding horizon control by predicting a chunk of T future actions and executing the first X actions in open-loop fashion before predicting the next chunk (for 15Hz control, we set T=16, X=8 like in the DROID prior work [11]; for 5Hz control, we reduce the chunk sizes to T=8, X=3). It is also the only method in Section 5.2 that predicts absolute Cartesian coordinates to control the robot; all other methods use relative position control. Diffusion Policy (matched) uses a single image as input, has no proprioceptive information and no observation history, and predicts a single relative position control action without action chunking.

Additionally, we evaluate Octo [5] fine-tuned on the target dataset, since it is currently the best generalist policy that supports fine-tuning (fine-tuning of RT-2-X is not supported through its inference API). We also fine-tune OpenVLA on the same target dataset, and the resulting policy is denoted by OpenVLA. Finally, as an ablation experiment, we compare to OpenVLA (scratch), where we directly fine-tune the underlying base Prismatic VLM on the target robot setup – rather than fine-tuning the OpenX-pretrained OpenVLA model – to assess the benefit of large-scale robot pretraining.

We present the results in Fig. 4 (per-task breakdown in Appendix, Table 7). We find that both versions of Diffusion Policy are competitive with or outperform the generalist policies Octo and OpenVLA on narrower single-instruction tasks like “Put Carrot in Bowl” and “Pour Corn into Pot”, but the pretrained generalist policies perform better in more diverse fine-tuning tasks that involve multiple objects in the scene and require language conditioning. OpenX pretraining for Octo and OpenVLA enables the models to better adapt to these more diverse tasks where language grounding is important; we see evidence for this in the lower performance of OpenVLA (scratch).

Overall, we find that OpenVLA achieves the highest average performance. Notably, most prior works achieve strong performance only in either narrow single-instruction or diverse multi-instruction tasks, resulting in widely varying success rates. OpenVLA is the only approach that achieves at least 50% success rate across all tested tasks, suggesting that it can be a strong default option for imitation learning tasks, particularly if they involve a diverse set of language instructions. For narrower but highly dexterous tasks, Diffusion Policy still shows smoother and more precise trajectories; incorporating action chunking and temporal smoothing, as implemented in Diffusion Policy, may help OpenVLA attain the same level of dexterity and may be a promising direction for future work (see Section 6 for a detailed discussion of current limitations).

5.3 Parameter-Efficient Fine-Tuning

The full fine-tuning runs of OpenVLA in the previous section used 8 A100 GPUs for 5-15 hours per task (depending on the dataset size) to achieve high performance. While this is substantially less compute than what is required for VLA pretraining, in this section we explore even more compute- and parameter-efficient fine-tuning approaches and investigate their effectiveness.

Table 1: Parameter-efficient fine-tuning evaluation. LoRA fine-tuning achieves the best performance-compute trade-off, matching full fine-tuning performance while training only 1.4% of the model parameters. Mean success ± StdErr computed across 33 rollouts per approach on select Franka-Tabletop tasks (see Table 8 for details).

Strategy         | Success Rate | Train Params (×10⁶) | VRAM (batch 16)
Full FT          | 69.7 ± 7.2 % | 7,188.1             | 163.3 GB*
Last layer only  | 30.3 ± 6.1 % | 465.1               | 51.4 GB
Frozen vision    | 47.0 ± 6.9 % | 6,760.4             | 156.2 GB*
Sandwich         | 62.1 ± 7.9 % | 914.2               | 64.0 GB
LoRA, rank=32    | 68.2 ± 7.5 % | 97.6                | 59.7 GB
LoRA, rank=64    | 68.2 ± 7.8 % | 195.2               | 60.5 GB

*: Sharded across 2 GPUs with FSDP [77].

Concretely, we compare the following fine-tuning approaches: full fine-tuning updates all weights during fine-tuning, as described in Section 5.2; last layer only fine-tunes only the last layer of OpenVLA’s transformer backbone and the token embedding matrix; frozen vision freezes the vision encoder but fine-tunes all other weights; sandwich fine-tuning unfreezes the vision encoder, token embedding matrix, and last layer; and LoRA uses the popular low-rank adaptation technique of Hu et al. [26] with multiple rank values r, applied to all linear layers of the model.

We report fine-tuning success rates across multiple Franka-Tabletop tasks, as well as training parameter count and GPU memory requirements, in Table 1.⁴
⁴In Section 5.3 and Section 5.4, we experiment with a version of the OpenVLA model that is pretrained with a smaller robot data mixture (the same OpenX dataset mixture as Octo) and has a slightly smaller architecture which only uses a SigLIP [79] vision backbone instead of the fused DinoSigLIP encoder. We find that this simpler architecture still achieves strong performance in both fine-tuning tasks and “out-of-the-box” tasks.
5.3 节和第 5.4 节  中,我们试验了一个 OpenVLA 模型版本,该版本使用较小的机器人数据混合(与 Octo 相同的 OpenX 数据集混合)进行预训练,并且具有稍小的架构,仅使用 SigLIP [79] 视觉主干而不是融合的 DinoSigLIP 编码器。我们发现,这种更简单的架构在微调任务和 “开箱即用” 任务中仍然实现了强大的性能。

We find that only fine-tuning the network’s last layer or freezing the vision encoder leads to poor performance, suggesting that further adaptation of the visual features to the target scene is crucial. In contrast, “sandwich fine-tuning” achieves better performance since it fine-tunes the vision encoder, and it consumes less GPU memory since it does not fine-tune the full LLM backbone. Lastly, LoRA achieves the best trade-off between performance and training memory consumption, outperforming “sandwich fine-tuning” and matching full fine-tuning performance while fine-tuning only 1.4% of the parameters. We find that the LoRA rank has negligible effect on policy performance and thus recommend using a default rank of r=32. With LoRA, we can fine-tune OpenVLA on a new task within 10-15 hours on a single A100 GPU – an 8x reduction in compute compared to full fine-tuning.
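A minimal sketch of the recommended LoRA configuration, applied to all linear layers via the peft library, is shown below; hyperparameters other than the rank (e.g., lora_alpha, dropout) are illustrative assumptions rather than the exact values from our runs.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load the pretrained VLA in bfloat16 (checkpoint id assumed as above).
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA over all linear layers with the recommended default rank r=32.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,            # illustrative value
    lora_dropout=0.0,         # illustrative value
    target_modules="all-linear",
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()   # only a small fraction of the 7B weights train
```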

5.4 Memory-Efficient Inference via Quantization

Figure 5: OpenVLA inference speed for various GPUs. Both bfloat16 and int4 quantization achieve high throughput, especially on GPUs with Ada Lovelace architecture (RTX 4090, H100). Further speed-ups are possible with modern LLM inference frameworks like TensorRT-LLM [89]. ♠: Model sharded across two GPUs to fit.
Precision | Bridge Success | VRAM
bfloat16  | 71.3 ± 4.8%    | 16.8 GB
int8      | 58.1 ± 5.1%    | 10.2 GB
int4      | 71.9 ± 4.7%    | 7.0 GB

Table 2: Performance with quantized inference. 4-bit quantization matches the performance of bfloat16 inference (our default approach) while reducing the GPU memory footprint by more than half. Mean success ± StdErr computed across 8 representative BridgeData V2 tasks [6] and 80 rollouts per approach (see Table 5 for details).

OpenVLA, a 7B-parameter model, consumes more memory at inference time than prior open-source generalist policies such as Octo, which has <100M parameters. We follow best practices from LLM serving by saving and loading OpenVLA in bfloat16 precision for inference (our default approach), which cuts the memory footprint in half, allowing us to serve OpenVLA on GPUs with only 16GB of GPU memory. In this section, we test whether we can further reduce the required memory for policy inference and broaden accessibility of VLA policies, by using modern quantization techniques developed for serving LLMs [88, 27]. These approaches load the weights of the network at lower precision, thereby trading off reduced memory requirements for potentially reduced inference speed and accuracy.
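Concretely, 4-bit serving can be set up through the standard transformers/bitsandbytes quantization interface, as in the sketch below; the checkpoint id and flags are assumptions based on that interface rather than a prescribed recipe.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "openvla/openvla-7b"   # assumed checkpoint id

# Quantize weights to 4 bits; compute still happens in bfloat16.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=quant_cfg,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
# Inference then proceeds exactly as with the bfloat16 model, at a fraction of the
# GPU memory footprint (roughly the 7 GB reported in Table 2).
```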
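As a concrete illustration of the default serving setup, the following sketch loads the policy in bfloat16 through the Hugging Face transformers interface; the checkpoint id and loading flags are assumptions rather than a prescribed recipe.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# bfloat16 halves the weight memory relative to float32 (roughly 16 GB for a 7B model).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")
```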

Concretely, we investigate serving the OpenVLA model with 8-bit and 4-bit precision on 8 representative BridgeData V2 tasks. We report memory footprint and rollout performance in Table 2. We also report achievable control frequencies on various consumer- and server-grade GPUs in Fig. 5. We observe that 8-bit quantization slows down inference across most GPUs, due to the overhead of the added quantization operations. 4-bit inference achieves higher throughput, since reduced GPU memory transfer compensates for the quantization overhead.
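For readers who want to reproduce this kind of throughput measurement, a rough sketch is given below: it times repeated action predictions on a fixed, preprocessed observation and reports steps per second. The generation call and the seven-token action assumption are illustrative; the exact benchmarking harness may differ.

```python
import time
import torch

@torch.inference_mode()
def control_frequency_hz(model, inputs, n_steps: int = 50) -> float:
    """Estimate policy steps per second; `model` and `inputs` are assumed preloaded."""
    for _ in range(5):                                # warm-up iterations
        model.generate(**inputs, max_new_tokens=7)    # 7 action tokens assumed per step
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        model.generate(**inputs, max_new_tokens=7)
    torch.cuda.synchronize()
    return n_steps / (time.perf_counter() - start)
```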

As a result of the reduced inference speed, we observe a substantial performance decrease with 8-bit quantization: on the A5000 GPU we use for our evaluations, we can only run the model at 1.2Hz, which significantly changes the system dynamics compared to the training dataset for the 5Hz non-blocking controller used in the BridgeData V2 tasks.
Footnote 5: We attribute the performance loss to low inference speed, since both 8-bit and 4-bit quantization achieve comparable token accuracy to bfloat16 inference when evaluated offline on training data. See Section D.4 for supporting details.

Notably, 4-bit quantization results in similar performance to bfloat16 half-precision inference despite requiring less than half the amount of GPU memory. 4-bit quantized models can run at 3Hz on the A5000, thus more closely matching the system dynamics during data collection.
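The 4-bit setting above can be approximated with off-the-shelf quantized loading, for example via bitsandbytes as exposed through transformers; the sketch below is a minimal example and the specific quantization settings are assumptions.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Store weights in 4-bit while running compute in bfloat16 (settings are illustrative).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_config,
    trust_remote_code=True,
)
```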

6 Discussion and Limitations

In this work, we presented OpenVLA, a state-of-the-art, open-source vision-language-action model that obtains strong performance for cross-embodiment robot control out-of-the-box. We also demonstrated that OpenVLA can be easily adapted to new robot setups via parameter-efficient fine-tuning techniques.

The current OpenVLA model has several limitations. First, it supports only single-image observations. In practice, real-world robot setups are heterogeneous, with a wide range of possible sensory inputs [5]. Expanding OpenVLA to support multiple image and proprioceptive inputs as well as observation history is an important avenue for future work. Exploring the use of VLMs pretrained on interleaved image and text data may facilitate such flexible-input VLA fine-tuning.

Secondly, improving the inference throughput of OpenVLA is critical to enable VLA control for high-frequency control setups such as ALOHA [90], which runs at 50Hz. This will also enable testing VLAs on more dexterous, bi-manual manipulation tasks than what we investigated in this work. Exploring the use of action chunking or alternative inference-time optimization techniques such as speculative decoding [91] offer potential remedies.

Additionally, there is room for further performance improvements. While OpenVLA outperforms prior generalist policies, it does not yet offer very high reliability on the tested tasks, typically achieving <90% success rate.

Finally, due to compute limitations, many VLA design questions remain underexplored: What effect does the size of the base VLM have on VLA performance? Does co-training on robot action prediction data and Internet-scale vision-language data substantially improve VLA performance? What visual features are best-suited for VLA models? We hope that the release of the OpenVLA model and codebase will enable the community to jointly investigate these questions.

Acknowledgments

We are grateful to the Toyota Research Institute for providing significant funding and compute resources required to carry out this research. We also thank the Stanford Center for Research on Foundation Models for providing additional compute resources and Google DeepMind for alpha access to the RT-2-X API for our evaluations. We acknowledge additional support from Volkswagen, Physical Intelligence, ONR grants N00014-22-1-2621 and N00014-22-1-2293, the National Science Foundation through IIS-2246811, and DARPA ANSR.

References

  • Open X-Embodiment Collaboration et al. [2023]
    Open X-Embodiment Collaboration, A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, A. Raffin, A. Wahid, B. Burgess-Limerick, B. Kim, B. Schölkopf, B. Ichter, C. Lu, C. Xu, C. Finn, C. Xu, C. Chi, C. Huang, C. Chan, C. Pan, C. Fu, C. Devin, D. Driess, D. Pathak, D. Shah, D. Büchler, D. Kalashnikov, D. Sadigh, E. Johns, F. Ceola, F. Xia, F. Stulp, G. Zhou, G. S. Sukhatme, G. Salhotra, G. Yan, G. Schiavi, H. Su, H.-S. Fang, H. Shi, H. B. Amor, H. I. Christensen, H. Furuta, H. Walke, H. Fang, I. Mordatch, I. Radosavovic, I. Leal, J. Liang, J. Kim, J. Schneider, J. Hsu, J. Bohg, J. Bingham, J. Wu, J. Wu, J. Luo, J. Gu, J. Tan, J. Oh, J. Malik, J. Tompson, J. Yang, J. J. Lim, J. Silvério, J. Han, K. Rao, K. Pertsch, K. Hausman, K. Go, K. Gopalakrishnan, K. Goldberg, K. Byrne, K. Oslund, K. Kawaharazuka, K. Zhang, K. Majd, K. Rana, K. Srinivasan, L. Y. Chen, L. Pinto, L. Tan, L. Ott, L. Lee, M. Tomizuka, M. Du, M. Ahn, M. Zhang, M. Ding, M. K. Srirama, M. Sharma, M. J. Kim, N. Kanazawa, N. Hansen, N. Heess, N. J. Joshi, N. Suenderhauf, N. D. Palo, N. M. M. Shafiullah, O. Mees, O. Kroemer, P. R. Sanketi, P. Wohlhart, P. Xu, P. Sermanet, P. Sundaresan, Q. Vuong, R. Rafailov, R. Tian, R. Doshi, R. Martín-Martín, R. Mendonca, R. Shah, R. Hoque, R. Julian, S. Bustamante, S. Kirmani, S. Levine, S. Moore, S. Bahl, S. Dass, S. Song, S. Xu, S. Haldar, S. Adebola, S. Guist, S. Nasiriany, S. Schaal, S. Welker, S. Tian, S. Dasari, S. Belkhale, T. Osa, T. Harada, T. Matsushima, T. Xiao, T. Yu, T. Ding, T. Davchev, T. Z. Zhao, T. Armstrong, T. Darrell, V. Jain, V. Vanhoucke, W. Zhan, W. Zhou, W. Burgard, X. Chen, X. Wang, X. Zhu, X. Li, Y. Lu, Y. Chebotar, Y. Zhou, Y. Zhu, Y. Xu, Y. Wang, Y. Bisk, Y. Cho, Y. Lee, Y. Cui, Y. hua Wu, Y. Tang, Y. Zhu, Y. Li, Y. Iwasawa, Y. Matsuo, Z. Xu, and Z. J. Cui. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
  • Brohan et al. [2022]  Brohan 等人 [2022] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J. Quiambao, K. Rao, M. Ryoo, G. Salazar, P. Sanketi, K. Sayed, J. Singh, S. Sontakke, A. Stone, C. Tan, H. Tran, V. Vanhoucke, S. Vega, Q. Vuong, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich. Rt-1: Robotics transformer for real-world control at scale. In arXiv preprint arXiv:2212.06817, 2022.
  • Chi et al. [2023]  Chi 等人 [2023] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  • Xie et al. [2023]  Xie et al. [2023] A. Xie, L. Lee, T. Xiao, and C. Finn. Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
  • Octo Model Team et al. [2023]
    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023.
  • Walke et al. [2023]  Walke 等人 [2023] H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine. Bridgedata v2: A dataset for robot learning at scale, 2023.
  • Brohan et al. [2023]  Brohan 等人 [2023] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. Ryoo, G. Salazar, P. Sanketi, P. Sermanet, J. Singh, A. Singh, R. Soricut, H. Tran, V. Vanhoucke, Q. Vuong, A. Wahid, S. Welker, P. Wohlhart, J. Wu, F. Xia, T. Xiao, P. Xu, S. Xu, T. Yu, and B. Zitkovich. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818, 2023.
  • Radford et al. [2021]  Radford 等人 [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), volume 139, pages 8748–8763, 2021.
  • Zhai et al. [2023]  翟等人 [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In International Conference on Computer Vision (ICCV), 2023.
  • Touvron et al. [2023]  Touvron 等人 [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Khazatsky et al. [2024]  Khazatsky 等人 [2024] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. Lu, J. Mercat, A. Rehman, P. R. Sanketi, A. Sharma, C. Simpson, Q. Vuong, H. R. Walke, B. Wulfe, T. Xiao, J. H. Yang, A. Yavary, T. Z. Zhao, C. Agia, R. Baijal, M. G. Castro, D. Chen, Q. Chen, T. Chung, J. Drake, E. P. Foster, J. Gao, D. A. Herrera, M. Heo, K. Hsu, J. Hu, D. Jackson, C. Le, Y. Li, K. Lin, R. Lin, Z. Ma, A. Maddukuri, S. Mirchandani, D. Morton, T. Nguyen, A. O’Neill, R. Scalise, D. Seale, V. Son, S. Tian, E. Tran, A. E. Wang, Y. Wu, A. Xie, J. Yang, P. Yin, Y. Zhang, O. Bastani, G. Berseth, J. Bohg, K. Goldberg, A. Gupta, A. Gupta, D. Jayaraman, J. J. Lim, J. Malik, R. Martín-Martín, S. Ramamoorthy, D. Sadigh, S. Song, J. Wu, M. C. Yip, Y. Zhu, T. Kollar, S. Levine, and C. Finn. Droid: A large-scale in-the-wild robot manipulation dataset. 2024.
  • Nair et al. [2022]  Nair 等人 [2022] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. In CoRL, 2022.
  • Karamcheti et al. [2023]  Karamcheti 等人 [2023] S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language-driven representation learning for robotics. ArXiv, abs/2302.12766, 2023. URL https://api.semanticscholar.org/CorpusID:257205716.
  • Shridhar et al. [2022]  Shridhar 等人 [2022] M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022.
  • Stone et al. [2023]  Stone 等人 [2023] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, B. Zitkovich, F. Xia, C. Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023.
  • Driess et al. [2023]  Driess 等人 [2023] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  • et al. [2024]  等人 [2024] A. S. et al. Introducing rfm-1: Giving robots human-like reasoning capabilities, 2024. URL https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/.
  • Wayve [2024]  韦夫 [2024] Wayve. Lingo-2: Driving with natural language. 2024. URL https://wayve.ai/thinking/lingo-2-driving-with-language/.
  • Chen et al. [2022]  Chen et al. [2022] X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. M. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. V. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. Pali: A jointly-scaled multilingual language-image model. ArXiv, abs/2209.06794, 2022. URL https://api.semanticscholar.org/CorpusID:252222320.
  • Chen et al. [2023]  Chen et al. [2023] X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. M. Alabdulmohsin, P. Padlewski, D. M. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X.-Q. Zhai, and R. Soricut. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
  • Wolf et al. [2020]  Wolf 等人 [2020] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, and … Transformers: State-of-the-art natural language processing. In Proceedings of the 6th International Conference on Learning Representations, 2020. URL https://arxiv.org/abs/1910.03771.
  • Touvron et al. [2023]  Touvron 等人 [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Jiang et al. [2023]  江 et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Team et al. [2024]  Team et al. [2024] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  • Oquab et al. [2023]  Oquab 等人 [2023] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Hu et al. [2021]  胡 et al. [2021] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Dettmers et al. [2024]  Dettmers 等人 [2024] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Goyal et al. [2017]  Goyal 等人 [2017] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • Hudson and Manning [2019]
    D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Computer Vision and Pattern Recognition (CVPR), 2019.
  • Singh et al. [2019]  Singh 等人 [2019] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In Computer Vision and Pattern Recognition (CVPR), 2019.
  • Bigham et al. [2010]  Bigham 等人 [2010] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. VizWiz: nearly real-time answers to visual questions. In User Interface Software and Technology (UIST), pages 333–342, 2010.
  • Kazemzadeh et al. [2014]  Kazemzadeh 等人 [2014] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
  • Yu et al. [2016]  Yu et al. [2016] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), 2016.
  • Mesnard et al. [2024]  Mesnard 等人 [2024] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, P. G. Sessa, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, E. Ni, E. Noland, G. Yan, G. Tucker, G.-C. Muraru, G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Grishchenko, J. Austin, J. Keeling, J. Labanowski, J.-B. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, J. Chiu, J. Mao-Jones, K. Lee, K. Yu, K. Millican, L. L. Sjoesund, L. Lee, L. Dixon, M. Reid, M. Mikuła, M. Wirth, M. Sharman, N. Chinaev, N. Thain, O. Bachem, O. Chang, O. Wahltinez, P. Bailey, P. Michel, P. Yotov, R. Chaabouni, R. Comanescu, R. Jana, R. Anil, R. McIlroy, R. Liu, R. Mullins, S. L. Smith, S. Borgeaud, S. Girgin, S. Douglas, S. Pandya, S. Shakeri, S. De, T. Klimenko, T. Hennigan, V. Feinberg, W. Stokowiec, Y. hui Chen, Z. Ahmed, Z. Gong, T. Warkentin, L. Peran, M. Giang, C. Farabet, O. Vinyals, J. Dean, K. Kavukcuoglu, D. Hassabis, Z. Ghahramani, D. Eck, J. Barral, F. Pereira, E. Collins, A. Joulin, N. Fiedel, E. Senter, A. Andreev, and K. Kenealy. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  • Li et al. [2023]  Li et al. [2023] Y. Li, S. Bubeck, R. Eldan, A. D. Giorno, S. Gunasekar, and Y. T. Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
  • Bai et al. [2023]  Bai et al. [2023] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Li et al. [2022]  Li et al. [2022] J. Li, D. Li, C. Xiong, and S. C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), 2022.
  • Li et al. [2023]  Li et al. [2023] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), 2023.
  • Dai et al. [2023]  Dai 等人 [2023] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. A. Li, P. Fung, and S. C. H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  • Tan and Bansal [2019]  Tan 和 Bansal [2019] H. H. Tan and M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
  • Laurençon et al. [2023]  Laurençon 等人 [2023] H. Laurençon, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. M. Rush, D. Kiela, M. Cord, and V. Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks), 2023.
  • Liu et al. [2023a]  Liu et al. [2023a] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2023a.
  • Liu et al. [2023b]  Liu et al. [2023b] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
  • Karamcheti et al. [2024]  Karamcheti 等人 [2024] S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. arXiv preprint arXiv:2402.07865, 2024.
  • Kalashnikov et al. [2018]
    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
  • Kalashnkov et al. [2021]  Kalashnkov 等人 [2021] D. Kalashnkov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021.
  • Ebert et al. [2021]  Ebert 等人 [2021] F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
  • Ehsani et al. [2023]  Ehsani 等人 [2023] K. Ehsani, T. Gupta, R. Hendrix, J. Salvador, L. Weihs, K.-H. Zeng, K. P. Singh, Y. Kim, W. Han, A. Herrasti, et al. Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. arXiv preprint arXiv:2312.02976, 2023.
  • Bharadhwaj et al. [2023]  Bharadhwaj 等人 [2023] H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V. Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023.
  • Pinto and Gupta [2016]  平托和古普塔 [2016] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages 3406–3413. IEEE, 2016.
  • Mandlekar et al. [2018]  Mandlekar 等人 [2018] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
  • Gupta et al. [2018]  Gupta 等人 [2018] A. Gupta, A. Murali, D. P. Gandhi, and L. Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018.
  • Dasari et al. [2019]  Dasari 等人 [2019] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. CoRL, 2019.
  • Cabi et al. [2019]  Cabi 等人 [2019] S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y. Aytar, D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, and Z. Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. RSS, 2019.
  • Jang et al. [2022]  Jang 等人 [2022] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022.
  • Fang et al. [2023]  Fang et al. [2023] H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 3:5, 2023.
  • Devin et al. [2017]  Devin 等人 [2017] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In Proceedings of IEEE International Conference on Robotics and Automation, 2017.
  • Hu et al. [2022]  胡 et al. [2022] E. S. Hu, K. Huang, O. Rybkin, and D. Jayaraman. Know thyself: Transferable visual control policies through robot-awareness. In International Conference on Learning Representations, 2022.
  • Yang et al. [2023]  Yang 等人 [2023] J. H. Yang, D. Sadigh, and C. Finn. Polybot: Training one policy across robots while embracing variability. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=HEIRj51lcS.
  • Reed et al. [2022]  Reed 等人 [2022] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas. A generalist agent. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
  • Salhotra et al. [2023]  Salhotra 等人 [2023] G. Salhotra, I.-C. A. Liu, and G. Sukhatme. Bridging action space mismatch in learning from demonstrations. arXiv preprint arXiv:2304.03833, 2023.
  • Radosavovic et al. [2023]
    I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, 2023.
  • Shah et al. [2023]  Shah 等人 [2023] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.
  • Bousmalis et al. [2023]  Bousmalis 等人 [2023] K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju, et al. Robocat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023.
  • Shah et al. [2023]  Shah 等人 [2023] D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine. ViNT: A foundation model for visual navigation. In 7th Annual Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2306.14846.
  • Yang et al. [2024]  Yang 等人 [2024] J. Yang, C. Glossop, A. Bhorkar, D. Shah, Q. Vuong, C. Finn, D. Sadigh, and S. Levine. Pushing the limits of cross-embodiment learning for manipulation and navigation. arXiv preprint arXiv:2402.19432, 2024.
  • Gadre et al. [2023]  Gadre 等人 [2023] S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023.
  • Du et al. [2023]  Du et al. [2023] Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi. Vision-language models as success detectors. arXiv preprint arXiv:2303.07280, 2023.
  • Ma et al. [2023]  马 et al. [2023] Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, pages 23301–23320. PMLR, 2023.
  • Zhang et al. [2023]  Zhang et al. [2023] X. Zhang, Y. Ding, S. Amiri, H. Yang, A. Kaminski, C. Esselink, and S. Zhang. Grounding classical task planners via vision-language models. arXiv preprint arXiv:2304.08587, 2023.
  • Sontakke et al. [2024]  Sontakke 等人 [2024] S. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti. Roboclip: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36, 2024.
  • Huang et al. [2024]  Huang et al. [2024] J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
  • Li et al. [2023]  Li et al. [2023] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023.
  • Zhen et al. [2024]  Zhen et al. [2024] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan. 3d-vla: 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
  • [75] PyTorch. Automatic mixed precision. URL https://pytorch.org/docs/stable/amp.html.
  • Dao [2023]  道 [2023] T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  • Zhao et al. [2023]  Zhao et al. [2023] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  • [78] N. Dorka, C. Huang, T. Welschehold, and W. Burgard. What matters in employing vision language models for tokenizing actions in robot control? In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.
  • Zhai et al. [2023]  翟等人 [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  • Radford et al. [2021]  Radford 等人 [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Sharma et al. [2018]  Sharma 等人 [2018] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • Schuhmann et al. [2021]  Schuhmann 等人 [2021] C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Sidorov et al. [2020]  Sidorov 等人 [2020] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
  • Face [2024]  面部 [2024] H. Face. Introducing idefics: An open reproduction of state-of-the-art visual langage model. Hugging Face Blog, 2024.
  • Liu et al. [2024]  Liu et al. [2024] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • McKinzie et al. [2024]  McKinzie 等人 [2024] B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv preprint arXiv:2403.09611, 2024.
  • Lin et al. [2023]  Lin et al. [2023] J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
  • Dettmers et al. [2022]  Dettmers 等人 [2022] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
  • [89] NVIDIA. Tensorrt-llm. URL https://github.com/NVIDIA/TensorRT-LLM.
  • Zhao et al. [2023]  Zhao et al. [2023] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
  • Leviathan et al. [2023]  Leviathan 等人 [2023] Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
  • Brohan et al. [2022]  Brohan 等人 [2022] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
  • Rosete-Beas et al. [2022]
    E. Rosete-Beas, O. Mees, G. Kalweit, J. Boedecker, and W. Burgard. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
  • Mees et al. [2023]  Mees 等人 [2023] O. Mees, J. Borja-Diaz, and W. Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023.
  • Dass et al. [2023]  Dass 等人 [2023] S. Dass, J. Yapeter, J. Zhang, J. Zhang, K. Pertsch, S. Nikolaidis, and J. J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset.
  • Luo et al. [2023]  Luo et al. [2023] J. Luo, C. Xu, X. Geng, G. Feng, K. Fang, L. Tan, S. Schaal, and S. Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv preprint arXiv:2307.08927, 2023.
  • Mandlekar et al. [2018]  Mandlekar 等人 [2018] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. CoRR, abs/1811.02790, 2018. URL http://arxiv.org/abs/1811.02790.
  • Zhu et al. [2023]  Zhu et al. [2023] Y. Zhu, A. Joshi, P. Stone, and Y. Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors, 2023.
  • [99] L. Y. Chen, S. Adebola, and K. Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home.
  • Zhou et al. [2023]  周 et al. [2023] G. Zhou, V. Dean, M. K. Srirama, A. Rajeswaran, J. Pari, K. Hatch, A. Jain, T. Yu, P. Abbeel, L. Pinto, C. Finn, and A. Gupta. Train offline, test online: A real robot learning benchmark, 2023.
  • Lynch et al. [2023]  Lynch 等人 [2023] C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
  • Belkhale et al. [2023]  Belkhale 等人 [2023] S. Belkhale, Y. Cui, and D. Sadigh. Hydra: Hybrid robot actions for imitation learning. arxiv, 2023.
  • Zhu et al. [2022]  Zhu et al. [2022] Y. Zhu, P. Stone, and Y. Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022.
  • Cui et al. [2022]  Cui et al. [2022] Z. J. Cui, Y. Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022.
  • Heo et al. [2023]  Heo 等人 [2023] M. Heo, Y. Lee, D. Lee, and J. J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Science and Systems, 2023.
  • Yan et al. [2023]  Yan et al. [2023] G. Yan, K. Wu, and X. Wang. ucsd kitchens Dataset. August 2023.
  • Nasiriany et al. [2022]  Nasiriany 等人 [2022] S. Nasiriany, T. Gao, A. Mandlekar, and Y. Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022.
  • Liu et al. [2023]  Liu et al. [2023] H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y. Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023.
  • Quere et al. [2020]  Quere 等人 [2020] G. Quere, A. Hagengruber, M. Iskandar, S. Bustamante, D. Leidner, F. Stulp, and J. Vogel. Shared Control Templates for Assistive Robotics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), page 7, Paris, France, 2020.
  • Saxena et al. [2023]  Saxena 等人 [2023] S. Saxena, M. Sharma, and O. Kroemer. Multi-resolution sensing for real-time control with vision-language models. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=WuBv9-IGDUA.
    S. Saxena、M. Sharma 和 O. Kroemer。 多分辨率传感,用于使用视觉语言模型进行实时控制。 2023 年第 7 届机器人学习年会上。 URL https://openreview.net/forum?id=WuBv9-IGDUA
  • Shah et al. [2023]  Shah 等人 [2023] R. Shah, R. Martín-Martín, and Y. Zhu. MUTEX: Learning unified policies from multimodal task specifications. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=PwqiqaaEzJ.
    R. Shah、R. Martín-Martín 和 Y. Zhu。 MUTEX:从多模态任务规范中学习统一策略。 2023 年第 7 届机器人学习年会上。 URL https://openreview.net/forum?id=PwqiqaaEzJ
  • Zhu et al. [2023]  Zhu et al. [2023] X. Zhu, R. Tian, C. Xu, M. Ding, W. Zhan, and M. Tomizuka. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. 2023.
    X. Zhu, R. Tian, C. Xu, M. Ding, W. Zhan, 和 M. Tomizuka. Fanuc 操作:使用 fanuc mate 200id 机器人进行基于学习的操作的数据集。 2023 年。
  • Mendonca et al. [2023]  Mendonca 等人 [2023] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. CoRL, 2023.
    R. Mendonca、S. Bahl 和 D. Pathak。 来自人类视频的结构化世界模型。 CoRL,2023 年。
  • Luo et al. [2024]  Luo et al. [2024] J. Luo, C. Xu, F. Liu, L. Tan, Z. Lin, J. Wu, P. Abbeel, and S. Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. arXiv preprint arXiv:2401.08553, 2024.
    J. Luo, C. Xu, F. Liu, L. Tan, Z. Lin, J. Wu, P. Abbeel, 和 S. Levine. Fmb:用于可推广机器人学习的函数操作基准。 arXiv 预印本 arXiv:2401.08553,2024 年。
  • Shafiullah et al. [2023]  Shafiullah 等人 [2023] N. M. M. Shafiullah, A. Rai, H. Etukuru, Y. Liu, I. Misra, S. Chintala, and L. Pinto. On bringing robots home, 2023.
     NMM Shafiullah、A. Rai、H. Etukuru、Y. Liu、I. Misra、S. Chitala和 L. Pinto。 关于将机器人带回家,2023 年。
  • Liu et al. [2024]  Liu et al. [2024] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 2024.
    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, 和 P. Stone. Libero:终身机器人学习的基准知识转移。 神经信息处理系统进展, 36, 2024.
  • Sanh et al. [2019]  Sanh 等人 [2019] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
    V. Sanh、L. Debut、J. Chaumond 和 T. Wolf。 Distilbert 是 bert 的精炼版:更小、更快、更便宜、更轻。 arXiv 预印本 arXiv:1910.01108,2019 年。

Appendix A Data Mixture Details

We list the data mixture used for OpenVLA training in Table 3. The mixture largely follows [5], with a few additional datasets.

OpenVLA Training Dataset Mixture
Dataset | Mixture Weight
Fractal [92] | 12.7%
Kuka [45] | 12.7%
Bridge [47, 6] | 13.3%
Taco Play [93, 94] | 3.0%
Jaco Play [95] | 0.4%
Berkeley Cable Routing [96] | 0.2%
Roboturk [97] | 2.3%
Viola [98] | 0.9%
Berkeley Autolab UR5 [99] | 1.2%
Toto [100] | 2.0%
Language Table [101] | 4.4%
Stanford Hydra Dataset [102] | 4.4%
Austin Buds Dataset [103] | 0.2%
NYU Franka Play Dataset [104] | 0.8%
Furniture Bench Dataset [105] | 2.4%
UCSD Kitchen Dataset [106] | <0.1%
Austin Sailor Dataset [107] | 2.2%
Austin Sirius Dataset [108] | 1.7%
DLR EDAN Shared Control [109] | <0.1%
IAMLab CMU Pickup Insert [110] | 0.9%
UTAustin Mutex [111] | 2.2%
Berkeley Fanuc Manipulation [112] | 0.7%
CMU Stretch [113] | 0.2%
BC-Z [55] | 7.5%
FMB Dataset [114] | 7.1%
DobbE [115] | 1.4%
DROID [11] | 10.0%*
*We remove DROID for the last third of training due to slow learning progress (see Section 3.3) and re-distribute its mixture weight across all other datasets.
Table 3: OpenVLA training data mixture using datasets from the Open X-Embodiment dataset [1], following [5] with a few additions.
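The percentages above are sampling weights over datasets rather than raw dataset sizes. The snippet below is a minimal sketch of weight-proportional sampling across the mixture; the dictionary keys and the loader interface are illustrative assumptions, not the released codebase's API, and only a few of the Table 3 entries are shown.

    import numpy as np

    # Mixture weights from Table 3 (subset shown); weights are re-normalized
    # below, which also covers the case where DROID is dropped for the final
    # third of training and its weight is spread over the remaining datasets.
    MIXTURE_WEIGHTS = {
        "fractal": 0.127,
        "kuka": 0.127,
        "bridge": 0.133,
        "bc_z": 0.075,
        "droid": 0.100,
    }

    def sample_dataset_names(batch_size, rng=None):
        """Choose, for each example in a batch, which dataset to draw it from."""
        rng = rng or np.random.default_rng(0)
        names = list(MIXTURE_WEIGHTS)
        probs = np.array([MIXTURE_WEIGHTS[n] for n in names], dtype=np.float64)
        probs /= probs.sum()  # re-normalize to a valid distribution
        return rng.choice(names, size=batch_size, p=probs)

    # Example: sample_dataset_names(4) -> e.g. ['bridge', 'kuka', 'fractal', 'droid']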

Appendix B Evaluation Tasks and Detailed Results

In this section, we provide more details on the BridgeData V2 WidowX and Google robot evaluations discussed in Section 5.1, as well as the Franka-Tabletop and Franka-DROID fine-tuning evaluations discussed in Section 5.2.

B.1 BridgeData V2 WidowX Evaluation Details

Here we focus specifically on BridgeData V2 evaluations discussed in Section 5.1.

B.1.1 BridgeData V2 Evaluation Tasks

As described in Section 5.1, we evaluate each generalist robot manipulation policy on 17 tasks with 10 trials each. In this section, we provide details on the task categories and individual tasks.
5.1  所述,我们在 17 项任务上评估每个通才机器人操作策略,每项任务进行 10 次试验。在本节中,我们提供了有关任务类别和单个任务的详细信息。

Figure 6: BridgeData V2 WidowX robot evaluation tasks. We evaluate every generalist robot policy on 4 types of out-of-distribution (OOD) generalization tasks: visual, motion, physical, and semantic (as defined in Section 5.1). Every pair of images shows the start state and an example end state after the robot completes the task. We also rigorously assess language grounding in the 3 tasks shown in the bottom 3 rows, by changing the prompt while fixing the initial state and testing whether the policy can approach the correct target object.

In total, we evaluate on 5 visual generalization tasks, 2 motion generalization tasks, 3 physical generalization tasks, 4 semantic generalization tasks, and 3 language grounding tasks. Note that all tasks we evaluate on introduce some form of distribution shift since we are unable to procure the exact objects used in the original dataset (other distribution shifts naturally arise as we reproduce a real-world test environment originally constructed at a different location; see Section B.1.2 for a detailed discussion on such distribution shifts). All 17 tasks are depicted in Fig. 6. Each rollout is marked as a failure (0) or success (1). In some more difficult tasks, we record partial successes (0.5); we describe the conditions for partial credit in the task descriptions below.
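Since several tasks allow partial credit, the aggregate numbers reported later (the "Mean Success Rate ± StdErr" rows in Tables 4–6) can be computed from per-trial scores in {0, 0.5, 1}. A minimal sketch is shown below; we assume the reported ± values are standard errors of the mean over individual trial scores, which is not spelled out explicitly in this appendix.

    import math

    def mean_and_stderr(scores):
        """Mean success rate and standard error of the mean.

        `scores` holds one value per evaluation trial: 1 for success,
        0 for failure, and 0.5 for partial credit where applicable.
        """
        n = len(scores)
        mean = sum(scores) / n
        sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
        return mean, math.sqrt(sample_var / n)

    # Example: 17 BridgeData V2 tasks x 10 trials = 170 per-trial scores.
    # mean, se = mean_and_stderr(trial_scores)
    # print(f"{100 * mean:.1f}% +/- {100 * se:.1f}%")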

Below we describe each of the 17 tasks, in the order shown in Fig. 6:

  1. Put Eggplant into Pot (Easy Version): The robot’s goal is to pick up the eggplant and drop it into the pot. This is a visual generalization task because we use a handcrafted paper pot that has a different appearance than the pot used in the original BridgeData V2 training dataset (since we are unable to procure the original pot). Unlike all 16 other tasks, for this particular task we initialize the robot’s end-effector directly above the eggplant before rolling out the policy; hence, we call this the “Easy Version” of the “Put Eggplant into Pot” task.
  2. Put Eggplant into Pot: This is the same task as described above, except that the robot’s end-effector is not initialized directly above the eggplant. Instead, we initialize it in a position that is fixed across all rollouts, which means that the robot must horizontally reach for the eggplant first before manipulating it. (Note: The same applies to all other tasks described below.) This is a visual generalization task for the same reason as above.
  3. Put Cup from Counter into Sink: The robot’s goal is to pick up the pink cup from either the kitchen countertop or drying rack and place it into the sink on the right. This is a visual generalization task because we use a pink cup rather than a blue cup (a blue cup is used in the original BridgeData V2 dataset, but we find that none of the methods we evaluate is able to manipulate it reliably – most likely because the color of the cup blends in with the color of the sink).
  4. Put Eggplant into Pot (w/ Clutter): This is the same task as the “Put Eggplant into Pot” task, except that it is more difficult due to the presence of several distractor objects. It is a visual generalization task for the same reason discussed in the normal “Put Eggplant into Pot” task, and even more so given unseen distractors in the scene. Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.
  5. Put Yellow Corn on Pink Plate: The robot’s goal is to pick up the yellow corn and place it on the pink plate. This is a visual generalization task due to the presence of unseen distractor objects in the scene, such as a green dinosaur on the countertop in the back section of the sink. Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.
  6. Lift Eggplant: The robot’s goal is to grasp and lift the eggplant into the air. This is a motion generalization task because the eggplant is initialized in unseen positions and/or orientations, and the robot is forced to move beyond its training distribution of positions and/or orientations and often perform long-range reaching in order to complete the task. (Note: Long-range reaching is not demonstrated in this environment in the original BridgeData V2 demonstrations; see Section B.1.2 for details.) We find that this task, though seemingly simple, is surprisingly challenging for many policies. Partial credit (0.5 out of 1) is awarded when the robot makes contact with the eggplant.
  7. Put Carrot on Plate (w/ Height Change): The robot’s goal is to pick up the carrot and place it on the yellow plate. This is a motion generalization task because the plate is elevated from its usual position at the bottom of the sink, and the robot must adjust its trajectory to correctly place the carrot on the elevated platform (without knocking down the plate in the process). Partial credit (0.5 out of 1) is awarded when the robot grasps the carrot and touches the plate with it.
  8. Put Carrot on Plate: This is the same task as above, except that the plate is at its normal position (at the bottom of the sink or drying rack). We consider this a physical generalization task because the carrot has a different size and shape than the one used in the original BridgeData V2 dataset, which is shorter and narrower. (Note that the previous version of this task listed above would also technically be a physical generalization task since it involves the same carrot, but we list it under the “motion generalization” category since that is the focus there.)
  9. Flip Pot Upright: The robot’s goal is to manipulate the pot such that it is oriented upright in the sink at the end of the episode. This is a physical generalization task because this pot has a different size and shape than the one used in the original BridgeData V2 training demonstrations (the pot we use is wider and shorter).
  10. Lift AAA Battery: The robot’s goal is simply to grasp the AAA battery and lift it up into the air. This is considered a physical generalization task because the battery is much smaller and thinner than target objects seen in the BridgeData V2 training demonstrations in this environment; see Section B.1.2 for details. (Note that this target object does not exist in the original BridgeData V2 demonstrations in this environment, so this is also an instance of “semantic generalization”, but we classify it solely as “physical generalization” since that is the main focus here.)
  11. Move Skull into Drying Rack: The robot’s goal is to grasp the skull windup toy and drop it into the yellow drying rack in the left part of the sink. This is a semantic generalization task since the skull is an unseen target object (it does not appear in the BridgeData V2 training demonstrations).
  12. Lift White Tape: The robot’s goal is to grasp and lift the white roll of tape into the air. This is a semantic generalization task since the white tape roll is an unseen target object (it does not appear in the BridgeData V2 training demonstrations). (Note that this task may also be considered “physical generalization” because its shape differs from the objects seen in the training demonstrations in this environment; most policies struggle to grasp objects with this ring structure, and they often move the robot’s end-effector directly into the center region.)
  13. Take Purple Grapes out of Pot: The robot’s goal is to grasp the purple grapes lying inside the steel pot and remove them from the pot (by lifting them out and/or dropping them anywhere outside the pot). This is a semantic generalization task because it is an unseen language instruction; the robot has never seen this task in the original BridgeData V2 training dataset.
  14. Stack Blue Cup on Pink Cup: The robot’s goal is to grasp the blue cup and place it securely on top of the pink cup. This is a semantic generalization task because it is an unseen language instruction; the robot has never seen this task in this environment in the original BridgeData V2 training dataset. Partial credit (0.5 out of 1) is awarded when the robot grasps the blue cup and touches the pink cup with the blue cup.
  15. Put {Eggplant, Red Bottle} into Pot: This is a language grounding task. The robot’s goal is to put the specified target object into the pot. Both the eggplant and red bottle are present in the scene. We conduct paired evaluations: for the same initial state, we prompt the policy to target the eggplant in one episode, and then the red bottle in the next episode. We test each method 5 times with the eggplant and 5 times with the red bottle, using the same set of 5 initial states for both target objects. Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.
  16. Lift {Cheese, Red Chili Pepper}: This is a language grounding task. The robot’s goal is to grasp and lift the specified target object. We conduct paired evaluations as described in the task above. Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.
  17. Put {Blue Cup, Pink Cup} on Plate: This is a language grounding task. The robot’s goal is to grasp the specified target object and place it onto the plate. We conduct paired evaluations as described in the other language grounding tasks. Partial credit (0.5 out of 1) is awarded when the robot moves towards the correct target object.

B.1.2 Comparing Evaluation Tasks to Original BridgeData V2 Training Data

We conduct our evaluations in a sink environment used in the original BridgeData V2 dataset [6]. We reproduce the environment to match the original environment in the BridgeData V2 dataset with rough approximations for the robot’s location relative to the sink, as well as the camera’s placement relative to the scene. Given the lack of precise measurements of these positions in the original dataset, we are unable to reproduce the exact environment setup, and natural distribution shifts arise due to slightly different robot, sink, and camera placements. In addition, since we evaluate robot policies in a different location than where the training demonstrations were collected from, other natural distribution shifts arise. For example, the lighting conditions and background (e.g., visible areas behind the sink) are inevitably different than what was seen in the training dataset. Furthermore, we are unable to procure the exact set of objects used in the original BridgeData V2 dataset, so there are distribution shifts between the objects used at train time and those used at test time.

Despite all these challenges, we find that certain generalist policies, such as OpenVLA and RT-2-X, can still generalize and perform various tasks fairly reliably “out-of-the-box”. Other generalist policies, such as RT-1-X and Octo, can also complete some tasks, though they struggle when tested with more difficult generalization tasks in our BridgeData V2 evaluation suite.

The original BridgeData V2 dataset includes demonstrations of the following seven tasks in this specific sink environment: “Flip Pot Upright”, “Put Carrot on Plate”, “Put Cup from Counter (or Drying Rack) into Sink”, “Put Eggplant into Pot”, “Put Knife on Cutting Board”, “Put Spoon in Pot”, and “Turn Lever Vertical to Front”. See Fig. 7 for sample images of all these tasks from the original dataset. Note that all training demonstrations collected in this environment are initialized such that the robot’s end-effector is positioned directly above the target object at the beginning of the episode. (However, this is not the case across all environments in the BridgeData V2 dataset; in some other environments, the robot is initialized farther away from the target object, so it must horizontally reach for the object first before manipulating it.)

Figure 7: Original BridgeData V2 sink environment tasks. Images from sample demonstrations in the sink environment from the original BridgeData V2 dataset reveal that all demonstrations in this environment were initialized such that the robot’s end-effector was positioned immediately above the target object. Note that these initial states are different from the initial states we use in our BridgeData V2 evaluation tasks shown in Fig. 6. In our evaluations, we always initialize the robot’s end-effector to a fixed location above the sink, rather than positioning it directly above the target object (except for one task: “Put Eggplant into Pot (Easy Version)”).

In our BridgeData V2 evaluation suite, only one task – “Put Eggplant into Pot (Easy Version)” – is initialized with the robot’s end-effector hovering directly over the target object; in all 16 other tasks, the end-effector is initialized at a fixed location above the sink such that the robot must horizontally reach towards the object. This initial condition, in combination with the distribution shifts we introduce in the various types of OOD generalization in our evaluation suite, challenges the generalist policies and requires a high degree of robustness in order to complete the tasks successfully. Hence, the success rates for policies like RT-1-X and Octo are lower than what is reported in prior works. However, we find that other policies such as RT-2-X and OpenVLA still achieve relatively strong performance despite all these distribution shifts and challenges.

B.1.3 Detailed BridgeData V2 Evaluation Results

See Table 4 for the full BridgeData V2 WidowX evaluation results. The number of successes for each method, out of 10 trials, is listed for each of the 17 tasks. OpenVLA achieves the strongest performance in the majority of the tasks and has the highest aggregate success rate among the generalist policies. RT-2-X also shows good performance, outperforming RT-1-X and Octo, though it does not perform as well as OpenVLA. RT-1-X and Octo generally experience difficulty in these generalization tasks.

Table 4: Detailed BridgeData V2 WidowX evaluation results. We report performance on the full evaluation suite of 17 tasks (discussed in Section 5.1), including visual/motion/physical/semantic generalization tasks and language grounding tasks. Note that partial success (score of 0.5) is possible for some tasks; see Section B.1.1 for details. We find that OpenVLA performs best in most tasks and achieves the highest performance overall, followed by RT-2-X. On the other hand, RT-1-X and Octo struggle in the evaluations, only getting 0–2 successes in several tasks. See Fig. 6 for illustrations of all tasks.
Category | Task | # Trials | RT-1-X # Successes | Octo # Successes | RT-2-X # Successes | OpenVLA (ours) # Successes
Visual gen | Put Eggplant into Pot (Easy Version) | 10 | 1 | 5 | 7 | 10
Visual gen | Put Eggplant into Pot | 10 | 0 | 1 | 5 | 10
Visual gen | Put Cup from Counter into Sink | 10 | 1 | 1 | 0 | 7
Visual gen | Put Eggplant into Pot (w/ Clutter) | 10 | 1 | 3.5 | 6 | 7.5
Visual gen | Put Yellow Corn on Pink Plate | 10 | 1 | 4 | 8 | 9
Motion gen | Lift Eggplant | 10 | 3 | 0.5 | 6.5 | 7.5
Motion gen | Put Carrot on Plate (w/ Height Change) | 10 | 2 | 1 | 4.5 | 4.5
Physical gen | Put Carrot on Plate | 10 | 1 | 0 | 1 | 8
Physical gen | Flip Pot Upright | 10 | 2 | 6 | 5 | 8
Physical gen | Lift AAA Battery | 10 | 0 | 0 | 2 | 7
Semantic gen | Move Skull into Drying Rack | 10 | 1 | 0 | 5 | 5
Semantic gen | Lift White Tape | 10 | 3 | 0 | 0 | 1
Semantic gen | Take Purple Grapes out of Pot | 10 | 6 | 0 | 5 | 4
Semantic gen | Stack Blue Cup on Pink Cup | 10 | 0.5 | 0 | 5.5 | 4.5
Language grounding | Put {Eggplant, Red Bottle} into Pot | 10 | 2.5 | 4 | 8.5 | 7.5
Language grounding | Lift {Cheese, Red Chili Pepper} | 10 | 1.5 | 2.5 | 8.5 | 10
Language grounding | Put {Blue Cup, Pink Cup} on Plate | 10 | 5 | 5.5 | 8.5 | 9.5
Mean Success Rate | | | 18.5 ± 2.7% | 20.0 ± 2.6% | 50.6 ± 3.5% | 70.6 ± 3.2%

Additionally, in Table 5, we provide the full evaluation results for the quantized inference experiments that were summarized in Table 2. For these evaluations, we test policies on 8 representative BridgeData V2 tasks spanning all task categories in the full evaluation suite.

Table 5: Full quantized inference results. Here we present the detailed version of the results shown in Table 2.
Category | Task | # Trials | bfloat16 # Successes | int8 # Successes | int4 # Successes
Visual gen | Put Eggplant into Pot (Easy Version) | 10 | 9 | 7 | 9
Visual gen | Put Eggplant into Pot | 10 | 7 | 7 | 7
Visual gen | Put Cup from Counter into Sink | 10 | 5 | 3 | 7
Motion gen | Lift Eggplant | 10 | 6 | 4 | 7.5
Physical gen | Put Carrot on Plate | 10 | 6 | 5 | 7
Physical gen | Lift AAA Battery | 10 | 7 | 5 | 3
Semantic gen | Take Purple Grapes out of Pot | 10 | 8 | 8 | 9
Language grounding | Put {Eggplant, Red Bottle} into Pot | 10 | 9 | 7.5 | 8
Mean Success Rate | | | 71.3 ± 4.8% | 58.1 ± 5.1% | 71.9 ± 4.7%
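The bfloat16, int8, and int4 columns in Table 5 correspond to serving the same checkpoint at different weight precisions. A minimal sketch of how such quantized serving could be set up with Hugging Face transformers and bitsandbytes is shown below; the checkpoint ID and model class are assumptions for illustration and may differ from the released OpenVLA code.

    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

    MODEL_ID = "openvla/openvla-7b"  # hypothetical checkpoint name

    def load_policy(precision="bfloat16"):
        """Load the VLA at one of the precisions evaluated in Table 5."""
        if precision == "int8":
            quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
        elif precision == "int4":
            quant_cfg = BitsAndBytesConfig(
                load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
            )
        else:  # plain bfloat16 inference
            quant_cfg = None

        model = AutoModelForVision2Seq.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.bfloat16,
            quantization_config=quant_cfg,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        )
        processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
        return model, processor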

B.2 Google Robot Evaluation Details

In this section, we provide more details on the Google robot evaluations introduced in Section 5.1.

B.2.1 Google Robot Evaluation Tasks

Figure 8: Google robot evaluation tasks. We evaluate every generalist robot policy on in-distribution tasks and out-of-distribution (OOD) generalization tasks. OOD tasks involve unseen backgrounds, target objects, instructions/object relations, and semantic concepts (e.g., photos from the Internet that do not appear in robot action data).

On the Google robot, we evaluate each generalist robot policy on 12 tasks with 5 rollouts each, for a total of 60 rollouts. The first five tasks test on in-distribution conditions, and the last seven tasks test on more difficult out-of-distribution (OOD) conditions. All tasks are depicted in Fig. 8. Each rollout is marked as a failure (0) or success (1).

We describe the 12 tasks below:

  1. Pick Coke Can (in-distribution): The robot is positioned in front of a platform with a can of Coke on top of it. The robot’s goal is to grasp and lift the Coke can.
  2. Move Apple near Green Can (in-distribution): The robot is positioned in front of a platform with an apple and a green soda can on top of it. The robot’s goal is to grasp the apple and move it next to the green can.
  3. Move Blue Chip Bag near Apple (in-distribution): The robot is positioned in front of a platform with a blue bag of chips and an apple on top of it. The robot’s goal is to grasp the blue bag of chips and move it close to the apple.
  4. Place Coke Can Upright (in-distribution): The robot is positioned in front of a platform with a can of Coke on top of it, and the can is oriented horizontally on its side. The robot’s goal is to grasp the Coke can and orient it to be in a vertical position.
  5. Open Middle Drawer (in-distribution): The robot is positioned in front of a set of three drawers. The robot’s goal is to grasp the middle drawer handle and pull the drawer open.
  6. Move Orange near Brown Chip Bag (OOD): The robot is positioned in front of a platform with a brown bag of chips and an orange on top of it. A tablecloth with blue sky and white cloud patterns covers the platform underneath the objects. The robot’s goal is to grasp the orange and bring it next to the bag of chips. This task is OOD because the orange is an unseen object relative to the training dataset, and the tablecloth is an unseen background. (See the appendix of Brohan et al. [7] for a detailed list of OOD conditions in Google robot evaluations.)
  7. Pick Pepsi Can (OOD): The robot is positioned in front of a platform with a can of Pepsi on top of it. A tablecloth with bright yellow/brown patterns covers the platform underneath the can. The robot’s goal is to grasp and lift the can. This task is OOD because the Pepsi can is an unseen object, and the tablecloth is an unseen background.
  8. Pick Banana (OOD): The robot is positioned in front of a platform with an apple, a can of Coke, and a banana. The robot’s goal is to grasp and lift the banana. This task is OOD because the banana is an unseen target object.
  9. Pick Green Cup (OOD): The robot is positioned in front of a platform with a banana, a can of Pepsi, and a green cup. The robot’s goal is to grasp and lift the green cup. This task is OOD because all objects in the scene are unseen in the training data.
  10. Place Apple on Plate (OOD): The robot is positioned in front of a platform with a plate and an apple. The robot’s goal is to grasp the apple and move it onto the plate. This task is OOD because it is a novel instruction describing an unseen object relation: training demonstrations only cover moving the apple near the plate, rather than placing it on top of the plate.
  11. Place Banana in Pan (OOD): The robot is positioned in front of a platform with a pan and a banana. The robot’s goal is to grasp the banana and move it into the pan. This task is OOD because the banana is an unseen target object, and it is a novel instruction describing an unseen object relation, as explained in the previous task.
  12. Move Coke Can to Taylor Swift (OOD): The robot is positioned in front of a platform with a can of Coke and photos of three different celebrities, including Taylor Swift. The robot’s goal is to grasp the can and move it to the photo of Taylor Swift. This task is OOD because the photos of the celebrities are unseen in the robot interaction data.

B.2.2 Detailed Google Robot Evaluation Results

Table 6: Detailed Google robot evaluation results. We report full evaluation results for the Google robot evaluations discussed in Section 5.1. Each generalist policy is evaluated with 60 rollouts across 12 tasks, covering both in-distribution and out-of-distribution (OOD) testing conditions. In the bottom row, we report mean success rate ± StdErr for each policy. OpenVLA and RT-2-X both significantly outperform RT-1-X and Octo overall (we bold the mean success rate for both due to overlapping error bars). See Fig. 8 for illustrations of all tasks.
Category | Task | # Trials | RT-1-X # Successes | Octo # Successes | RT-2-X # Successes | OpenVLA (ours) # Successes
In-distribution | Pick Coke Can | 5 | 5 | 1 | 5 | 5
In-distribution | Move Apple near Green Can | 5 | 3 | 3 | 3 | 5
In-distribution | Move Blue Chip Bag near Apple | 5 | 0 | 3 | 4 | 5
In-distribution | Place Coke Can Upright | 5 | 0 | 0 | 4 | 4
In-distribution | Open Middle Drawer | 5 | 0 | 4 | 2 | 3
OOD | Move Orange near Brown Chip Bag | 5 | 1 | 2 | 5 | 5
OOD | Pick Pepsi Can | 5 | 3 | 0 | 5 | 4
OOD | Pick Banana | 5 | 5 | 3 | 5 | 5
OOD | Pick Green Cup | 5 | 1 | 0 | 5 | 5
OOD | Place Apple on Plate | 5 | 0 | 0 | 4 | 4
OOD | Place Banana in Pan | 5 | 0 | 0 | 2 | 4
OOD | Move Coke Can near Taylor Swift | 5 | 2 | 0 | 3 | 2
Mean Success Rate | | | 33.3 ± 6.1% | 26.7 ± 5.8% | 78.3 ± 5.4% | 85.0 ± 4.6%

Full results for the Google robot evaluations are shown in Table 6. Overall, we find that RT-1-X and Octo experience difficulty on the evaluation tasks; they are often unable to achieve a single success out of five trials in several tasks. On the other hand, RT-2-X and OpenVLA demonstrate strong performance, completing every task at least two times out of five trials; these two VLA policies perform comparably with each other on this particular evaluation suite.

B.3 Data-Efficient Adaptation Experiment Details

In this section, we provide more details on the data-efficient adaptation experiments discussed in Section 5.2, where we investigate the effectiveness of fine-tuned OpenVLA policies on new robot setups such as Franka-Tabletop and Franka-DROID.

B.3.1 Franka-Tabletop and Franka-DROID Tasks

We collect 10–150 demonstrations of each of seven tasks. The first six tasks correspond to a robot setup which we denote as “Franka-Tabletop” (Franka Emika Panda robot mounted on top of a table), and the final task corresponds to a robot setup which we call “Franka-DROID”.

In the Franka-Tabletop setup, the first three of six tasks correspond to single-instruction tasks and are narrow, while the last three tasks correspond to multi-instruction tasks in which multiple objects are present in the scene and the robot must manipulate the correct one depending on the language instruction.

Figure 9: Franka-Tabletop fine-tuning tasks. The Franka-Tabletop tasks used in the data-efficient adaptation experiments in Section 5.2 are depicted above. The first three of six tasks, shown in the top three rows, only involve a single instruction, while the last three tasks in the bottom three rows involve multiple objects and instructions (the instructions specify the target object or target location). The first column shows sample initial states matching the training data distribution, while the second column shows out-of-distribution (OOD) initial states (e.g., unseen backgrounds, target objects, distractors, and object positions/orientations). Every policy in Section 5.2 is evaluated with 10–12 rollouts on in-distribution tasks and 5–6 rollouts on OOD tasks.

Below we describe each of the six Franka-Tabletop tasks shown in Fig. 9:

  1. Put Carrot in Bowl (single-instruction): The robot’s goal is to grasp the carrot and place it into the bowl. We collect 50 demonstrations of this task for the training dataset, randomly placing the carrot and the bowl at different locations on the table in every episode. The carrot is always initialized on the left side of the bowl. During evaluation, each trial is recorded as a success (1) or failure (0); there is no partial credit.
  2. Pour Corn into Pot (single-instruction): The robot’s goal is to grasp the red bowl, move towards the steel pot, and pour the contents (a yellow corn) into the pot. We collect 50 demonstrations of this task for the training dataset, randomly placing the bowl and the pot at different locations on the table in every episode. The bowl is always initialized on the right side of the pot. During evaluation, each trial is recorded as a success (1) or failure (0); there is no partial credit.
  3. Flip Pot Upright (single-instruction): The robot’s goal is to grasp the steel pot (which is initially oriented vertically), rotate it to be in the upright position, and place it back onto the table. We collect only 10 demonstrations of this task for the training dataset, randomly placing the steel pot at various locations within a small section of the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial successes include grasping the pot but not orienting it upright, or knocking it over to the upright position but not carefully guiding it. The robot must release the pot at the end of the episode for full credit.
  4. Move <object> onto Plate (multi-instruction): The robot’s goal is to grasp one out of three objects (depending on the target specified in the language instruction) and place it on the plate on the right side of the table. We collect 150 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table and selecting one as the target. The plate is always initialized on the right side of the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot makes contact with is the correct target object (i.e., the object specified in the language instruction), but the robot does not complete the task.
  5. Knock <object> Over (multi-instruction): The robot’s goal is to approach one out of three objects (depending on the target specified in the language instruction) and push it until it falls over. We collect 70 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table and selecting one as the target. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot makes contact with is the correct target object (i.e., the object specified in the language instruction), but the robot does not complete the task.
  6. Cover <object> with Towel (multi-instruction): The robot’s goal is to grasp the blue towel and place it on one out of three objects (depending on the target specified in the language instruction). We collect 45 demonstrations of this task for the training dataset, randomly placing different combinations of three objects on the table. During evaluation, each trial is recorded as a success (1), failure (0), or partial success (0.5). Partial success is recorded when the first object that the robot touches with the towel is the correct target object (i.e., the object specified in the language instruction), but the robot does not complete the task (e.g., it drops the towel onto the table instead of on top of the target object). Full credit is given when any part of the towel is resting over the top surface of the target object, i.e., the object does not need to be fully covered.

For every Franka-Tabletop task, we evaluate each method with 10–12 in-distribution trials and 5–6 OOD generalization trials. The in-distribution and OOD test conditions are depicted in Fig. 9 (second column).

We describe the OOD test conditions for each of the six tasks below:

  1. Put Carrot in Bowl (OOD): An eggplant (unseen object) replaces the carrot.
  2. Pour Corn into Pot (OOD): An unseen brown tablecloth covers the tabletop.
  3. Flip Pot Upright (OOD): An unseen white tablecloth covers the tabletop.
  4. Move <object> onto Plate (OOD): A set of three unseen objects is placed on the table.
  5. Knock <object> Over (OOD): Two unseen distractor objects (a red plastic cup and a brown box) are positioned behind the set of three seen objects.
  6. Cover <object> with Towel (OOD): The three objects on the table are placed upside-down and at unseen positions.

Finally, in the Franka-DROID environment, we experiment with a single task and its variants: Wipe Table (see Fig. 10). In this task, the robot’s goal is to grab the brush and sweep all three small brown objects into the dustpan. We collect 70 demonstrations of this task for the training dataset, varying the positions of all the objects.

Figure 10: Franka-DROID fine-tuning task. The “Wipe Table” task shown here is the final task used in the data-efficient adaptation experiments in Section 5.2. The left image shows the initial conditions for an in-distribution trial. The right image shows an out-of-distribution trial in which unseen distractor objects are present on the table. To fully complete the task, the robot must grab the brush and sweep all three objects into the dustpan.

At test time, we evaluate on in-distribution conditions matching the training data (Fig. 10, left), as well as out-of-distribution (OOD) conditions in which distractor objects are also present in the scene on the table (Fig. 10, right). Since there are various possible outcomes for each trial, we define a scoring rubric as follows: The maximum score for each trial is 2 points. The policy receives the full 2 points if the robot sweeps all three objects into the dustpan. It receives 1 point for successfully sweeping one or two objects into the dustpan. Otherwise, it receives 0 points. We evaluate each policy with 18 in-distribution trials and 12 OOD trials, so each policy receives an aggregate score out of 60 points.
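For reference, the rubric above can be written down directly; the sketch below computes a policy's aggregate score from the number of objects swept into the dustpan in each rollout (the per-trial inputs are hypothetical).

    def wipe_table_score(num_objects_swept):
        """Score a single 'Wipe Table' rollout (maximum of 2 points)."""
        if num_objects_swept >= 3:
            return 2  # all three objects swept into the dustpan
        if num_objects_swept >= 1:
            return 1  # one or two objects swept in
        return 0

    def aggregate_score(in_dist_counts, ood_counts):
        """Sum scores over 18 in-distribution and 12 OOD trials (max 60 points)."""
        return sum(wipe_table_score(n) for n in list(in_dist_counts) + list(ood_counts))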

B.3.2 Detailed Franka-Tabletop and Franka-DROID Evaluation Results

Full results for both the Franka-Tabletop and Franka-DROID evaluations are shown in Table 7. We evaluate the methods discussed in Section 5.2. We find that Diffusion Policy demonstrates strong performance on the single-instruction Franka-Tabletop tasks (e.g., “Put Carrot in Bowl” and “Pour Corn into Pot”), outperforming other methods. However, OpenVLA and Octo achieve higher performance in the more diverse multi-instruction tasks (“Move <object> onto Plate”, “Knock <object> Over”, and “Cover <object> with Towel”). In the Franka-DROID environment, OpenVLA obtains the best results. Overall, we find that OpenVLA achieves the highest average performance across both evaluation setups.

Table 7: Detailed data-efficient adaptation experiment results. Here we present the full breakdown of results summarized in Fig. 4. We report the performance of Diffusion Policy trained from scratch on new robot tasks, as well as generalist policies fine-tuned on the same data. Each policy is tested against both in-distribution and out-of-distribution (OOD) generalization conditions (see Fig. 9 for Franka-Tabletop tasks and Fig. 10 for Franka-DROID tasks). We find that no single policy performs best on all tasks: Diffusion Policy achieves high success rates on single-instruction tasks, while OpenVLA and Octo perform well on diverse multi-instruction tasks. In terms of aggregate performance, however, OpenVLA obtains the highest average success rate across both environments.
Task | # Trials | Diffusion Policy | Diffusion Policy (matched) | Octo | OpenVLA (scratch) | OpenVLA (ours)
Franka-Tabletop (5Hz)
“Put Carrot in Bowl” (in-distribution) | 10 | 90.0% | 80.0% | 40.0% | 70.0% | 70.0%
“Put Carrot in Bowl” (OOD) | 5 | 20.0% | 0.0% | 20.0% | 0.0% | 40.0%
“Pour Corn into Pot” (in-distribution) | 10 | 100.0% | 90.0% | 0.0% | 10.0% | 50.0%
“Pour Corn into Pot” (OOD) | 5 | 80.0% | 60.0% | 0.0% | 20.0% | 60.0%
“Flip Pot Upright” (in-distribution) | 10 | 100.0% | 85.0% | 40.0% | 85.0% | 100.0%
“Flip Pot Upright” (OOD) | 5 | 50.0% | 20.0% | 0.0% | 40.0% | 80.0%
“Move <object> onto Plate” (in-distribution) | 12 | 25.0% | 25.0% | 41.7% | 8.3% | 75.0%
“Move <object> onto Plate” (OOD) | 6 | 8.3% | 33.3% | 8.3% | 33.3% | 58.3%
“Knock <object> Over” (in-distribution) | 12 | 33.3% | 25.0% | 83.3% | 75.0% | 75.0%
“Knock <object> Over” (OOD) | 6 | 16.7% | 16.7% | 33.3% | 58.3% | 83.3%
“Cover <object> with Towel” (in-distribution) | 12 | 16.7% | 20.8% | 91.7% | 41.7% | 50.0%
“Cover <object> with Towel” (OOD) | 6 | 16.7% | 33.3% | 91.7% | 50.0% | 50.0%
Average | | 48.5 ± 4.9% | 43.4 ± 4.7% | 43.4 ± 4.4% | 43.4 ± 4.6% | 67.2 ± 4.0%
Franka-DROID (15Hz)
“Wipe Table” (in-distribution) | 18 | 50.0% | 27.8% | 52.8% | 25.0% | 55.6%
“Wipe Table” + Distractors (OOD) | 12 | 12.5% | 25.0% | 16.7% | 16.7% | 62.5%
Average | | 35.0 ± 8.0% | 26.7 ± 7.5% | 38.3 ± 8.5% | 21.7 ± 6.6% | 58.3 ± 7.2%

Additionally, in Table 8, we show the detailed version of the parameter-efficient fine-tuning experiment results summarized in Table 1. In these experiments, we use a representative subset of two Franka-Tabletop tasks, with both in-distribution and OOD variants: one narrow single-instruction task (“Put Carrot in Bowl”) and one diverse multi-instruction task (“Move <object> onto Plate”). We use the same number of training demonstrations used in Section 5.2 (50 and 150, respectively), which is delineated in Section B.3.1.

Table 8: Detailed parameter-efficient fine-tuning experiment results. Here we present the detailed task performance results summarized in Table 1.
Task | # Trials | Full FT | Last layer only | Frozen vision | Sandwich | LoRA, r=32 | LoRA, r=64
Franka-Tabletop (5Hz)
“Put Carrot in Bowl” (in-distribution) | 10 | 90.0 | 40.0 | 40.0 | 90.0 | 60.0 | 90.0
“Put Carrot in Bowl” (OOD) | 5 | 40.0 | 0.0 | 40.0 | 0.0 | 60.0 | 40.0
“Move <object> onto Plate” (in-distribution) | 12 | 79.2 | 33.3 | 50.0 | 75.0 | 75.0 | 62.5
“Move <object> onto Plate” (OOD) | 6 | 41.7 | 33.3 | 58.3 | 41.7 | 75.0 | 66.7
Average | | 69.7 ± 7.2% | 30.3 ± 6.1% | 47.0 ± 6.9% | 62.1 ± 7.9% | 68.2 ± 7.5% | 68.2 ± 7.8%
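Table 8's LoRA columns (r=32 and r=64) correspond to attaching low-rank adapters to the backbone instead of updating all of its weights. The sketch below shows one way this could be configured with the peft library; the checkpoint ID, target module names, and all hyperparameters other than the rank are assumptions for illustration and may differ from the released fine-tuning code.

    import torch
    from transformers import AutoModelForVision2Seq
    from peft import LoraConfig, get_peft_model

    # Hypothetical checkpoint name; the released checkpoint may differ.
    base_model = AutoModelForVision2Seq.from_pretrained(
        "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
    )

    lora_cfg = LoraConfig(
        r=32,                 # Table 8 also evaluates r=64
        lora_alpha=16,        # assumed scaling factor
        lora_dropout=0.0,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    )

    policy = get_peft_model(base_model, lora_cfg)
    policy.print_trainable_parameters()  # only the low-rank adapter weights train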

Appendix C RT-2-X vs. OpenVLA in BridgeData V2 Evaluations

In this section, we provide additional details on RT-2-X vs. OpenVLA comparisons in BridgeData V2 evaluations discussed in Section 5.1. As discussed previously, OpenVLA is pretrained on a larger subset of OpenX data than RT-2-X and uses a fused SigLIP-DinoV2 vision backbone rather than a single visual encoder. However, in addition to these factors, we believe that OpenVLA’s significant improvement upon RT-2-X specifically in BridgeData V2 evaluations (as shown in Fig. 2) also stems from more careful preprocessing of the Bridge dataset.

During the development of the OpenVLA model, we discovered that the original version of the BridgeData V2 dataset contained many transitions with all-zero (no-op) actions. For instance, in every demonstration, an all-zero action was recorded as the ground-truth action in the first timestep. Consequently, training a highly expressive VLA model on the original dataset without any data preprocessing led to a policy that frequently predicted all-zero actions and froze during evaluations. Therefore, we simply filtered out the first transition in every demonstration when training the OpenVLA model, and this was sufficient for mitigating the freezing behavior in most cases.
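A minimal sketch of this preprocessing step is shown below; the episode schema (a list of steps, each with an "action" array) is an assumption for illustration and does not reflect the exact data-loader interface.

    import numpy as np

    def drop_noop_transitions(episode, drop_first_step=True):
        """Remove transitions whose ground-truth action is all zeros.

        In BridgeData V2, the first timestep of every demonstration records an
        all-zero (no-op) action, so dropping the first step already removes the
        bulk of the problematic transitions.
        """
        steps = episode[1:] if drop_first_step else episode
        return [s for s in steps if np.any(np.asarray(s["action"]) != 0)]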

However, the RT-2-X model was trained without such data preprocessing, so it often suffers the aforementioned freezing behavior if deployed out of the box without modifying the model querying procedure – which severely deteriorates rollout performance. Since this is a proprietary model that is infeasible for us to re-train (e.g., with our preprocessed version of the BridgeData V2 dataset), we mitigated this issue by simply querying the second-most-likely action from the model, since the first-most-likely action was often all zeros while the second-most-likely action was not. (Note that this is the same workaround that was applied by the developers of the RT-2-X model for BridgeData V2 evaluations reported in the Open X-Embodiment experiments [1].) This workaround led to much stronger RT-2-X performance on BridgeData V2 evaluations – though we believe that it is still suboptimal compared to re-training the model on the preprocessed version of the dataset.
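Conceptually, the workaround amounts to taking the runner-up token at each action dimension instead of the argmax. The sketch below illustrates the idea; the logits shape and decoding interface are assumptions, since the RT-2-X model and its querying code are not public.

    import torch

    def second_most_likely_tokens(action_logits):
        """Return the second-highest-probability token per action dimension.

        `action_logits` is assumed to have shape (action_dims, vocab_size),
        i.e. one categorical distribution per discretized action dimension.
        """
        top2 = torch.topk(action_logits, k=2, dim=-1).indices  # (action_dims, 2)
        return top2[:, 1]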

We also tried to dynamically query RT-2-X, i.e., by first sampling the most likely action and then sampling the second-most-likely action only if the first one was all zeros. However, we empirically found that dynamic querying led to worse performance than simply querying the second-most-likely action at all times. We hypothesize that this is due to a change in the robot’s dynamics that arises from dynamic querying: pausing in the middle of a trajectory to re-query the model leads to slight interruptions in the robot’s movement due to non-negligible latency in the querying pipeline, and this leads to subtle performance degradation. Therefore, we report the performance of RT-2-X when always querying the second-most-likely action, as done in the Open X-Embodiment project [1].

Appendix D Additional Experiments and Ablations

In this section, we conduct several additional experiments to analyze the effects of individual components of the OpenVLA model architecture and training scheme, as well as provide quantitative evidence for claims made in earlier sections of this work. We aim to answer the following questions:

  1. How important is OpenX training and how does it impact OpenVLA’s performance (Section D.1)?
  2. What effect does using a fused SigLIP-DinoV2 vision encoder have on OpenVLA’s performance, compared to using a SigLIP-only vision encoder (Section D.2)?
  3. Is it better to fine-tune or freeze the vision encoder in OpenVLA (Section D.3)?
  4. How do the quantized inference results discussed in Section 5.3 change when policy performance is disentangled from model inference speed (Section D.4)?

We discuss the experimental setup and results addressing each of the above questions sequentially in the following sections.

D.1 OpenX Training Data Ablation Experiments

As discussed in Section 3.3, OpenVLA is trained on a large dataset of robot embodiments, scenes, and tasks from the Open X-Embodiment dataset [1] (OpenX). In this section, we ablate the OpenX mixture and train a VLA policy solely on one robot dataset, to assess the impact of OpenX training on policy performance. Note that we have already observed the negative effect of ablating OpenX training in the fine-tuning regime, as discussed in Section 5.2 (see OpenVLA (Scratch)), but we discuss additional experiments on another robot embodiment in this section to provide more supporting evidence.
3.3  所述,OpenVLA 是在 Open X-Embodiment 数据集 [1] (OpenX) 中的机器人实施例、场景和任务的大型数据集上进行训练的。在本节中,我们消融 OpenX 混合物并仅在一个机器人数据集上训练 VLA 策略,以评估 OpenX 训练对策略性能的影响。请注意,我们已经观察到在微调机制中消融 OpenX 训练的负面影响,如5.2  所述(参见 OpenVLA (Scratch)),但我们在本节中讨论了另一个机器人实施例的额外实验,以提供更多的支持证据。

Experimental setup and tasks. We compare the original OpenVLA model with OpenVLA-Bridge, which is produced by taking the same pretrained VLM as OpenVLA (Prismatic VLM [44]) and fine-tuning it solely on BridgeData V2 [6] rather than the entire OpenX training mixture discussed in Appendix A. We evaluate OpenVLA and OpenVLA-Bridge on a subset of 8 representative tasks from the BridgeData V2 WidowX robot evaluation suite discussed in Section B.1.1. The tasks are listed in Table 9.
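As a rough illustration, the two models differ only in the training-data mixture they are fine-tuned on; the dataset identifiers and sampling weights below are placeholders for illustration, not the exact mixture weights from Appendix A.

    # Illustrative data-mixture specifications; names and weights are placeholders.
    OPENX_MIXTURE = {        # OpenVLA: many OpenX datasets with per-dataset sampling weights
        "bridge": 1.0,
        "rt1_fractal": 1.0,
        "taco_play": 2.0,
        # ... remaining OpenX datasets and weights ...
    }
    BRIDGE_ONLY = {          # OpenVLA-Bridge: single-dataset training on BridgeData V2
        "bridge": 1.0,
    }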

Results. Results for the OpenX training mixture ablation are shown in Table 9. Comparing OpenVLA with OpenVLA-Bridge, we see that performance drops drastically (roughly 30 percentage points in absolute success rate, from 76.3% to 45.6%), which demonstrates the importance of OpenX pretraining for final policy performance. Although language grounding performance is not affected, we observe reduced performance across all generalization categories. This result suggests that the large diversity of scenes, objects, and tasks in the OpenX training mixture is essential for unlocking the improved generalization capabilities of the OpenVLA model.

Table 9: BridgeData V2 WidowX ablation experiment results. We evaluate various methods on a subset of 8 representative tasks to assess the importance of different components of the OpenVLA model architecture and training scheme. OpenVLA-Bridge is a version of OpenVLA without OpenX training (it is trained only on BridgeData V2), and OpenVLA-Bridge-SigLIP additionally ablates the fused vision backbone by removing the DinoV2 encoder (its vision backbone only consists of the SigLIP encoder). We observe that both OpenX training and the fused vision encoder improve policy performance, though the former has a much greater effect than the latter.
Category | Task | # Trials | OpenVLA # Successes | OpenVLA-Bridge # Successes | OpenVLA-Bridge-SigLIP # Successes
Visual gen | Put Eggplant into Pot (Easy Version) | 10 | 10 | 8 | 8
Visual gen | Put Eggplant into Pot | 10 | 10 | 2 | 3
Visual gen | Put Cup from Counter into Sink | 10 | 7 | 4 | 2
Motion gen | Lift Eggplant | 10 | 7.5 | 5.5 | 6.5
Physical gen | Put Carrot on Plate | 10 | 8 | 4 | 1
Physical gen | Lift AAA Battery | 10 | 7 | 2 | 2
Semantic gen | Take Purple Grapes out of Pot | 10 | 4 | 3 | 3
Language grounding | Put {Eggplant, Red Bottle} into Pot | 10 | 7.5 | 8 | 7
Mean Success Rate | | | 76.3 ± 4.8% | 45.6 ± 5.6% | 40.6 ± 5.5%
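The ± values in the bottom row appear consistent with the standard error of a binomial proportion computed over all 80 trials (8 tasks with 10 trials each, with partial-credit scores such as 7.5 summed as-is); a minimal sketch of that calculation, under this assumption:

    import math

    def mean_and_stderr(per_task_successes, trials_per_task=10):
        # per_task_successes: success counts per task (fractional credit allowed).
        n = trials_per_task * len(per_task_successes)
        p = sum(per_task_successes) / n
        se = math.sqrt(p * (1.0 - p) / n)   # binomial standard error
        return 100.0 * p, 100.0 * se

    # OpenVLA column of Table 9: 61/80 successful trials -> approximately 76.3% +/- 4.8%
    print(mean_and_stderr([10, 10, 7, 7.5, 8, 7, 4, 7.5]))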

D.2 Dual vs. Single Vision Encoder Experiments

The OpenVLA model architecture consists of a fused vision backbone that combines the SigLIP [9] and DinoV2 [25] encoders. In this section, we ablate the DinoV2 component to assess the importance of using a dual vision encoder.
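As a rough sketch of what the fused backbone computes, one plausible implementation concatenates per-patch features from the two encoders along the channel dimension before projecting them into the language model’s embedding space; the module below is illustrative (e.g., the single linear layer stands in for the actual projection network), not the exact OpenVLA implementation.

    import torch
    import torch.nn as nn

    class FusedVisionBackbone(nn.Module):
        def __init__(self, siglip_encoder, dino_encoder, siglip_dim, dino_dim, llm_dim):
            super().__init__()
            self.siglip, self.dino = siglip_encoder, dino_encoder
            self.projector = nn.Linear(siglip_dim + dino_dim, llm_dim)

        def forward(self, pixel_values):
            # Each encoder is assumed to return patch features of shape
            # (batch, num_patches, dim) over the same patch grid.
            fused = torch.cat([self.siglip(pixel_values), self.dino(pixel_values)], dim=-1)
            return self.projector(fused)   # (batch, num_patches, llm_dim)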

Experimental setup and tasks. We instantiate a model, OpenVLA-Bridge-SigLIP, which is a version of OpenVLA that is trained only on BridgeData V2 and consists of only the SigLIP encoder as the vision backbone. We compare this model with the OpenVLA-Bridge model discussed in the previous section (Section D.1), which shares the same model architecture as the original OpenVLA model and is only trained on Bridge robot data. Therefore, the only difference between OpenVLA-Bridge-SigLIP and OpenVLA-Bridge is that the former omits the DinoV2 encoder in the vision backbone. We evaluate these models on the same subset of 8 Bridge tasks described in the previous section.

Results. Results for the dual vision encoder ablation are shown in Table 9. The drop in performance from OpenVLA-Bridge to OpenVLA-Bridge-SigLIP implies that additionally including the DinoV2 encoder in the vision backbone improves policy performance. However, the 5-percentage-point reduction here is not as significant as the 30-point drop observed from ablating OpenX training. The low-level spatial features captured by DinoV2 appear to aid generalization in only some cases.

D.3 Fine-Tuned vs. Frozen Vision Encoder Experiments

As discussed in Section 3.4, prior work on VLMs observed higher performance from freezing the vision encoder than fine-tuning its parameters [44]. However, when training OpenVLA, we fine-tuned all 7B parameters in the model, including the SigLIP-DinoV2 vision backbone, as we discovered early on during development that fine-tuning the vision encoder led to higher-performing VLAs — a finding which held across various pretrained VLMs and model architectures. We discuss details of such findings below.
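In PyTorch terms, the two regimes differ only in whether the vision backbone’s parameters receive gradients; a minimal sketch, assuming the VLA exposes its vision encoder as a vision_backbone submodule (the attribute name is illustrative):

    import torch.nn as nn

    def set_vision_backbone_trainable(vla: nn.Module, trainable: bool) -> None:
        # Freeze or unfreeze only the vision encoder; the projector and the
        # language model remain trainable in both regimes compared in Table 10.
        for param in vla.vision_backbone.parameters():
            param.requires_grad = trainable

    # set_vision_backbone_trainable(vla, trainable=False)  # "Frozen Vision"
    # set_vision_backbone_trainable(vla, trainable=True)   # "Fine-Tuned" (OpenVLA default)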
第3.4 所述,之前对自动立体货柜的研究观察到,冻结视觉编码器的性能高于微调其参数[44]。然而,在训练 OpenVLA 时,我们对模型中的所有 7B 参数进行了微调,包括 SigLIP-DinoV2 视觉主干,因为我们在开发的早期就发现,微调视觉编码器可以带来更高性能的 VLA——这一发现适用于各种预训练的 VLM 和模型架构。我们将在下面讨论此类调查结果的详细信息。

Experimental setup and tasks. In this section, we report the performance of two VLA policies produced by fine-tuning two different pretrained models from the Prismatic VLMs [44] repository on BridgeData V2. The two pretrained models are named SigLIP ViT-SO 224px and LLaVa v1.5 7B (Reproduction); see Karamcheti et al. [44] for details on their architectures and training mixtures. We evaluate both policies on various Bridge tasks shown in Table 10. Note that the evaluation configurations here differ from previously discussed Bridge evaluations, so the results are not directly comparable to results in other similar experiments.

Results. Results for the fine-tuned vs. frozen vision encoder experiments are shown in Table 10. We find that for both VLAs tested, fine-tuning the vision encoder leads to significantly higher success rates across various tasks. Qualitatively, in some cases, deploying the frozen vision encoder policies leads to unstable robot behaviors that are clearly suboptimal. Consequently, we decided early on during development to not conduct further experimentation with frozen vision encoders.

Table 10: Fine-tuned vs. frozen vision encoder experiment results. We evaluate the performance of fine-tuning (“Fine-Tuned”) vs. freezing the vision encoder (“Frozen Vision”) in two VLA policies built on top of two different pretrained VLMs from the Prismatic VLMs [44] repository. BridgeData V2 WidowX tasks shown here are performed in the same sink environment used for other Bridge experiments in this work (however, the initial environment configurations here differ, as these evaluations were conducted at an earlier stage in the project). We find that fine-tuning the vision encoder is crucial to obtain good policy performance. Certain frozen vision encoder evaluations were discontinued due to very poor (near-zero) performance and unstable robot behaviors. Among the evaluations where both frozen vision and fine-tuned approaches are tested, fine-tuning the vision encoder leads to 80.0% average success versus 46.7% average success from leaving it frozen.
Task | # Trials | SigLIP ViT-SO 224px: Frozen Vision # Successes | SigLIP ViT-SO 224px: Fine-Tuned # Successes | LLaVa v1.5 7B (Reproduction): Frozen Vision # Successes | LLaVa v1.5 7B (Reproduction): Fine-Tuned # Successes
Put Eggplant into Pot | 10 | 7 | 10 | 5 | 9
Put Corn on Plate | 10 | 10 | 9 | 0 | 9
Mean Success Rate | | 85 | 95 | 25 | 90
Put {Eggplant, Red Bottle} into Pot | 4 | 2 | 4 | – | 3
Put {Blue Cup, Pink Cup} on Plate | 4 | 0 | 0 | – | 0
Lift {Cheese, Red Chili Pepper} | 4 | 0 | 3 | – | 2
Put {Strawberry, Lime} into Pot | 4 | 1 | 0 | – | 3
Move {Sushi, Grapes} | 4 | 3 | 4 | – | 3
Mean Success Rate | | 30 | 55 | – | 55
(– indicates a discontinued evaluation; see the caption above. Mean success rates are percentages.)

D.4 Additional Quantized Inference Experiments: Disentangling Policy Performance and Model Inference Speed

In Section 5.3, we evaluated OpenVLA with different levels of precision at inference time: half precision (bfloat16), 8-bit quantization, and 4-bit quantization. 8-bit quantization led to lower BridgeData V2 performance relative to the other two approaches, and we hypothesized that the reduction in performance was caused by lower model inference speed from the operations used in 8-bit quantization. In this section, we conduct experiments to assess the veracity of this claim.
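As a rough sketch, the three precision levels can be realized at load time via Hugging Face transformers and bitsandbytes; the model identifier below is a placeholder, and this is one possible serving recipe rather than necessarily the exact configuration used in our experiments.

    import torch
    from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

    MODEL_ID = "path/to/openvla-checkpoint"  # placeholder identifier

    # Half precision (bfloat16)
    model_bf16 = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True)

    # 8-bit quantization
    model_int8 = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID, quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        trust_remote_code=True)

    # 4-bit quantization
    model_int4 = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID, quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        trust_remote_code=True)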
5.3  中,我们评估了 OpenVLA 在推理时具有不同的精度水平:半精度 (bfloat16)、8 位量化和 4 位量化。相对于其他两种方法,8 位量化导致 BridgeData V2 性能较低,我们假设性能下降是由于 8 位量化中使用的操作的模型推理速度较低造成的。在本节中,我们进行了实验来评估这一说法的真实性。

Specifically, we evaluate OpenVLA again at the three precision levels listed above, but now with blocking control: each action is fully executed on the robot before the next one is predicted by the policy and executed by the controller. This scheme equalizes the system dynamics across methods with different inference latencies and thus allows us to test the quality of a policy’s action predictions independent of its prediction speed. Effectively, the precision levels with higher throughput, bfloat16 and 4-bit quantization, are forced to run at the slower pace observed when deploying OpenVLA with 8-bit precision. Therefore, we expect OpenVLA’s performance with 8-bit precision to match the performance of bfloat16 and 4-bit precision under blocking control.
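A minimal sketch of the blocking-control evaluation loop follows; the policy and environment interfaces are illustrative, and the blocking flag simply denotes waiting for the commanded motion to finish before the next query.

    def rollout_with_blocking_control(policy, env, instruction, max_steps=120):
        # Each predicted action is executed to completion before the next query,
        # so rollout dynamics are independent of the policy's inference latency.
        obs = env.reset()
        for _ in range(max_steps):
            action = policy.predict_action(obs, instruction)
            obs, done = env.step(action, blocking=True)  # wait for the motion to finish
            if done:
                break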

Experimental setup and tasks. We report the performance of OpenVLA with blocking control and quantized inference on the same subset of 8 BridgeData V2 tasks used in Section D.1 and Section D.2.