
Segment Anything

Alexander Kirillov1,2,4  Eric Mintun2  Nikhila Ravi1,2  Hanzi Mao2  Chloe Rolland3  Laura Gustafson3
Tete Xiao3      Spencer Whitehead      Alexander C. Berg      Wan-Yen Lo      Piotr Dollár4      Ross Girshick4
1project lead   2joint first author   3equal contribution   4directional lead

Meta AI Research, FAIR
Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

1 Introduction

Large language models pre-trained on web-scale datasets are revolutionizing NLP with strong zero-shot and few-shot generalization [10]. These “foundation models” [8] can generalize to tasks and data distributions beyond those seen during training. This capability is often implemented with prompt engineering in which hand-crafted text is used to prompt the language model to generate a valid textual response for the task at hand. When scaled and trained with abundant text corpora from the web, these models’ zero and few-shot performance compares surprisingly well to (even matching in some cases) fine-tuned models [10, 21]. Empirical trends show this behavior improving with model scale, dataset size, and total training compute [56, 10, 21, 51].

Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent illustration aligns paired text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to novel visual concepts and data distributions. Such encoders also compose effectively with other modules to enable downstream tasks, such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision includes a wide range of problems beyond this scope, and for many of these, abundant training data does not exist.

In this work, our goal is to build a foundation model for image segmentation. That is, we seek to develop a promptable model and pre-train it on a broad dataset using a task that enables powerful generalization. With this model, we aim to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

The success of this plan hinges on three components: task, model, and data. To develop them, we address the following questions about image segmentation:

  1. What task will enable zero-shot generalization?

  2. What is the corresponding model architecture?

  3. What data can power this task and model?

These questions are entangled and require a comprehensive solution. We start by defining a promptable segmentation task that is general enough to provide a powerful pre-training objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompting and can output segmentation masks in real-time when prompted to allow for interactive use. To train our model, we need a diverse, large-scale source of data. Unfortunately, there is no web-scale data source for segmentation; to address this, we build a “data engine”, i.e., we iterate between using our efficient model to assist in data collection and using the newly collected data to improve the model. We introduce each interconnected component next, followed by the dataset we created and the experiments that demonstrate the effectiveness of our approach.

Task (§2).

In NLP and more recently computer vision, foundation models are a promising development that can perform zero-shot and few-shot learning for new datasets and tasks often by using “prompting” techniques. Inspired by this line of work, we propose the promptable segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt (see Fig. 1a). A prompt simply specifies what to segment in an image, e.g., a prompt can include spatial or text information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. We use the promptable segmentation task as both a pre-training objective and to solve general downstream segmentation tasks via prompt engineering.

Model (§3).

The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, needs to compute masks in amortized real-time to allow interactive use, and must be ambiguity-aware. Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks. We refer to this model as the Segment Anything Model, or SAM (see Fig. 1b). By separating SAM into an image encoder and a fast prompt encoder / mask decoder, the same image embedding can be reused (and its cost amortized) with different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ∼50ms in a web browser. We focus on point, box, and mask prompts, and also present initial results with free-form text prompts. To make SAM ambiguity-aware, we design it to predict multiple masks for a single prompt, allowing SAM to naturally handle ambiguity, such as the shirt vs. person example.
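The encoder/decoder split can be sketched with stand-in functions (all names hypothetical; the real components are a ViT image encoder and a small Transformer-based decoder). The expensive encoder runs once per image, and its output is reused for every prompt:

```python
def heavy_image_encoder(image):
    # Stand-in for the ViT image encoder: expensive, run once per image.
    return sum(image)

def light_prompt_decode(embedding, prompt):
    # Stand-in for the prompt encoder + mask decoder: cheap, run per prompt.
    return embedding + prompt

def segment_interactively(image, prompts):
    embedding = heavy_image_encoder(image)  # encoder cost paid once
    # The same embedding is reused for every prompt, amortizing that cost
    # across an interactive session.
    return [light_prompt_decode(embedding, p) for p in prompts]
```

This separation is what makes the ∼50ms per-prompt latency possible: only the light half of the model runs inside the interactive loop.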

Data engine (§4).

To achieve strong generalization to new data distributions, we found it necessary to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. While a typical approach for foundation models is to obtain data online [82], masks are not naturally abundant and thus we need an alternative strategy. Our solution is to build a “data engine”, i.e., we co-develop our model with model-in-the-loop dataset annotation (see Fig. 1c). Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM can automatically generate masks for a subset of objects by prompting it with likely object locations and annotators focus on annotating the remaining objects, helping increase mask diversity. In the final stage, we prompt SAM with a regular grid of foreground points, yielding on average ∼100 high-quality masks per image.

Dataset (§5).

Our final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Fig. 2). SA-1B, collected fully automatically using the final stage of our data engine, has 400× more masks than any existing segmentation dataset [66, 44, 117, 60], and as we verify extensively, the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, we hope SA-1B becomes a valuable resource for research aiming to build new foundation models.

Responsible AI (§6).

We study and report on potential fairness concerns and biases when using SA-1B and SAM. Images in SA-1B span a geographically and economically diverse set of countries and we found that SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases. We provide model and dataset cards in the appendix.

Experiments (§7).

We extensively evaluate SAM. First, using a diverse new suite of 23 segmentation datasets, we find that SAM produces high-quality masks from a single foreground point, often only slightly below that of the manually annotated ground truth. Second, we find consistently strong quantitative and qualitative results on a variety of downstream tasks under a zero-shot transfer protocol using prompt engineering, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out-of-the-box with prompt engineering to solve a variety of tasks involving object and image distributions beyond SAM’s training data. Nevertheless, room for improvement remains, as we discuss in §8.

Release.

We are releasing the SA-1B dataset for research purposes and making SAM available under a permissive open license (Apache 2.0) at https://segment-anything.com. We also showcase SAM’s capabilities with an online demo.

[Figure 2 image grid: example SA-1B images with overlaid masks, grouped by masks per image: <50, 50–100, 100–200, 200–300, 300–400, 400–500, >500.]

Figure 2: Example images with overlaid masks from our newly introduced dataset, SA-1B. SA-1B contains 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. These masks were annotated fully automatically by SAM, and as we verify by human ratings and numerous experiments, are of high quality and diversity. We group images by number of masks per image for visualization (there are ∼100 masks per image on average).

2 Segment Anything Task

We take inspiration from NLP, where the next token prediction task is used for foundation model pre-training and to solve diverse downstream tasks via prompt engineering [10]. To build a foundation model for segmentation, we aim to define a task with analogous capabilities.

Task.

We start by translating the idea of a prompt from NLP to segmentation, where a prompt can be a set of foreground / background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task, then, is to return a valid segmentation mask given any prompt. The requirement of a “valid” mask simply means that even when a prompt is ambiguous and could refer to multiple objects (e.g., recall the shirt vs. person example, and see Fig. 3), the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. We choose this task because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.

Pre-training.

The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model’s mask predictions against the ground truth. We adapt this method from interactive segmentation [109, 70], although unlike interactive segmentation whose aim is to eventually predict a valid mask after enough user input, our aim is to always predict a valid mask for any prompt even when the prompt is ambiguous. This ensures that a pre-trained model is effective in use cases that involve ambiguity, including automatic annotation as required by our data engine §4. We note that performing well at this task is challenging and requires specialized modeling and training loss choices, which we discuss in §3.
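A toy sketch of this simulation, with 1-D binary “masks” as Python lists and hypothetical `predict` / `loss_fn` stand-ins for the model and its training loss:

```python
import random

def sample_point_prompts(gt_mask, n_rounds):
    # Sample a sequence of foreground point prompts from the ground-truth
    # mask, mimicking a user clicking on the object over several rounds.
    fg = [i for i, v in enumerate(gt_mask) if v == 1]
    return [random.choice(fg) for _ in range(n_rounds)]

def pretraining_step(gt_mask, predict, loss_fn, n_rounds=3):
    # For each simulated prompt, compare the model's predicted mask to the
    # ground truth; in real training these losses drive a parameter update.
    prompts = sample_point_prompts(gt_mask, n_rounds)
    return [loss_fn(predict(p), gt_mask) for p in prompts]
```

The key difference from interactive segmentation is that each simulated prompt must already yield a valid mask, rather than only the final prompt of a refinement sequence.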

Zero-shot transfer.

Intuitively, our pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, and thus downstream tasks can be solved by engineering appropriate prompts. For example, if one has a bounding box detector for cats, cat instance segmentation can be solved by providing the detector’s box output as a prompt to our model. In general, a wide array of practical segmentation tasks can be cast as prompting. In addition to automatic dataset labeling, we explore five diverse example tasks in our experiments in §7.
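The cat-detector example above amounts to simple composition, sketched here with stub functions standing in for a real box detector and a promptable segmenter (interfaces hypothetical):

```python
def instance_segmentation_by_composition(image, box_detector, promptable_segmenter):
    # Each detected box is passed to the promptable model as a box prompt,
    # yielding one instance mask per detection -- no segmentation-specific
    # retraining is needed.
    return [promptable_segmenter(image, box_prompt=b) for b in box_detector(image)]
```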

Figure 3: Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle).

Related tasks.

Segmentation is a broad field: there’s interactive segmentation [57, 109], edge detection [3], super pixelization [85], object proposal generation [2], foreground segmentation [94], semantic segmentation [90], instance segmentation [66], panoptic segmentation [59], etc. The goal of our promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering. This capability is a form of task generalization [26]. Note that this is different than previous work on multi-task segmentation systems. In a multi-task system, a single model performs a fixed set of tasks, e.g., joint semantic, instance, and panoptic segmentation [114, 19, 54], but the training and test tasks are the same. An important distinction in our work is that a model trained for promptable segmentation can perform a new, different task at inference time by acting as a component in a larger system, e.g., to perform instance segmentation, a promptable segmentation model is combined with an existing object detector.

Discussion.

Prompting and composition are powerful tools that enable a single model to be used in extensible ways, potentially to accomplish tasks unknown at the time of model design. This approach is analogous to how other foundation models are used, e.g., how CLIP [82] is the text-image alignment component of the DALL·E [83] image generation system. We anticipate that composable system design, powered by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks. It’s also interesting to compare promptable and interactive segmentation through the lens of composition: while interactive segmentation models are designed with human users in mind, a model trained for promptable segmentation can also be composed into a larger algorithmic system as we will demonstrate.

Figure 4: Segment Anything Model (SAM) overview. A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous prompts corresponding to more than one object, SAM can output multiple valid masks and associated confidence scores.

3 Segment Anything Model

We next describe the Segment Anything Model (SAM) for promptable segmentation. SAM has three components, illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. We build on Transformer vision models [14, 33, 20, 62] with specific tradeoffs for (amortized) real-time performance. We describe these components at a high-level here, with details in §A.

Image encoder.

Motivated by scalability and powerful pre-training methods, we use an MAE [47] pre-trained Vision Transformer (ViT) [33] minimally adapted to process high resolution inputs [62]. The image encoder runs once per image and can be applied prior to prompting the model.

Prompt encoder.

We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
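A minimal sketch of the sparse point encoding (illustrative only: the paper uses random Fourier positional encodings [95] and learned embeddings; the fixed frequencies and placeholder values below are not SAM's actual parameters):

```python
import math

def fourier_position(x, y, freqs=(1.0, 2.0)):
    # Toy Fourier-feature positional encoding of a 2-D point; the fixed
    # frequencies here stand in for the paper's random Fourier features.
    enc = []
    for f in freqs:
        enc += [math.sin(f * x), math.cos(f * x), math.sin(f * y), math.cos(f * y)]
    return enc

# One learned embedding per prompt type; these values are placeholders for
# learned parameters, not real SAM weights.
TYPE_EMBEDDING = {"foreground_point": [0.1] * 8, "background_point": [-0.1] * 8}

def encode_point(x, y, label):
    # A sparse point prompt = positional encoding summed with its learned
    # type embedding, as described above.
    return [p + t for p, t in zip(fourier_position(x, y), TYPE_EMBEDDING[label])]
```

Box prompts would follow the same recipe with per-corner type embeddings, while dense mask prompts bypass this path entirely and are added element-wise to the image embedding after convolutional downsampling.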

Mask decoder.

The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design, inspired by [14, 20], employs a modification of a Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings. After running two blocks, we upsample the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
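The final step can be sketched as a dot product between the MLP-transformed output token and each spatial location of the upsampled embedding, with nested lists standing in for tensors (shapes are toy, names hypothetical):

```python
def dynamic_mask_head(upsampled_embedding, token_weights):
    # token_weights: per-channel weights produced by the MLP from the output
    # token. Each spatial cell's foreground logit is its dot product with
    # those weights -- a "dynamic" linear classifier generated per prediction.
    return [
        [sum(w * c for w, c in zip(token_weights, cell)) for cell in row]
        for row in upsampled_embedding
    ]
```

Because the classifier weights come from the token rather than fixed parameters, the same decoder can emit different masks for different prompts over one image embedding.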

Resolving ambiguity.

With one output, the model will average multiple valid masks if given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt (see Fig. 3). We found 3 mask outputs is sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, we backprop only the minimum loss [15, 45, 64] over masks. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.
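The minimum-loss rule can be sketched directly (the `loss_fn` here is a toy L1 distance; the actual supervision is described under “Losses and training”):

```python
def ambiguity_aware_loss(predicted_masks, gt_mask, loss_fn):
    # Score each predicted mask (e.g. whole / part / subpart) against the
    # single ground truth and backprop only the minimum, so the model is not
    # penalized for producing plausible alternative interpretations.
    return min(loss_fn(p, gt_mask) for p in predicted_masks)
```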

Efficiency.

The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50ms. This runtime performance enables seamless, real-time interactive prompting of our model.

Losses and training.

We supervise mask prediction with the linear combination of focal loss [65] and dice loss [73] used in [14]. We train for the promptable segmentation task using a mixture of geometric prompts (for text prompts see §7.5). Following [92, 37], we simulate an interactive setup by randomly sampling prompts in 11 rounds per mask, allowing SAM to integrate seamlessly into our data engine.

4 Segment Anything Data Engine

As segmentation masks are not abundant on the internet, we built a data engine to enable the collection of our 1.1B mask dataset, SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage with a mix of automatically predicted masks and model-assisted annotation, and (3) a fully automatic stage in which our model generates masks without annotator input. We go into details of each next.

Assisted-manual stage.

In the first stage, resembling classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground / background object points using a browser-based interactive segmentation tool powered by SAM. Masks could be refined using pixel-precise “brush” and “eraser” tools. Our model-assisted annotation runs in real-time directly inside a browser (using precomputed image embeddings) enabling a truly interactive experience. We did not impose semantic constraints for labeling objects, and annotators freely labeled both “stuff” and “things” [1]. We suggested annotators label objects they could name or describe, but did not collect these names or descriptions. Annotators were asked to label objects in order of prominence and were encouraged to proceed to the next image once a mask took over 30 seconds to annotate.

At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation, SAM was retrained using only newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44 masks. Overall, we collected 4.3M masks from 120k images in this stage.

Semi-automatic stage.

In this stage, we aimed to increase the diversity of masks in order to improve our model’s ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. Then we presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first stage masks using a generic “object” category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks) as these objects were more challenging to label. The average number of masks per image went from 44 to 72 masks (including the automatic masks).

Fully automatic stage.

In the final stage, annotation was fully automatic. This was feasible due to two major enhancements to our model. First, at the start of this stage, we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allowed us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, our model will return the subpart, part, and whole object. The IoU prediction module of our model is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding the probability map at 0.5−δ and 0.5+δ results in similar masks). Finally, after selecting the confident and stable masks, we applied non-maximal suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in image crops. For further details of this stage, see §B. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks. We describe and analyze the resulting dataset, SA-1B, next.
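The stability filter can be sketched as follows on a flattened probability map (the δ value below is illustrative; the actual threshold choices are left to §B):

```python
def stability_score(prob_map, delta=0.05):
    # Threshold the predicted probability map at 0.5 - delta and 0.5 + delta;
    # the IoU of the two resulting binary masks measures how little the mask
    # changes with the threshold. Masks scoring near 1 are kept as "stable".
    lo = [1 if p >= 0.5 - delta else 0 for p in prob_map]
    hi = [1 if p >= 0.5 + delta else 0 for p in prob_map]
    inter = sum(a & b for a, b in zip(lo, hi))
    union = sum(a | b for a, b in zip(lo, hi))
    return inter / union if union else 1.0
```

A confidently segmented object has probabilities far from 0.5 everywhere, so both thresholds produce the same mask; borderline pixels near 0.5 pull the score down.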

Figure 5: Image-size normalized mask center distributions.
Figure 6: Dataset mask properties. The legend references the number of images and masks in each dataset. Note that SA-1B has 11× more images and 400× more masks than the largest existing segmentation dataset, Open Images [60].

5 Segment Anything Dataset

Our dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.

Images.

We licensed a new set of 11M images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we are releasing downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are ∼480×640 pixels). Note that most models today operate on much lower resolution inputs. Faces and vehicle license plates have been blurred in the released images.

Masks.

Our data engine produced 1.1B masks, 99.1% of which were generated fully automatically. Therefore, the quality of the automatic masks is centrally important. We compare them directly to professional annotations and look at how various mask properties compare to prominent segmentation datasets. Our main conclusion, as borne out in the analysis below and the experiments in §7, is that our automatic masks are high quality and effective for training models. Motivated by these findings, SA-1B only includes automatically generated masks.

Mask quality.

To estimate mask quality, we randomly sampled 500 images (∼50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and pixel-precise “brush” and “eraser” editing tools. This procedure resulted in pairs of automatically predicted and professionally corrected masks. We computed IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85-91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets and that training our model on automatic masks is nearly as good as using all masks produced by the data engine.
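The pairwise agreement statistic used above is mechanical to compute; a minimal sketch (our own helper names, with binary masks represented as nested lists):

```python
def mask_iou(a, b):
    # Intersection-over-union of two same-shape binary masks (lists of 0/1).
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0

def agreement_stats(pairs, thresholds=(0.9, 0.75)):
    # pairs: list of (auto_mask, corrected_mask). Returns the fraction of
    # pairs whose IoU exceeds each threshold, as in the 94% @ 0.9 IoU and
    # 97% @ 0.75 IoU figures quoted above.
    ious = [mask_iou(a, b) for a, b in pairs]
    return {t: sum(i > t for i in ious) / len(ious) for t in thresholds}
```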

Figure 7: Estimated geographic distribution of SA-1B images. Most of the world’s countries have more than 1000 images in SA-1B, and the three countries with the most images are from different parts of the world.

Mask properties.

In Fig. 5 we plot the spatial distribution of object centers in SA-1B compared to the largest existing segmentation datasets. Common photographer biases are present in all datasets. We observe that SA-1B has greater coverage of image corners compared to LVIS v1 [44] and ADE20K [117], the two most similarly distributed datasets, while COCO [66] and Open Images V5 [60] have a more prominent center bias. In Fig. 6 (legend) we compare these datasets by size. SA-1B has 11× more images and 400× more masks than the second largest, Open Images. On average, it has 36× more masks per image than Open Images. The closest dataset in this respect, ADE20K, still has 3.5× fewer masks per image. Fig. 6 (left) plots the masks-per-image distribution. Next, we look at image-relative mask size (square root of the mask area divided by image area) in Fig. 6 (middle). As expected, since our dataset has more masks per image, it also tends to include a greater percentage of small and medium relative-size masks. Finally, to analyze shape complexity, we look at mask concavity (1 minus mask area divided by area of mask’s convex hull) in Fig. 6 (right). Since shape complexity is correlated with mask size, we control for the datasets’ mask size distributions by first performing stratified sampling from binned mask sizes. We observe that the concavity distribution of our masks is broadly similar to that of other datasets.
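The concavity measure (1 minus mask area divided by convex hull area) can be computed without any libraries. In this sketch (ours, with pixels treated as unit squares so that a filled rectangle scores exactly 0), the hull comes from Andrew's monotone chain and its area from the shoelace formula:

```python
def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(verts):
    # Shoelace formula for a simple polygon.
    n = len(verts)
    s = sum(verts[i][0] * verts[(i + 1) % n][1]
            - verts[(i + 1) % n][0] * verts[i][1] for i in range(n))
    return abs(s) / 2.0

def concavity(mask):
    # mask: 2D list of 0/1. Each foreground pixel (r, c) contributes its four
    # unit-square corners to the hull, so convex shapes score 0 exactly.
    area = sum(sum(row) for row in mask)
    corners = [(c + dc, r + dr) for r, row in enumerate(mask)
               for c, v in enumerate(row) if v
               for dr in (0, 1) for dc in (0, 1)]
    return 1.0 - area / polygon_area(convex_hull(corners))
```

For example, a filled 2×2 square has concavity 0, while an L-shape of three pixels has hull area 3.5 and concavity 1 − 3/3.5 ≈ 0.14.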

6 Segment Anything RAI Analysis

We next perform a Responsible AI (RAI) analysis of our work by investigating potential fairness concerns and biases when using SA-1B and SAM. We focus on the geographic and income distribution of SA-1B and fairness of SAM across protected attributes of people. We also provide dataset, data annotation, and model cards in §F.

                                    SA-1B                      % images
                         # countries   #imgs   #masks    SA-1B    COCO    O.I.
Africa                        54        300k     28M      2.8%    3.0%    1.7%
Asia & Oceania                70        3.9M    423M     36.2%   11.4%   14.3%
Europe                        47        5.4M    540M     49.8%   34.2%   36.2%
Latin America & Carib.        42        380k     36M      3.5%    3.1%    5.0%
North America                  4        830k     80M      7.7%   48.3%   42.8%
high income countries         81        5.8M    598M     54.0%   89.1%   87.5%
middle income countries      108        4.9M    499M     45.0%   10.5%   12.0%
low income countries          28        100k    9.4M      0.9%    0.4%    0.5%
Table 1: Comparison of geographic and income representation. SA-1B has higher representation in Europe and Asia & Oceania as well as middle income countries. Images from Africa, Latin America & Caribbean, as well as low income countries, are underrepresented in all datasets.

Geographic and income representation.

We infer the country images were photographed in using standard methods (see §C). In Fig. 7 we visualize the per-country image counts in SA-1B (left) and the 50 countries with the most images (right). We note that the top-three countries are from different parts of the world. Next, in Table 1 we compare the geographic and income representation of SA-1B, COCO [66], and Open Images [60]. SA-1B has a substantially higher percentage of images in Europe and Asia & Oceania as well as in middle income countries. All datasets underrepresent Africa as well as low income countries. We note that in SA-1B, all regions, including Africa, have at least 28 million masks, 10× more than the total number of masks of any previous dataset. Finally, we observe that the average number of masks per image (not shown) is fairly consistent across region and income (94-108 per image).

Fairness in segmenting people.

We investigate potential fairness concerns across perceived gender presentation, perceived age group, and perceived skin tone by measuring the performance discrepancy of SAM between groups. We use the More Inclusive Annotations for People (MIAP) [87] dataset for gender presentation and age and a proprietary dataset for skin tone (see §C). Our evaluation uses simulated interactive segmentation with random sampling of 1 and 3 points (see §D). Table 2 (top left) shows results for perceived gender presentation. We note that females have been shown to be underrepresented in detection and segmentation datasets [115], but observe that SAM performs similarly across groups. We repeat the analysis for perceived age in Table 2 (bottom left), noting that those who are perceived to be younger and older have been shown to be underrepresented in large-scale datasets [110]. SAM performs best on those who are perceived older (although the confidence interval is large). Finally, we repeat the analysis for perceived skin tone in Table 2 (right), noting that those with lighter apparent skin tones have been shown to be overrepresented and those with darker skin tones underrepresented in large-scale datasets [110]. As MIAP does not contain perceived skin tone annotations, we use a proprietary dataset that contains annotations for the perceived Fitzpatrick skin type [36], which ranges from 1 (lightest skin tone) to 6 (darkest skin tone). While the means vary somewhat, we do not find a significant difference across groups. We believe our findings stem from the nature of the task, and acknowledge biases may arise when SAM is used as a component in larger systems. Finally, in §C we extend the analysis to segmenting clothing where we find an indication of bias across perceived gender presentation.

                                      mIoU at
                                1 point      3 points
perceived gender presentation
  feminine                     54.4±1.7     90.4±0.6
  masculine                    55.7±1.7     90.1±0.6
perceived age group
  older                        62.9±6.7     92.6±1.3
  middle                       54.5±1.3     90.2±0.5
  young                        54.2±2.2     91.2±0.7

                                      mIoU at
                                1 point      3 points
perceived skin tone
  1                            52.9±2.2     91.0±0.9
  2                            51.5±1.4     91.1±0.5
  3                            52.2±1.9     91.4±0.7
  4                            51.5±2.7     91.7±1.0
  5                            52.4±4.2     92.5±1.4
  6                            56.7±6.3     91.2±2.4
Table 2: SAM’s performance segmenting people across perceived gender presentation, age group, and skin tone. 95% confidence intervals are shown. Within each grouping, all confidence intervals overlap except older vs. middle.

7 Zero-Shot Transfer Experiments

In this section, we present zero-shot transfer experiments with SAM, the Segment Anything Model. We consider five tasks, four of which differ significantly from the promptable segmentation task used to train SAM. These experiments evaluate SAM on datasets and tasks that were not seen during training (our usage of “zero-shot transfer” follows its usage in CLIP [82]). The datasets may include novel image distributions, such as underwater or ego-centric images (e.g. Fig. 8) that, to our knowledge, do not appear in SA-1B.

Our experiments begin by testing the core goal of promptable segmentation: producing a valid mask from any prompt. We emphasize the challenging scenario of a single foreground point prompt, since it is more likely to be ambiguous than other more specific prompts. Next, we present a sequence of experiments that traverse low, mid, and high-level image understanding and roughly parallel the historical development of the field. Specifically, we prompt SAM to (1) perform edge detection, (2) segment everything, i.e. object proposal generation, (3) segment detected objects, i.e. instance segmentation, and (4), as a proof-of-concept, to segment objects from free-form text. These four tasks differ significantly from the promptable segmentation task that SAM was trained on and are implemented via prompt engineering. Our experiments conclude with an ablation study.

Implementation.

Unless otherwise specified: (1) SAM uses an MAE [47] pre-trained ViT-H [33] image encoder and (2) SAM was trained on SA-1B, noting that this dataset includes only automatically generated masks from the final stage of our data engine. For all other model and training details, such as hyperparameters, refer to §A.

7.1 Zero-Shot Single Point Valid Mask Evaluation

Task.

We evaluate segmenting an object from a single foreground point. This task is ill-posed as one point can refer to multiple objects. Ground truth masks in most datasets do not enumerate all possible masks, which can make automatic metrics unreliable. Therefore, we supplement the standard mIoU metric (i.e., the mean of all IoUs between predicted and ground truth masks) with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect). See §D.1, §E, and §G for additional details.

By default, we sample points from the “center” of ground truth masks (at a maximal value of the mask’s interior distance transform), following the standard evaluation protocol in interactive segmentation [92]. Since SAM is capable of predicting multiple masks, we evaluate only the model’s most confident mask by default. The baselines are all single-mask methods. We compare mainly to RITM [92], a strong interactive segmenter that performs best on our benchmark compared to other strong baselines [67, 18].
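The "center" point used as the default prompt is the maximum of the mask's interior distance transform. A minimal sketch (ours, not the evaluation code) that approximates the transform with a 4-connected BFS seeded from the background and the image border:

```python
from collections import deque

def center_point(mask):
    # mask: 2D list of 0/1. Returns the (row, col) of the foreground pixel
    # farthest, in 4-connected steps, from the background (the area outside
    # the image also counts as background). This BFS distance is an integer
    # approximation of the exact interior distance transform.
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    # Seed all background pixels at distance 0 first...
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 0:
                dist[r][c] = 0
                q.append((r, c))
    # ...then foreground pixels on the image border, one step from "outside".
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1 and dist[r][c] is None and (
                    r in (0, h - 1) or c in (0, w - 1)):
                dist[r][c] = 1
                q.append((r, c))
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                q.append((nr, nc))
    return max(((r, c) for r in range(h) for c in range(w) if mask[r][c]),
               key=lambda rc: dist[rc[0]][rc[1]])
```

For a centered blob the returned point is its middle pixel, matching the intent of the interactive segmentation protocol.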

[Figure 8 grid: one sample image from each of ADE20K [117], BBBC038v1 [12], Cityscapes [25], DOORS [80], DRAM [24], EgoHOS [113], GTEA [34, 63], Hypersim [86], IBD [17], iShape [111], LVIS [44], NDD20 [100], NDISPark [22, 23], OVIS [81], PPDLS [74], Plittersdorf [46], STREETS [91], TimberSeg [38], TrashCan [52], VISOR [28, 27], WoodScape [112], PIDRay [104], ZeroWaste-f [6]]
Figure 8: Samples from the 23 diverse segmentation datasets used to evaluate SAM’s zero-shot transfer capabilities.

Datasets.

We use a newly compiled suite of 23 datasets with diverse image distributions. Fig. 8 lists the datasets and shows a sample from each one (see appendix Table 7 for more details). We use all 23 datasets for mIoU evaluation. For the human study, we use the subset listed in Fig. 9b (due to the resource requirements of such studies). This subset includes both datasets for which SAM outperforms and underperforms RITM according to automatic metrics.

[Bar chart, IoU delta at 1 center point per dataset: GTEA [34, 63] -21.4, TrashCan [52] -15.0, DRAM [24] -6.5, PIDRay [104] -5.8, Cityscapes [25] -2.0, WoodScape [112] -0.6, IBD [17] -0.3, EgoHOS [113] +0.8, Plittersdorf [46] +1.5, VISOR [28, 27] +1.8, NDISPark [22, 23] +2.7, Hypersim [86] +6.1, OVIS [81] +7.0, ADE20K [117] +7.8, iShape [111] +8.8, ZeroWaste-f [6] +9.1, STREETS [91] +17.3, LVIS [44] +18.5, NDD20 [100] +21.1, TimberSeg [38] +28.9, DOORS [80] +41.1, BBBC038v1 [12] +44.7, PPDLS [74] +46.9]
(a) SAM vs. RITM [92] on 23 datasets
(b) Mask quality ratings by human annotators
(c) Center points (default)
(d) Random points
Figure 9: Point to mask evaluation on 23 datasets. (a) Mean IoU of SAM and the strongest single point segmenter, RITM [92]. Due to ambiguity, a single mask may not match ground truth; circles show “oracle” results of the most relevant of SAM’s 3 predictions. (b) Per-dataset comparison of mask quality ratings by annotators from 1 (worst) to 10 (best). All methods use the ground truth mask center as the prompt. (c, d) mIoU with varying number of points. SAM significantly outperforms prior interactive segmenters with 1 point and is on par with more points. Low absolute mIoU at 1 point is the result of ambiguity.

Results.

First, we look at automatic evaluation on the full suite of 23 datasets using mIoU. We compare per-dataset results in Fig. 9a against RITM. SAM yields higher results on 16 of the 23 datasets, by as much as ∼47 IoU. We also present an “oracle” result, in which the most relevant of SAM’s 3 masks is selected by comparing them to the ground truth, rather than selecting the most confident mask. This reveals the impact of ambiguity on automatic evaluation. In particular, with the oracle to perform ambiguity resolution, SAM outperforms RITM on all datasets.

Results of the human study are presented in Fig. 9b. Error bars are 95% confidence intervals for mean mask ratings (all differences are significant; see §E for details). We observe that the annotators consistently rate the quality of SAM’s masks substantially higher than the strongest baseline, RITM. An ablated, “ambiguity-unaware” version of SAM with a single output mask has consistently lower ratings, though still higher than RITM. SAM’s mean ratings fall between 7 and 9, which corresponds to the qualitative rating guideline: “A high score (7-9): The object is identifiable and errors are small and rare (e.g., missing a small, heavily obscured disconnected component, …).” These results indicate that SAM has learned to segment valid masks from a single point. Note that for datasets like DRAM and IBD, where SAM is worse on automatic metrics, it receives consistently higher ratings in the human study.

Fig. 9c shows additional baselines, SimpleClick [67] and FocalClick [18], which obtain lower single point performance than RITM and SAM. As the number of points increases from 1 to 9, we observe that the gap between methods decreases. This is expected as the task becomes easier; also, SAM is not optimized for the very high IoU regime. Finally, in Fig. 9d we replace the default center point sampling with random point sampling. We observe that the gap between SAM and the baselines grows and SAM is able to achieve comparable results under either sampling method.

7.2 Zero-Shot Edge Detection

Approach.

We evaluate SAM on the classic low-level task of edge detection using BSDS500 [72, 3]. We use a simplified version of our automatic mask generation pipeline. Specifically, we prompt SAM with a 16×16 regular grid of foreground points resulting in 768 predicted masks (3 per point). Redundant masks are removed by NMS. Then, edge maps are computed using Sobel filtering of unthresholded mask probability maps and standard lightweight postprocessing, including edge NMS (see §D.2 for details).
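The Sobel step of this pipeline is straightforward to sketch. The following illustrative code (ours, omitting the edge-NMS postprocessing) computes gradient magnitudes of an unthresholded probability map:

```python
def sobel_edges(prob):
    # prob: 2D list of mask probabilities in [0, 1]. Returns the Sobel
    # gradient magnitude at each interior pixel (zeros on the 1-pixel border).
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel
    h, w = len(prob), len(prob[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(kx[i][j] * prob[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * prob[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out
```

A vertical step in the probability map produces a column of strong responses, while a constant map produces none; in the full pipeline these magnitudes are then thinned with edge NMS.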

[Figure 10 panels, two rows: image | ground truth | SAM]
Figure 10: Zero-shot edge prediction on BSDS500. SAM was not trained to predict edge maps nor did it have access to BSDS images or annotations during training.
method            year    ODS    OIS     AP    R50
HED [108]         2015   .788   .808   .840   .923
EDETR [79]        2022   .840   .858   .896   .930
zero-shot transfer methods:
Sobel filter      1968   .539     -      -      -
Canny [13]        1986   .600   .640   .580     -
Felz-Hutt [35]    2004   .610   .640   .560     -
SAM               2023   .768   .786   .794   .928
Table 3: Zero-shot transfer to edge detection on BSDS500.

Results.

We visualize representative edge maps in Fig. 10 (see Fig. 15 for more). Qualitatively, we observe that even though SAM was not trained for edge detection, it produces reasonable edge maps. Compared to the ground truth, SAM predicts more edges, including sensible ones that are not annotated in BSDS500. This bias is reflected quantitatively in Table 3: recall at 50% precision (R50) is high, at the cost of precision. SAM naturally lags behind state-of-the-art methods that learn the biases of BSDS500, i.e., which edges to suppress. Nevertheless, SAM performs well compared to pioneering deep learning methods such as HED [108] (also trained on BSDS500) and significantly better than prior, though admittedly outdated, zero-shot transfer methods.

7.3 Zero-Shot Object Proposals

Approach.

Next, we evaluate SAM on the mid-level task of object proposal generation [2, 102]. This task has played an important role in object detection research, serving as an intermediate step in pioneering systems (e.g., [102, 41, 84]). To generate object proposals, we run a slightly modified version of our automatic mask generation pipeline and output the masks as proposals (see §D.3 for details).

We compute the standard average recall (AR) metric on LVIS v1 [44]. We focus on LVIS because its large number of categories presents a challenging test. We compare to a strong baseline implemented as a ViTDet [62] detector (with cascade Mask R-CNN [48, 11] ViT-H). We note that this “baseline” corresponds to the “Detector Masquerading as Proposal generator” (DMP) method [16] that was shown to game AR, making it a truly demanding comparison.
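Average recall itself is simple to compute once each ground-truth mask's best-matching proposal IoU is known. A sketch (ours, not the LVIS toolkit) using the usual 0.5 to 0.95 threshold range in steps of 0.05:

```python
def average_recall(gt_to_best_iou, thresholds=None):
    # gt_to_best_iou: for each ground-truth mask, the max IoU achieved by any
    # of the (up to 1000) proposals. AR averages recall over IoU thresholds
    # 0.5, 0.55, ..., 0.95, in the COCO/LVIS style.
    if thresholds is None:
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    n = len(gt_to_best_iou)
    recalls = [sum(iou >= t for iou in gt_to_best_iou) / n for t in thresholds]
    return sum(recalls) / len(recalls)
```

A ground-truth object matched at IoU 0.72, for instance, counts as recalled at the five thresholds up to 0.7 but missed above, so imperfect matches still contribute partial credit.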

                                    mask AR@1000
method                all   small    med.   large   freq.    com.   rare
ViTDet-H [62]        63.0    51.7    80.8    87.0    63.1    63.3   58.3
zero-shot transfer methods:
SAM – single out.    54.9    42.8    76.7    74.4    54.7    59.8   62.0
SAM                  59.3    45.5    81.6    86.9    59.1    63.9   65.8
Table 4: Object proposal generation on LVIS v1. SAM is applied zero-shot, i.e. it was not trained for object proposal generation nor did it access LVIS images or annotations.

Results.

In Table 4 we see unsurprisingly that using the detections from ViTDet-H as object proposals (i.e., the DMP method [16] that games AR) performs the best overall. However, SAM does remarkably well on several metrics. Notably, it outperforms ViTDet-H on medium and large objects, as well as rare and common objects. In fact, SAM only underperforms ViTDet-H on small objects and frequent objects, where ViTDet-H can easily learn LVIS-specific annotation biases since it was trained on LVIS, unlike SAM. We also compare against an ablated ambiguity-unaware version of SAM (“single out.”), which performs significantly worse than SAM on all AR metrics.

7.4 Zero-Shot Instance Segmentation

Approach.

Moving to higher-level vision, we use SAM as the segmentation module of an instance segmenter. The implementation is simple: we run an object detector (the ViTDet used before) and prompt SAM with its output boxes. This illustrates composing SAM in a larger system.

Results.

We compare the masks predicted by SAM and ViTDet on COCO and LVIS in Table 5. Looking at the mask AP metric we observe gaps on both datasets, where SAM is reasonably close, though certainly behind ViTDet. By visualizing outputs, we observed that SAM masks are often qualitatively better than those of ViTDet, with crisper boundaries (see §D.4 and Fig. 16). To investigate this observation, we conducted an additional human study asking annotators to rate the ViTDet masks and SAM masks on the 1 to 10 quality scale used before. In Fig. 11 we observe that SAM consistently outperforms ViTDet in the human study.

                            COCO [66]                     LVIS v1 [44]
method                AP   AP_S   AP_M   AP_L        AP   AP_S   AP_M   AP_L
ViTDet-H [62]       51.0   32.0   54.3   68.9      46.6   35.0   58.0   66.3
zero-shot transfer methods (segmentation module only):
SAM                 46.5   30.8   51.0   61.7      44.7   32.5   57.6   65.5
Table 5: Instance segmentation results. SAM is prompted with ViTDet boxes to do zero-shot segmentation. The fully-supervised ViTDet outperforms SAM, but the gap shrinks on the higher-quality LVIS masks. Interestingly, SAM outperforms ViTDet according to human ratings (see Fig. 11).
Figure 11: Mask quality rating distribution from our human study for ViTDet and SAM, both applied to LVIS ground truth boxes. We also report LVIS and COCO ground truth quality. The legend shows rating means and 95% confidence intervals. Despite its lower AP (Table 5), SAM has higher ratings than ViTDet, suggesting that ViTDet exploits biases in the COCO and LVIS training data.

We hypothesize that on COCO, where the mask AP gap is larger and the ground truth quality is relatively low (as borne out by the human study), ViTDet learns the specific biases of COCO masks. SAM, being a zero-shot method, is unable to exploit these (generally undesirable) biases. The LVIS dataset has higher quality ground truth, but there are still specific idiosyncrasies (e.g., masks do not contain holes, they are simple polygons by construction) and biases for modal vs. amodal masks. Again, SAM is not trained to learn these biases, while ViTDet can exploit them.

7.5 Zero-Shot Text-to-Mask

Approach.

Finally, we consider an even higher-level task: segmenting objects from free-form text. This experiment is a proof-of-concept of SAM’s ability to process text prompts. While we used the exact same SAM in all prior experiments, for this one SAM’s training procedure is modified to make it text-aware, but in a way that does not require new text annotations. Specifically, for each manually collected mask with area larger than 100² we extract the CLIP image embedding. Then, during training, we prompt SAM with the extracted CLIP image embeddings as its first interaction. The key observation here is that because CLIP’s image embeddings are trained to align with its text embeddings, we can train with image embeddings, but use text embeddings for inference. That is, at inference time we run text through CLIP’s text encoder and then give the resulting text embedding as a prompt to SAM (see §D.5 for details).
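The train-with-images, prompt-with-text trick rests entirely on CLIP's two towers sharing one embedding space. The toy below (all vectors invented for illustration; this is not CLIP or SAM code) shows why a prompt pathway fit to image embeddings also responds correctly to the aligned text embeddings:

```python
def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# "Training-time" prompts: stand-ins for CLIP *image* embeddings of mask
# crops (real CLIP embeddings are high-dimensional; these are toy 3-vectors).
train_prompts = {
    "wheel_mask": [0.9, 0.1, 0.0],
    "grille_mask": [0.1, 0.2, 0.95],
}

def nearest_mask(prompt_emb):
    # Hypothetical stand-in for the trained prompt pathway: returns the mask
    # whose training-time image embedding the prompt most resembles.
    return max(train_prompts, key=lambda k: cosine(train_prompts[k], prompt_emb))

# "Inference-time" prompts: stand-ins for CLIP *text* embeddings. Because
# CLIP aligns the two modalities, each text vector lands near its image
# counterpart, so the image-trained pathway handles it unchanged.
text_wheel = [0.88, 0.12, 0.05]    # "a wheel"
text_grille = [0.05, 0.25, 0.9]    # "beaver tooth grille"
```

Here `nearest_mask(text_wheel)` selects the wheel mask even though only image embeddings were ever seen at "training" time; the actual system exploits the same alignment with SAM's learned prompt encoder in place of this lookup.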

Figure 12: Zero-shot text-to-mask. SAM can work with simple and nuanced text prompts. When SAM fails to make a correct prediction, an additional point prompt can help.

Results. 结果

We show qualitative results in Fig. 12. SAM can segment objects based on simple text prompts like “a wheel” as well as phrases like “beaver tooth grille”. When SAM fails to pick the right object from a text prompt only, an additional point often fixes the prediction, similar to [31].
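The text-then-point fallback described above can be sketched as an accumulating prompt list. This is an illustrative sketch, not the paper's code; `sam` and the `(kind, value)` prompt representation are assumptions.

```python
def refine_with_point(sam, image, text_embedding, point=None):
    """Predict from a text prompt, optionally refined by one point click.

    Prompts are accumulated rather than replaced: if the text-only
    prediction picks the wrong object, the caller re-runs SAM with the
    same text embedding plus a user-supplied point, mirroring the
    interactive fix-up shown in Fig. 12.
    """
    prompts = [("text", text_embedding)]
    if point is not None:
        prompts.append(("point", point))
    return sam(image, prompts)
```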

Figure 13: Ablation studies of our data engine stages, image encoder scaling, and training data scaling. (Left) Each data engine stage leads to improvements on our 23 dataset suite, and training with only the automatic data (our default) yields similar results to using data from all three stages. (Middle) SAM trained with ∼10% of SA-1B and full SA-1B is comparable. We train with all 11M images by default, but using 1M images is a reasonable practical setting. (Right) Scaling SAM’s image encoder shows meaningful, yet saturating gains. Nevertheless, smaller image encoders may be preferred in certain settings.

7.6 Ablations

We perform several ablations on our 23 dataset suite with the single center point prompt protocol. Recall that a single point may be ambiguous and that ambiguity may not be represented in the ground truth, which contains only a single mask per point. Since SAM is operating in a zero-shot transfer setting there can be systematic biases between SAM’s top-ranked mask vs. the masks resulting from data annotation guidelines. We therefore additionally report the best mask with respect to the ground truth (“oracle”).
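The "oracle" protocol can be made concrete with a small sketch (masks are represented as sets of pixel coordinates for brevity; these helper names are illustrative, not the paper's evaluation code):

```python
def iou(pred, gt):
    """Intersection-over-union of two binary masks given as pixel-coordinate sets."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def point_prompt_scores(ranked_masks, gt_mask):
    """Return (default, oracle) IoU for one single-point prompt.

    `ranked_masks` is SAM's ranked list of candidate masks. The default
    score uses the top-ranked mask; the oracle score picks the candidate
    closest to the ground truth, which factors out the ambiguity of a
    single point mapping to several valid masks.
    """
    default = iou(ranked_masks[0], gt_mask)
    oracle = max(iou(m, gt_mask) for m in ranked_masks)
    return default, oracle
```

Reporting both numbers separates SAM's ranking quality (default) from the quality of its best candidate (oracle).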

Fig. 13 (left) plots SAM’s performance when trained on cumulative data from the data engine stages. We observe that each stage increases mIoU. When training with all three stages, the automatic masks vastly outnumber the manual and semi-automatic masks. To address this, we found that oversampling the manual and semi-automatic masks during training by 10× gave best results. This setup complicates training. We therefore tested a fourth setup that uses only the automatically generated masks. With this data, SAM performs only marginally lower than using all data (∼0.5 mIoU). Therefore, by default we use only the automatically generated masks to simplify the training setup.
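The 10× oversampling setup can be sketched as follows (the stage labels and the flat epoch-list representation are illustrative, not the actual training loader):

```python
OVERSAMPLE = 10  # manual and semi-automatic masks are repeated 10x per epoch

def build_epoch(masks):
    """Build one epoch's sample list with stage-dependent oversampling.

    `masks` is a list of (mask_id, stage) pairs, with stage in
    {"manual", "semi", "auto"}. Automatic masks vastly outnumber the
    other two stages, so manual and semi-automatic masks are duplicated
    to keep them from being drowned out during training.
    """
    epoch = []
    for mask_id, stage in masks:
        repeats = OVERSAMPLE if stage in ("manual", "semi") else 1
        epoch.extend([mask_id] * repeats)
    return epoch
```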

In Fig. 13 (middle) we look at the impact of data volume. The full SA-1B contains 11M images, which we uniformly subsample to 1M and 0.1M for this ablation. At 0.1M images, we observe a large mIoU decline under all settings. However, with 1M images, about 10% of the full dataset, we observe results comparable to using the full dataset. This data regime, which still includes approximately 100M masks, may be a practical setting for many use cases.

Finally, Fig. 13 (right) shows results with ViT-B, ViT-L, and ViT-H image encoders. ViT-H improves substantially over ViT-B, but has only marginal gains over ViT-L. Further image encoder scaling does not appear fruitful at this time.

8 Discussion

Foundation models.

Pre-trained models have been adapted to downstream tasks since the early days of machine learning [99]. This paradigm has become increasingly important in recent years with a growing emphasis on scale, and such models have recently been (re-)branded as “foundation models”: i.e. models that are “trained on broad data at scale and are adaptable to a wide range of downstream tasks” [8]. Our work correlates well with this definition, though we note that a foundation model for image segmentation is an inherently limited scope, since it represents an important, yet fractional, subset of computer vision. We also contrast one aspect of our approach with [8], which emphasizes the role of self-supervised learning in foundation models. While our model is initialized with a self-supervised technique (MAE [47]), the vast majority of its capabilities come from large-scale supervised training. In cases where data engines can scale available annotations, like ours, supervised training provides an effective solution.

Compositionality.

Pre-trained models can power new capabilities even beyond ones imagined at the moment of training. One prominent example is how CLIP [82] is used as a component in larger systems, such as DALL·E [83]. Our goal is to make this kind of composition straightforward with SAM. We aim to achieve this by requiring SAM to predict a valid mask for a wide range of segmentation prompts. The effect is to create a reliable interface between SAM and other components. For example, MCC [106] can easily use SAM to segment an object of interest and achieve strong generalization to unseen objects for 3D reconstruction from a single RGB-D image. In another example, SAM can be prompted with gaze points detected by a wearable device, enabling new applications. Thanks to SAM’s ability to generalize to new domains like ego-centric images, such systems work without need for additional training.
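The interface this compositionality relies on can be sketched generically; `sam` and the prompt sources here are placeholders for whatever components are being composed, not a real API.

```python
def segment_with(sam, prompt_source, image):
    """Compose SAM with any upstream component that emits a prompt.

    `prompt_source` is a callable that maps an image to a (kind, value)
    prompt, e.g. ("point", (x, y)) from a gaze tracker or ("box", xyxy)
    from a detector. Because SAM is trained to return a valid mask for
    any such prompt, the upstream component needs no knowledge of SAM's
    internals, and SAM needs no retraining for the new source.
    """
    return sam(image, prompt_source(image))
```

The design choice is that the contract lives entirely in the prompt: swapping the gaze tracker for a detector changes nothing downstream.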

Limitations.

While SAM performs well in general, it is not perfect. It can miss fine structures, can hallucinate small disconnected components at times, and does not produce boundaries as crisply as more computationally intensive methods that “zoom-in”, e.g. [18]. In general, we expect dedicated interactive segmentation methods to outperform SAM when many points are provided, e.g. [67]. Unlike these methods, SAM is designed for generality and breadth of use rather than high-IoU interactive segmentation. Moreover, SAM can process prompts in real time, but its overall performance is nevertheless not real-time when using a heavy image encoder. Our foray into the text-to-mask task is exploratory and not entirely robust, although we believe it can be improved with more effort. While SAM can perform many tasks, it is unclear how to design simple prompts that implement semantic and panoptic segmentation. Finally, there are domain-specific tools, such as [7], that we expect to outperform SAM in their respective domains.

Conclusion.

The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. Our principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible. Whether SAM achieves the status of a foundation model remains to be seen by how it is used in the community, but regardless we expect the perspective of this work, the release of over 1B masks, and our promptable segmentation model will help pave the path ahead.

Acknowledgments.

We would like to thank Aaron Adcock and Jitendra Malik for helpful discussion. We thank Vaibhav Aggarwal and Yanghao Li for help with scaling the model. We thank Cheng-Yang Fu, Jiabo Hu, and Robert Kuo for help with data annotation platform. We thank Allen Goodman and Bram Wasti for help in optimizing web-version of our model. Finally, we thank Morteza Behrooz, Ashley Gabriel, Ahuva Goldstand, Sumanth Gurram, Somya Jain, Devansh Kukreja, Joshua Lane, Lilian Luong, Mallika Malhotra, William Ngan, Omkar Parkhi, Nikhil Raina, Dirk Rowe, Neil Sejoor, Vanessa Stark, Bala Varadarajan, and Zachary Winstrom for their help in making the demo, dataset viewer, and other assets and tooling.

References

  • [1] Edward H Adelson. On seeing stuff: the perception of materials by humans and machines. Human vision and electronic imaging VI, 2001.
  • [2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? CVPR, 2010.
  • [3] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. TPAMI, 2010.
  • [4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.
  • [5] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021.
  • [6] Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable object segmentation in cluttered scenes. CVPR, 2022.
  • [7] Stuart Berg, Dominik Kutra, Thorben Kroeger, Christoph N. Straehle, Bernhard X. Kausler, Carsten Haubold, Martin Schiegg, Janez Ales, Thorsten Beier, Markus Rudy, Kemal Eren, Jaime I. Cervantes, Buote Xu, Fynn Beuttenmueller, Adrian Wolny, Chong Zhang, Ullrich Koethe, Fred A. Hamprecht, and Anna Kreshuk. ilastik: interactive machine learning for (bio)image analysis. Nature Methods, 2019.
  • [8] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021.
  • [9] Gustav Bredell, Christine Tanner, and Ender Konukoglu. Iterative interaction training for segmentation editing networks. MICCAI, 2018.
  • [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020.
  • [11] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. CVPR, 2018.
  • [12] Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods, 2019.
  • [13] John Canny. A computational approach to edge detection. TPAMI, 1986.
  • [14] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with Transformers. ECCV, 2020.
  • [15] Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. Automatic image colorization via multimodal predictions. ECCV, 2008.
  • [16] Neelima Chavali, Harsh Agrawal, Aroma Mahendru, and Dhruv Batra. Object-proposal evaluation protocol is’ gameable’. CVPR, 2016.
  • [17] Jiazhou Chen, Yanghui Xu, Shufang Lu, Ronghua Liang, and Liangliang Nan. 3D instance segmentation of MVS buildings. IEEE Transactions on Geoscience and Remote Sensing, 2022.
  • [18] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. FocalClick: towards practical interactive image segmentation. CVPR, 2022.
  • [19] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. CVPR, 2022.
  • [20] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.
  • [21] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
  • [22] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Domain adaptation for traffic density estimation. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2021.
  • [23] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Night and day instance segmented park (NDISPark) dataset: a collection of images taken by day and by night for vehicle detection, segmentation and counting in parking areas. Zenodo, 2022.
  • [24] Nadav Cohen, Yael Newman, and Ariel Shamir. Semantic segmentation in art paintings. Computer Graphics Forum, 2022.
  • [25] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, 2016.
  • [26] Bruno da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. ICML, 2012.
  • [27] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 2022.
  • [28] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. NeurIPS, 2022.
  • [29] Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens Van der Maaten. Does object recognition work for everyone? CVPR workshops, 2019.
  • [30] Mark Díaz, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan Amironesei, Vinodkumar Prabhakaran, and Emily Denton. CrowdWorkSheets: Accounting for individual and collective identities underlying crowdsourced dataset annotation. ACM Conference on Fairness, Accountability, and Transparency, 2022.
  • [31] Henghui Ding, Scott Cohen, Brian Price, and Xudong Jiang. PhraseClick: toward achieving flexible interactive segmentation by phrase and click. ECCV, 2020.
  • [32] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. TPAMI, 2014.
  • [33] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • [34] Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to recognize objects in egocentric activities. CVPR, 2011.
  • [35] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
  • [36] Thomas B. Fitzpatrick. The validity and practicality of sun-reactive skin types i through vi. Archives of Dermatology, 1988.
  • [37] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and François Pitié. Getting to 99% accuracy in interactive segmentation. arXiv:2003.07932, 2020.
  • [38] Jean-Michel Fortin, Olivier Gamache, Vincent Grondin, François Pomerleau, and Philippe Giguère. Instance segmentation for autonomous log grasping in forestry operations. IROS, 2022.
  • [39] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021.
  • [40] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. CVPR, 2021.
  • [41] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
  • [42] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [43] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR, 2022.
  • [44] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. CVPR, 2019.
  • [45] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. NeurIPS, 2012.
  • [46] Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage. SOCRATES: Introducing depth in visual wildlife monitoring using stereo vision. Sensors, 2022.
  • [47] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. CVPR, 2022.
  • [48] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. ICCV, 2017.
  • [49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [50] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.
  • [51] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022.
  • [52] Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual detection of marine debris. arXiv:2007.08097, 2020.
  • [53] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. ECCV, 2016.
  • [54] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. arXiv:2211.06220, 2022.
  • [55] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. ICML, 2021.
  • [56] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  • [57] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1988.
  • [58] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 2022.
  • [59] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. CVPR, 2019.
  • [60] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
  • [61] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv:1910.09700, 2019.
  • [62] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. ECCV, 2022.
  • [63] Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric actions. CVPR, 2015.
  • [64] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. CVPR, 2018.
  • [65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017.
  • [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
  • [67] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. SimpleClick: Interactive image segmentation with simple vision transformers. arXiv:2210.11006, 2022.
  • [68] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
  • [69] Cathy H Lucas, Daniel OB Jones, Catherine J Hollyhead, Robert H Condon, Carlos M Duarte, William M Graham, Kelly L Robinson, Kylie A Pitt, Mark Schildhauer, and Jim Regetz. Gelatinous zooplankton biomass in the global oceans: geographic variation and environmental drivers. Global Ecology and Biogeography, 2014.
  • [70] Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iteratively trained interactive segmentation. BMVC, 2018.
  • [71] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. CVPR, 2018.
  • [72] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV, 2001.
  • [73] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 3DV, 2016.
  • [74] Massimo Minervini, Andreas Fischbach, Hanno Scharr, and Sotirios A. Tsaftaris. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters, 2016.
  • [75] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. Proceedings of the conference on fairness, accountability, and transparency, 2019.
  • [76] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. ICCV, 2017.
  • [77] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv:2104.10350, 2021.
  • [78] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
  • [79] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling. EDTER: Edge detection with transformer. CVPR, 2022.
  • [80] Mattia Pugliatti and Francesco Topputo. DOORS: Dataset fOr bOuldeRs Segmentation. Zenodo, 2022.
  • [81] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip Torr, and Song Bai. Occluded video instance segmentation: A benchmark. ICCV, 2022.
  • [82] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021.
  • [83] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ICML, 2021.
  • [84] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS, 2015.
  • [85] Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. ICCV, 2003.
  • [86] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. ICCV, 2021.
  • [87] Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step toward more inclusive people annotations for fairness. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021.
  • [88] Sefik Ilkin Serengil and Alper Ozpinar. LightFace: A hybrid deep face recognition framework. ASYU, 2020.
  • [89] Sefik Ilkin Serengil and Alper Ozpinar. HyperExtended LightFace: A facial attribute analysis framework. ICEET, 2021.
  • [90] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, 2006.
  • [91] Corey Snyder and Minh Do. STREETS: A novel camera network dataset for traffic flow. NeurIPS, 2019.
  • [92] Konstantin Sofiiuk, Ilya A Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation. ICIP, 2022.
  • [93] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
  • [94] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. CVPR, 1999.
  • [95] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
  • [96] Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Action recognition in RGB-D egocentric videos. ICIP, 2017.
  • [97] Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Multi-stream deep neural networks for RGB-D egocentric action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • [98] The World Bank. The world by income and regions, 2022. https://datatopics.worldbank.org/world-development-indicators/the-world-by-income-and-region.html.
  • [99] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? NeurIPS, 1995.
  • [100] Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben Burville, and Per Berggren. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation. arXiv:2005.13359, 2020.
  • [101] United States Environmental Protection Agency. Greenhouse Gas Equivalencies Calculator. https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator, 2022.
  • [102] Koen EA van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. ICCV, 2011.
  • [103] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • [104] Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yanjun Wu. Towards real-world prohibited item detection: A large-scale x-ray benchmark. CVPR, 2021.
  • [105] Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, and Du Tran. Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. CVPR, 2022.
  • [106] Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3D reconstruction. CVPR, 2023.
  • [107] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR, 2010.
  • [108] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. ICCV, 2015.
  • [109] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. CVPR, 2016.
  • [110] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. Proceedings of the 2020 conference on fairness, accountability, and transparency, 2020.
  • [111] Lei Yang, Yan Zi Wei, Yisheng HE, Wei Sun, Zhenhang Huang, Haibin Huang, and Haoqiang Fan. iShape: A first step towards irregular shape instance segmentation. arXiv:2109.15068, 2021.
  • [112] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. ICCV, 2019.
  • [113] Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. ECCV, 2022.
  • [114] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. NeurIPS, 2021.
  • [115] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv:1707.09457, 2017.
  • [116] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
  • [117] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.

Appendix

Table of contents:

  • §A: Segment Anything Model and Task Details
  • §B: Automatic Mask Generation Details
  • §C: RAI Additional Details
  • §D: Experiment Implementation Details
  • §E: Human Study Experimental Design
  • §F: Dataset, Annotation, and Model Cards
  • §G: Annotation Guidelines

Appendix A Segment Anything Model and Task Details

Image encoder.

In general, the image encoder can be any network that outputs a C×H×W image embedding. Motivated by scalability and access to strong pre-training, we use an MAE [47] pre-trained Vision Transformer (ViT) [33] with minimal adaptations to process high resolution inputs, specifically a ViT-H/16 with 14×14 windowed attention and four equally-spaced global attention blocks, following [62]. The image encoder’s output is a 16× downscaled embedding of the input image. Since our runtime goal is to process each prompt in real-time, we can afford a high number of image encoder FLOPs because they are computed only once per image, not per prompt.

Following standard practices (e.g., [40]), we use an input resolution of 1024×1024 obtained by rescaling the image and padding the shorter side. The image embedding is therefore 64×64. To reduce the channel dimension, following [62], we use a 1×1 convolution to get to 256 channels, followed by a 3×3 convolution also with 256 channels. Each convolution is followed by a layer normalization [4].
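The shape arithmetic above can be sketched in a few lines of numpy. This is a hypothetical illustration only: the weights are zero placeholders rather than SAM's parameters, the layer normalizations are omitted, and the ViT-H channel width of 1280 is assumed.

```python
import numpy as np

# Sketch of the encoder neck geometry: a 1024x1024 input patchified at
# stride 16 gives a 64x64 token grid; a 1x1 then a 3x3 convolution
# (implemented naively here) reduce the assumed 1280 ViT-H channels to 256.

def neck(vit_features: np.ndarray, w1: np.ndarray, w3: np.ndarray) -> np.ndarray:
    """vit_features: (H, W, C_in); w1: (C_in, 256); w3: (3, 3, 256, 256)."""
    x = vit_features @ w1                      # 1x1 conv == per-pixel matmul
    x = np.pad(x, ((1, 1), (1, 1), (0, 0)))    # 'same' padding for the 3x3 conv
    h, w = vit_features.shape[:2]
    out = np.zeros((h, w, 256))
    for i in range(3):                          # naive 3x3 convolution
        for j in range(3):
            out += x[i:i + h, j:j + w] @ w3[i, j]
    return out

img_side, patch = 1024, 16
grid = img_side // patch                        # 64x64 embedding
feats = np.zeros((grid, grid, 1280))            # placeholder ViT features
emb = neck(feats, np.zeros((1280, 256)), np.zeros((3, 3, 256, 256)))
print(emb.shape)  # (64, 64, 256)
```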

Prompt encoder.

Sparse prompts are mapped to 256-dimensional vectorial embeddings as follows. A point is represented as the sum of a positional encoding [95] of the point’s location and one of two learned embeddings that indicate if the point is either in the foreground or background. A box is represented by an embedding pair: (1) the positional encoding of its top-left corner summed with a learned embedding representing “top-left corner” and (2) the same structure but using a learned embedding indicating “bottom-right corner”. Finally, to represent free-form text we use the text encoder from CLIP [82] (any text encoder is possible in general). We focus on geometric prompts for the remainder of this section and discuss text prompts in depth in §D.5.
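As a rough illustration, a point prompt's embedding can be sketched as a random-Fourier positional encoding of its location [95] plus one of two learned type embeddings. The basis matrix and embedding values below are random placeholders, not SAM's learned weights.

```python
import numpy as np

# Hedged sketch of a 256-dimensional point-prompt embedding:
# Fourier positional features of (x, y) plus a foreground/background
# type embedding. All parameters here are placeholders.

rng = np.random.default_rng(0)
DIM = 256
B = rng.normal(size=(2, DIM // 2))      # random Fourier basis (placeholder)
type_embed = rng.normal(size=(2, DIM))  # [background, foreground] (placeholder)

def encode_point(xy: np.ndarray, is_foreground: bool) -> np.ndarray:
    """xy: point location normalized to [0, 1]^2."""
    xy = 2 * xy - 1                      # map to [-1, 1]
    proj = 2 * np.pi * (xy @ B)
    pe = np.concatenate([np.sin(proj), np.cos(proj)])
    return pe + type_embed[int(is_foreground)]

vec = encode_point(np.array([0.25, 0.75]), is_foreground=True)
print(vec.shape)  # (256,)
```

A box would use the same positional encoding applied to its two corners, each summed with its own corner-type embedding.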

Dense prompts (i.e., masks) have a spatial correspondence with the image. We input masks at a 4× lower resolution than the input image, then downscale an additional 4× using two 2×2, stride-2 convolutions with output channels 4 and 16, respectively. A final 1×1 convolution maps the channel dimension to 256. Each layer is separated by GELU activations [50] and layer normalization. The mask and image embedding are then added element-wise. If there is no mask prompt, a learned embedding representing “no mask” is added to each image embedding location.

Figure 14: Details of the lightweight mask decoder. A two-layer decoder updates both the image embedding and prompt tokens via cross-attention. Then the image embedding is upscaled, from which the updated output tokens are used to dynamically predict masks. (Not illustrated for figure clarity: At every attention layer, positional encodings are added to the image embedding, and the entire original prompt token (including position encoding) is re-added to the token queries and keys.)

Lightweight mask decoder.

This module efficiently maps the image embedding and a set of prompt embeddings to an output mask. To combine these inputs, we take inspiration from Transformer segmentation models [14, 20] and modify a standard Transformer decoder [103]. Before applying our decoder, we first insert into the set of prompt embeddings a learned output token embedding that will be used at the decoder’s output, analogous to the [class] token in [33]. For simplicity, we refer to these embeddings (not including the image embedding) collectively as “tokens”.

Our decoder design is shown in Fig. 14. Each decoder layer performs 4 steps: (1) self-attention on the tokens, (2) cross-attention from tokens (as queries) to the image embedding, (3) a point-wise MLP updates each token, and (4) cross-attention from the image embedding (as queries) to tokens. This last step updates the image embedding with prompt information. During cross-attention, the image embedding is treated as a set of 64² 256-dimensional vectors. Each self/cross-attention and MLP has a residual connection [49], layer normalization, and a dropout [93] of 0.1 at training. The next decoder layer takes the updated tokens and the updated image embedding from the previous layer. We use a two-layer decoder.
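The four steps can be sketched as follows. This is a heavily simplified, hypothetical illustration: single-head attention, no residual connections, layer normalization, dropout, or positional re-injection, and random placeholder weights rather than SAM's.

```python
import numpy as np

# Minimal sketch of one decoder layer's four steps on placeholder data.

def attn(q, k, v):
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def decoder_layer(tokens, img_emb, w_mlp):
    tokens = attn(tokens, tokens, tokens)              # (1) self-attention on tokens
    tokens = attn(tokens, img_emb, img_emb)            # (2) tokens -> image cross-attn
    tokens = np.maximum(tokens @ w_mlp, 0) @ w_mlp.T   # (3) point-wise MLP (dim 2048)
    img_emb = attn(img_emb, tokens, tokens)            # (4) image -> tokens cross-attn
    return tokens, img_emb

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 256))         # prompt tokens + learned output tokens
img_emb = rng.normal(size=(64 * 64, 256))  # 64x64 embedding as 4096 vectors
w_mlp = rng.normal(size=(256, 2048)) * 0.01
tokens, img_emb = decoder_layer(tokens, img_emb, w_mlp)
print(tokens.shape, img_emb.shape)  # (8, 256) (4096, 256)
```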

To ensure the decoder has access to critical geometric information the positional encodings are added to the image embedding whenever they participate in an attention layer. Additionally, the entire original prompt tokens (including their positional encodings) are re-added to the updated tokens whenever they participate in an attention layer. This allows for a strong dependence on both the prompt token’s geometric location and type.

After running the decoder, we upsample the updated image embedding by 4× with two transposed convolutional layers (now it’s downscaled 4× relative to the input image). Then, the tokens attend once more to the image embedding and we pass the updated output token embedding to a small 3-layer MLP that outputs a vector matching the channel dimension of the upscaled image embedding. Finally, we predict a mask with a spatially point-wise product between the upscaled image embedding and the MLP’s output.
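The final prediction step reduces to a per-pixel dot product. A small sketch with random placeholder values, assuming the 32-channel upscaled embedding at a 4× downscale of a 1024×1024 input:

```python
import numpy as np

# Sketch of the spatially point-wise product: the per-mask vector from the
# output-token MLP is dotted against every spatial location of the upscaled
# image embedding, giving one mask logit per pixel.

rng = np.random.default_rng(0)
C, H, W = 32, 256, 256
upscaled = rng.normal(size=(C, H, W))   # placeholder upscaled image embedding
token_vec = rng.normal(size=(C,))       # placeholder output of the 3-layer MLP

mask_logits = np.einsum("c,chw->hw", token_vec, upscaled)
print(mask_logits.shape)  # (256, 256)
```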

The transformer uses an embedding dimension of 256. The transformer MLP blocks have a large internal dimension of 2048, but the MLP is applied only to the prompt tokens, for which there are relatively few (rarely greater than 20). However, in cross-attention layers where we have a 64×64 image embedding, we reduce the channel dimension of the queries, keys, and values by 2× to 128 for computational efficiency. All attention layers use 8 heads.

The transposed convolutions used to upscale the output image embedding are 2×2, stride 2, with output channel dimensions of 64 and 32, and have GELU activations. They are separated by layer normalization.

Making the model ambiguity-aware.

As described, a single input prompt may be ambiguous in the sense that it corresponds to multiple valid masks, and the model will learn to average over these masks. We eliminate this problem with a simple modification: instead of predicting a single mask, we use a small number of output tokens and predict multiple masks simultaneously. By default we predict three masks, since we observe that three layers (whole, part, and subpart) are often enough to describe nested masks. During training, we compute the loss (described shortly) between the ground truth and each of the predicted masks, but only backpropagate from the lowest loss. This is a common technique used for models with multiple outputs [15, 45, 64]. For use in applications, we’d like to rank predicted masks, so we add a small head (operating on an additional output token) that estimates the IoU between each predicted mask and the object it covers.
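The minimum-loss selection can be sketched as follows; a plain IoU-style loss stands in here for the actual focal+dice training loss.

```python
import numpy as np

# Sketch of ambiguity-aware supervision: compute the loss between the ground
# truth and each of the three predicted masks, but backpropagate only the
# minimum, so a single prompt can commit to one of several valid masks.

def mask_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 - inter / max(union, 1)

gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
preds = [np.zeros((8, 8), bool), gt.copy(), np.ones((8, 8), bool)]
losses = [mask_loss(p, gt) for p in preds]
best = int(np.argmin(losses))   # only this mask's loss receives a gradient
print(best, losses[best])  # 1 0.0
```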

Ambiguity is much rarer with multiple prompts and the three output masks will usually become similar. To minimize computation of degenerate losses at training and ensure the single unambiguous mask receives a regular gradient signal, we only predict a single mask when more than one prompt is given. This is accomplished by adding a fourth output token for an additional mask prediction. This fourth mask is never returned for a single prompt and is the only mask returned for multiple prompts.

Losses.

We supervise mask prediction with a linear combination of focal loss [65] and dice loss [73] in a 20:1 ratio of focal loss to dice loss, following [20, 14]. Unlike [20, 14], we observe that auxiliary deep supervision after each decoder layer is unhelpful. The IoU prediction head is trained with mean-square-error loss between the IoU prediction and the predicted mask’s IoU with the ground truth mask. It is added to the mask loss with a constant scaling factor of 1.0.
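A hedged numpy sketch of the combined 20:1 focal [65] + dice [73] loss; the focal-loss γ and α values below are the common defaults from the focal loss paper, not confirmed SAM hyperparameters.

```python
import numpy as np

# Sketch of the mask training loss: 20 * focal + 1 * dice, on predicted
# foreground probabilities p and a binary ground truth gt.

def focal_loss(p, gt, gamma=2.0, alpha=0.25):   # gamma/alpha: assumed defaults
    p_t = np.where(gt == 1, p, 1 - p)
    alpha_t = np.where(gt == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-8))

def dice_loss(p, gt, eps=1.0):
    inter = (p * gt).sum()
    return 1 - (2 * inter + eps) / (p.sum() + gt.sum() + eps)

def mask_loss(p, gt):
    return 20.0 * focal_loss(p, gt) + 1.0 * dice_loss(p, gt)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1
perfect = gt * 0.999 + (1 - gt) * 0.001   # near-perfect prediction
bad = 1 - perfect                         # inverted prediction
print(mask_loss(perfect, gt) < mask_loss(bad, gt))  # True
```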

Training algorithm.

Following recent approaches [92, 37], we simulate an interactive segmentation setup during training. First, with equal probability either a foreground point or bounding box is selected randomly for the target mask. Points are sampled uniformly from the ground truth mask. Boxes are taken as the ground truth mask’s bounding box, with random noise added in each coordinate with standard deviation equal to 10% of the box sidelength, to a maximum of 20 pixels. This noise profile is a reasonable compromise between applications like instance segmentation, which produce a tight box around the target object, and interactive segmentation, where a user may draw a loose box.
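The box perturbation can be sketched directly from the description: per-coordinate Gaussian noise with standard deviation equal to 10% of the box side length, capped at 20 pixels.

```python
import numpy as np

# Sketch of simulating a loose user-drawn box from a ground-truth box.

def noisy_box(box, rng):
    """box = (x0, y0, x1, y1), the ground-truth mask's bounding box."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    std = np.array([0.1 * w, 0.1 * h, 0.1 * w, 0.1 * h])
    noise = np.clip(rng.normal(scale=std), -20, 20)   # cap noise at 20 px
    return np.array(box, float) + noise

rng = np.random.default_rng(0)
box = noisy_box((100, 100, 300, 200), rng)
print(box.shape)  # (4,)
```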

After making a prediction from this first prompt, subsequent points are selected uniformly from the error region between the previous mask prediction and the ground truth mask. Each new point is foreground or background if the error region is a false negative or false positive, respectively. We also supply the mask prediction from the previous iteration as an additional prompt to our model. To provide the next iteration with maximal information, we supply the unthresholded mask logits instead of the binarized mask. When multiple masks are returned, the mask passed to the next iteration and used to sample the next point is the one with the highest predicted IoU.
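The error-region sampling rule can be sketched as:

```python
import numpy as np

# Sketch of iterative point sampling: pick a point uniformly from the error
# region between the previous prediction and the ground truth; label it
# foreground for a false negative, background for a false positive.

def sample_correction_point(pred, gt, rng):
    error = pred != gt
    ys, xs = np.nonzero(error)
    i = rng.integers(len(ys))
    y, x = ys[i], xs[i]
    is_foreground = bool(gt[y, x])   # false negative -> foreground click
    return (y, x), is_foreground

rng = np.random.default_rng(0)
gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), bool)        # the model missed the object entirely
point, fg = sample_correction_point(pred, gt, rng)
print(fg)  # True
```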

We find diminishing returns after 8 iteratively sampled points (we have tested up to 16). Additionally, to encourage the model to benefit from the supplied mask, we also use two more iterations where no additional points are sampled. One of these iterations is randomly inserted among the 8 iteratively sampled points, and the other is always at the end. This gives 11 total iterations: one sampled initial input prompt, 8 iteratively sampled points, and two iterations where no new external information is supplied to the model so it can learn to refine its own mask predictions. We note that using a relatively large number of iterations is possible because our lightweight mask decoder requires less than 1% of the image encoder’s compute and, therefore, each iteration adds only a small overhead. This is unlike previous interactive methods that perform only one or a few interactive steps per optimizer update [70, 9, 37, 92].

Training recipe.

We use the AdamW [68] optimizer (β₁ = 0.9, β₂ = 0.999) and a linear learning rate warmup [42] for 250 iterations and a step-wise learning rate decay schedule. The initial learning rate (lr), after warmup, is 8e-4. We train for 90k iterations (~2 SA-1B epochs) and decrease the lr by a factor of 10 at 60k iterations and again at 86666 iterations. The batch size is 256 images. To regularize SAM, we set weight decay (wd) to 0.1 and apply drop path [53] (dp) with a rate of 0.4. We use a layer-wise learning rate decay [5] (ld) of 0.8. No data augmentation is applied. We initialize SAM from an MAE [47] pre-trained ViT-H. We distribute training across 256 GPUs, due to the large image encoder and 1024×1024 input size. To limit GPU memory usage, we train with up to 64 randomly sampled masks per GPU. Additionally, we find that lightly filtering SA-1B masks to discard any that cover more than 90% of the image qualitatively improves results.
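The learning rate schedule above can be written out explicitly as a sketch (250-iteration linear warmup to 8e-4, then 10× step decays at 60k and 86666 iterations):

```python
# Sketch of the step-wise learning rate schedule described above.

def lr_at(step: int, base_lr: float = 8e-4, warmup: int = 250) -> float:
    if step < warmup:
        return base_lr * (step + 1) / warmup   # linear warmup
    lr = base_lr
    for milestone in (60_000, 86_666):
        if step >= milestone:
            lr /= 10                            # step decay by 10x
    return lr

print(lr_at(1_000), lr_at(70_000), lr_at(89_000))
```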

For ablations and other variations on training (e.g., text-to-mask §D.5), we deviate from the default recipe above as follows. When training with data from the first and second data engine stages only, we augment the input with large-scale jitter [40] with a scale range of [0.1, 2.0]. Intuitively, data augmentation may be helpful when training data is more limited. To train ViT-B and ViT-L, we use 180k iterations with batch size 128 distributed across 128 GPUs. We set lr = 8e-4/4e-4, ld = 0.6/0.8, wd = 0.1, and dp = 0.6/0.4 for ViT-B/L, respectively.

Appendix B Automatic Mask Generation Details

Here we discuss details of the data engine’s fully automatic stage that was used to generate the released SA-1B.

Cropping.

Masks were generated from a regular grid of 32×32 points on the full image and 20 additional zoomed-in image crops arising from 2×2 and 4×4 partially overlapping windows using 16×16 and 8×8 regular point grids, respectively. The original high-resolution images were used for cropping (this was the only time we used them). We removed masks that touch the inner boundaries of the crops. We applied standard greedy box-based NMS (boxes were used for efficiency) in two phases: first within each crop and second across crops. When applying NMS within a crop, we used the model’s predicted IoU to rank masks. When applying NMS across crops, we ranked masks from most zoomed-in (i.e., from a 4×4 crop) to least zoomed-in (i.e., the original image), based on their source crop. In both cases, we used an NMS threshold of 0.7.
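The two-phase suppression can be illustrated with a plain greedy box-based NMS. This is a minimal sketch, not the released code (the names `box_iou` and `greedy_nms` are ours); the same routine would be applied first within each crop with scores given by predicted IoU, then across crops with scores given by zoom level.

```python
def box_iou(a, b):
    # Boxes are (x0, y0, x1, y1).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.7):
    # Standard greedy NMS: visit boxes in descending score order and keep
    # a box only if it does not overlap an already-kept box above threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

With the 0.7 threshold from the text, a box overlapping a higher-scoring box at IoU ≥ 0.7 is suppressed.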

Filtering.

We used three filters to increase mask quality. First, to keep only confident masks we filtered by the model’s predicted IoU score at a threshold of 88.0. Second, to keep only stable masks we compared two binary masks resulting from the same underlying soft mask by thresholding it at different values. We kept the prediction (i.e., the binary mask resulting from thresholding logits at 0) only if the IoU between its pair of -1 and +1 thresholded masks was equal to or greater than 95.0. Third, we noticed that occasionally an automatic mask would cover the entire image. These masks were generally uninteresting, and we filtered them by removing masks that covered 95% or more of an image. All filtering thresholds were selected to achieve both a large number of masks and high mask quality as judged by professional annotators using the method described in §5.
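The stability filter can be sketched as follows. This is our own minimal reimplementation (the name `stability_score` is ours): it measures the IoU between the same logit map thresholded at +1 and at −1; a mask would be kept when the score is at least 0.95, matching the 95.0 threshold in the text.

```python
def stability_score(logits, offset=1.0):
    # IoU between the binary mask thresholded at +offset and at -offset.
    # Since the +offset mask is a subset of the -offset mask, this reduces
    # to |logits > +offset| / |logits > -offset|.
    inter = union = 0
    for row in logits:
        for v in row:
            hi, lo = v > offset, v > -offset
            inter += hi and lo
            union += hi or lo
    return inter / union if union else 0.0
```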

Postprocessing.

We observed two error types that are easily mitigated with postprocessing. First, an estimated 4% of masks include small, spurious components. To address these, we removed connected components with area less than 100 pixels (including removing entire masks if the largest component is below this threshold). Second, another estimated 4% of masks include small, spurious holes. To address these, we filled holes with area less than 100 pixels. Holes were identified as components of inverted masks.
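This postprocessing can be sketched with a plain flood fill (our own reimplementation; the released pipeline operates on full-resolution masks): small foreground components are removed, then small components of the inverted mask, i.e. holes, are filled. Treating every small background component as a hole is an assumption about border handling on our part.

```python
from collections import deque

def _components(grid, value):
    # 4-connected components of cells equal to `value`.
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == value and (y, x) not in seen:
                comp, q = [], deque([(y, x)])
                seen.add((y, x))
                while q:
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] == value and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def clean_mask(mask, min_area=100):
    # First drop foreground components, then fill background holes,
    # whenever the component area is below min_area (100 px in the text).
    for value, fill in ((1, 0), (0, 1)):
        for comp in _components(mask, value):
            if len(comp) < min_area:
                for y, x in comp:
                    mask[y][x] = fill
    return mask
```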

Automatic mask generation model.

We trained a special version of SAM for fully automatic mask generation that sacrifices some inference speed for improved mask generation properties. It differs from our default SAM in the following ways: it was trained on manual and semi-automatic data only; it was trained for longer (177656 iterations instead of 90k) with large-scale jitter data augmentation [40]; simulated interactive training used only point and mask prompts (no boxes); only 4 points were sampled per mask during training (reducing from our default of 9 to 4 sped up training iterations and had no impact on 1-point performance, though it would harm mIoU if evaluating with more points); and finally, the mask decoder used 3 layers instead of 2.

SA-1B examples.

We show SA-1B samples in Fig. 2. For more examples, please see our dataset explorer.

Appendix C RAI Additional Details

Inferring geographic information for SA-1B.

While the images in SA-1B are not geo-tagged, each image has a caption describing its contents and where it was taken. We infer approximate image geo-locations from these captions using an Elmo-based named entity recognition model [78]. Each extracted location entity is mapped to every matching country, province, and city. Captions are mapped to a single country by first considering the matching countries, then provinces, and finally cities. We note that there are ambiguities and potential for biases with this method (e.g., “Georgia” may refer to the country or the US state). As such, we use the extracted locations to analyze the dataset as a whole, but do not release the inferred locations. The captions will not be released publicly as required by the image provider.

Inferring geographic information for COCO and Open Images.

The COCO [66] and Open Images [60] datasets do not provide geo-locations. Following [29], we retrieve geographic metadata using the Flickr API. We retrieved locations for 24% of the COCO training set (19,562 images) and for Open Images we retrieved 18% of the training set (493,517 images, after only considering images with masks). We note that the geographic information is approximate, and the sample of images with this information may not fully match the full dataset distribution.

Inferring income information.

We use each image’s inferred country to look up its income level using the levels defined by The World Bank [98]. We collapse the upper-middle and lower-middle levels into a single middle level.

Fairness in segmenting people.

To investigate SAM’s fairness at segmenting people we use the More Inclusive Annotations for People (MIAP) [87] test set annotations for Open Images [60], which allows us to compare SAM’s performance across perceived gender presentation and perceived age group. MIAP provides box annotations, while we need ground truth masks for this analysis. To get ground truth masks, we select each person-category mask from Open Images if its corresponding bounding box is within a 1% margin (based on relative box side lengths) of an annotated bounding box in MIAP, resulting in 3.9k masks.
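The box-matching step admits more than one reading of “within a 1% margin”; one plausible sketch (our interpretation, with our own function name) compares each box coordinate against a tolerance of 1% of the corresponding side length of the MIAP box.

```python
def boxes_match(a, b, margin=0.01):
    # Boxes are (x0, y0, x1, y1). Box `a` (from Open Images) matches box `b`
    # (from MIAP) when every coordinate differs by at most `margin` of the
    # corresponding side length of `b`. This per-coordinate reading is an
    # assumption; the paper only states "1% margin (based on relative box
    # side lengths)".
    w, h = b[2] - b[0], b[3] - b[1]
    tol = (margin * w, margin * h, margin * w, margin * h)
    return all(abs(ac - bc) <= t for ac, bc, t in zip(a, b, tol))
```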

| perceived gender presentation | mIoU at 1 point | mIoU at 3 points |
|---|---|---|
| feminine | 76.3±1.1 | 90.7±0.5 |
| masculine | 81.0±1.2 | 92.3±0.4 |

| perceived age group | mIoU at 1 point | mIoU at 3 points |
|---|---|---|
| older | 81.9±3.8 | 92.8±1.6 |
| middle | 78.2±0.8 | 91.3±0.3 |
| young | 77.3±2.7 | 91.5±0.9 |

Table 6: SAM’s performance segmenting clothing across perceived gender presentation and age group. The intervals for perceived gender are disjoint, with mIoU for masculine being higher. Confidence intervals for age group overlap.

Fairness in segmenting clothing.

We extend our analysis from §6 to clothing segmentation. We look at SAM’s performance on clothing relative to the attributes of those wearing the clothes. We use all 6.5k ground truth masks from Open Images that have a category under the clothing superclass and reside within a person box from MIAP. In Table 6 we compare performance across perceived gender presentation and age group. We find that SAM is better at segmenting clothing on those who present predominantly masculine, with disjoint 95% confidence intervals. The gap closes when moving from 1 to 3 point evaluation. Differences for perceived age group are not significant. Our results indicate there is a bias when segmenting clothing across perceived gender presentation with a one point prompt, and we encourage users of SAM to be mindful of this limitation.

Appendix D Experiment Implementation Details

D.1 Zero-Shot Single Point Valid Mask Evaluation

| dataset | abbreviation | image type | description | mask type | source split | # images sampled | # masks sampled |
|---|---|---|---|---|---|---|---|
| Plant Phenotyping Datasets Leaf Segmentation [74] | PPDLS | Plants | Leaf segmentation for images of tobacco and ara plants. | Instance | N/A | 182 | 2347 |
| BBBC038v1 from Broad Bioimage Benchmark Collection [12] | BBBC038v1 | Microscopy | Biological images of cells in a variety of settings testing robustness in nuclei segmentation. | Instance | Train | 227 | 10506 |
| Dataset fOr bOuldeRs Segmentation [80] | DOORS | Boulders | Segmentation masks of single boulders positioned on the surface of a spherical mesh. | Instance | DS1 | 10000 | 10000 |
| TimberSeg 1.0 [38] | TimberSeg | Logs | Segmentation masks of individual logs in piles of timber in various environments and conditions. Images are taken from an operator’s point-of-view. | Instance | N/A | 220 | 2487 |
| Northumberland Dolphin Dataset 2020 [100] | NDD20 | Underwater | Segmentation masks of two different dolphin species in images taken above and under water. | Instance | N/A | 4402 | 6100 |
| Large Vocabulary Instance Segmentation [44] | LVIS | Scenes | Additional annotations for the COCO [66] dataset to enable the study of long-tailed object detection and segmentation. | Instance | Validation (v0.5) | 945 | 9642 |
| STREETS [91] | STREETS | Traffic camera | Segmentation masks of cars in traffic camera footage. | Instance | N/A | 819 | 9854 |
| ZeroWaste-f [6] | ZeroWaste-f | Recycling | Segmentation masks in cluttered scenes of deformed recycling waste. | Instance | Train | 2947 | 6155 |
| iShape [111] | iShape | Irregular shapes | Segmentation masks of irregular shapes like antennas, logs, fences, and hangers. | Instance | Validation | 754 | 9742 |
| ADE20K [117] | ADE20K | Scenes | Object and part segmentation masks for images from SUN [107] and Places [116] datasets. | Instance | Validation | 302 | 10128 |
| Occluded Video Instance Segmentation [81] | OVIS | Occlusions | Instance segmentation masks in videos, focusing on objects that are occluded. | Instance | Train | 2044 | 10011 |
| Hypersim [86] | Hypersim | Simulation | Photorealistic synthetic dataset of indoor scenes with instance masks. | Instance | Evermotion archinteriors volumes 1-55 excluding 20,25,40,49 | 338 | 9445 |
| Night and Day Instance Segmented Park [22, 23] | NDISPark | Parking lots | Images of parking lots from video footage taken at day and night during different weather conditions and camera angles for vehicle segmentation. | Instance | Train | 111 | 2577 |
| EPIC-KITCHENS VISOR [28, 27] | VISOR | Egocentric | Segmentation masks for hands and active objects in ego-centric video from the cooking dataset EPIC-KITCHENS [27]. | Instance | Validation | 1864 | 10141 |
| Plittersdorf dataset [46] | Plittersdorf | Stereo images | Segmentation masks of wildlife in images taken with the SOCRATES stereo camera trap. | Instance | Train, validation, test | 187 | 546 |
| Egocentric Hand-Object Segmentation [113] | EgoHOS | Egocentric | Fine-grained egocentric hand-object segmentation dataset. Dataset contains mask annotations for existing datasets. | Instance | Train (including only Ego4D [43] and THU-READ [97, 96]) | 2940 | 9961 |
| InstanceBuilding 2D [17] | IBD | Drones | High-resolution drone UAV images annotated with roof instance segmentation masks. | Instance | Train (2D annotations) | 467 | 11953 |
| WoodScape [112] | WoodScape | Fisheye driving | Fisheye driving dataset with segmentation masks. Images are taken from four surround-view cameras. | Instance | Set 1 | 107 | 10266 |
| Cityscapes [25] | Cityscapes | Driving | Stereo video of street scenes with segmentation masks. | Panoptic | Validation | 293 | 9973 |
| PIDray [104] | PIDRay | X-ray | Segmentation masks of prohibited items in X-ray images of baggage. | Instance | Test (hard) | 3733 | 8892 |
| Diverse Realism in Art Movements [24] | DRAM | Paintings | Domain adaptation dataset for semantic segmentation of art paintings. | Semantic | Test | 718 | 1179 |
| TrashCan [52] | TrashCan | Underwater | Segmentation masks of trash in images taken by underwater ROVs. Images are sourced from the J-EDI [69] dataset. | Instance | Train (instance task) | 5936 | 9540 |
| Georgia Tech Egocentric Activity Datasets [34, 63] | GTEA | Egocentric | Videos are composed of four different subjects performing seven types of daily activities with segmentation masks of hands. | Instance | Train (segmenting hands task) | 652 | 1208 |

Table 7: Segmentation datasets used to evaluate zero-shot segmentation with point prompts. The 23 datasets cover a broad range of domains; see column “image type”. To make our evaluation efficient, we subsample datasets that have more than 15k masks. Specifically, we randomly sampled images so that the total number of masks in the images is ∼10k.

Datasets.

We built a new segmentation benchmark to evaluate the zero-shot transfer capabilities of our model using a suite of 23 diverse segmentation datasets from prior work. A description of each dataset is given in Table 7. For examples, see main text Fig. 8. This suite covers a range of domains including egocentric [34, 28, 113], microscopy [12], X-ray [104], underwater [52, 100], aerial [17], simulation [86], driving [25], and painting [24] images. For efficient evaluation we subsampled datasets with more than 15k masks. Specifically, we randomly picked images so that the total number of masks in the sampled images was ∼10k. We blurred faces of people in all the datasets.
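The subsampling procedure can be sketched as follows; this is a hypothetical helper (the name, the budget value, and the seed are our own choices): images are drawn in random order until the running mask total reaches the ∼10k budget.

```python
import random

def subsample_to_mask_budget(mask_counts, budget=10000, seed=0):
    # mask_counts: {image_id: number of masks in that image}.
    # Randomly order the images, then take them until the cumulative
    # mask count reaches the budget.
    ids = list(mask_counts)
    random.Random(seed).shuffle(ids)
    chosen, total = [], 0
    for img in ids:
        if total >= budget:
            break
        chosen.append(img)
        total += mask_counts[img]
    return chosen, total
```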

Point sampling.

Our default point sampling follows standard practice in interactive segmentation [109, 64, 92]. The first point is chosen deterministically as the point farthest from the object boundary. Each subsequent point is the farthest from the boundary of the error region between ground truth and the previous prediction. Some experiments (where specified) use a more challenging sampling strategy in which the first point is a random point, rather than a deterministically selected “center” point. Each subsequent point is selected as described above. This setting better reflects use cases in which the first point is not reliably near the center of the mask, such as prompting from eye gaze.
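The deterministic first-point choice can be sketched with a multi-source BFS distance transform. This is our own reimplementation (interactive segmentation toolkits typically use a Euclidean distance transform instead): distances are propagated from background pixels, and the foreground pixel farthest from the boundary is returned. Subsequent points would apply the same selection to the error region (the XOR of prediction and ground truth).

```python
from collections import deque

def farthest_from_boundary(mask):
    # Multi-source BFS from all background pixels; returns the foreground
    # pixel with the largest distance, or None if the mask has no background.
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                dist[y][x] = 0
                q.append((y, x))
    while q:
        y, x = q.popleft()
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                q.append((ny, nx))
    best, best_d = None, -1
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and dist[y][x] is not None and dist[y][x] > best_d:
                best, best_d = (y, x), dist[y][x]
    return best
```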

Evaluation.

We measure IoU between a prediction after N point prompts and a ground truth mask, where N = {1, 2, 3, 5, 9} and points are sampled iteratively with either of the strategies described above. The per-dataset mIoU is the per-mask IoU averaged across all objects in the dataset. Finally, we report the top-line metric by averaging the per-dataset mIoUs across all 23 datasets. Our evaluation differs from the standard interactive segmentation evaluation protocol, which measures the average number of points needed to achieve X% IoU, with up to 20 points. We focus on predictions after just one, or possibly a few points, since many of our use cases involve a single or very few prompts. Given our application focus, which requires real-time prompt processing, we expect the best interactive segmentation models to outperform SAM when using a large number of points.
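The metric aggregation is straightforward to sketch (the helper names below are ours): per-mask IoUs are averaged within each dataset, and the per-dataset mIoUs are averaged to give the top-line number.

```python
def mask_iou(pred, gt):
    # Binary masks as 2D lists of 0/1.
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 1.0

def benchmark_miou(per_dataset_ious):
    # per_dataset_ious: {dataset name: [IoU for each evaluated mask]}.
    # Top-line metric: mean over datasets of the per-dataset mean IoU.
    per_dataset = {d: sum(v) / len(v) for d, v in per_dataset_ious.items()}
    return sum(per_dataset.values()) / len(per_dataset)
```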

Baselines.

We use three recent strong interactive baselines: RITM [92], FocalClick [18], and SimpleClick [67]. For each, we use the largest models trained on the broadest datasets publicly released by the authors. For RITM, we use HRNet32 IT-M trained on the combination of COCO [66] and LVIS [44] introduced by the authors. For FocalClick, we use SegFormerB3-S2 trained on a “combined dataset” that includes 8 different segmentation datasets [18]. For SimpleClick, we use ViT-H448 trained on a combination of COCO and LVIS. We follow the suggested default strategies for data pre-processing (i.e., data augmentations or image resizing) and do not change or adapt any parameters for our evaluation. In our experiments, we observe that RITM outperforms other baselines on our 23 dataset suite with 1 point evaluation. Therefore, we use RITM as the default baseline. When evaluating with more points we report results for all baselines.

Single point ambiguity and oracle evaluation.

In addition to IoU after N point prompts, we report SAM’s “oracle” performance at 1 point by evaluating the predicted mask that best matches ground truth from amongst SAM’s three predictions (rather than using the one that SAM itself ranks first, as we do by default). This protocol addresses possible single point prompt ambiguity by relaxing the requirement to guess the one right mask among several valid objects.

Figure 15: Additional visualizations of zero-shot edge predictions on BSDS500 (columns: image, ground truth, SAM). Recall that SAM was not trained to predict edge maps and did not have access to BSDS images and annotations during training.

D.2 Zero-Shot Edge Detection

Dataset and metrics.

We perform zero-shot edge detection experiments on BSDS500 [72, 3]. The ground truth for each image comes from the manual annotations of five different subjects. We report results on the 200 image test subset using the four standard metrics for edge detection [3, 32]: optimal dataset scale (ODS), optimal image scale (OIS), average precision (AP), and recall at 50% precision (R50).

Method.

For zero-shot transfer, we use a simplified version of our automatic mask generation pipeline. We prompt SAM with a 16×16 regular grid of foreground points, which yields 768 predicted masks (three per point). We do not filter by predicted IoU or stability. Redundant masks are removed by NMS. Then we apply a Sobel filter to the remaining masks’ unthresholded probability maps and set values to zero if they do not intersect with the outer boundary pixels of a mask. Finally, we take a pixel-wise max over all the predictions, linearly normalize the result to [0, 1], and apply edge NMS [13] to thin the edges.
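The Sobel and max/normalize steps can be sketched as follows. This is our own reimplementation with a plain 3×3 Sobel pair; it omits the boundary-intersection zeroing and the final edge NMS described above.

```python
def sobel_magnitude(p):
    # Gradient magnitude of a 2D probability map via 3x3 Sobel kernels.
    # Border pixels are left at zero for simplicity.
    h, w = len(p), len(p[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (p[y-1][x+1] + 2*p[y][x+1] + p[y+1][x+1]
                  - p[y-1][x-1] - 2*p[y][x-1] - p[y+1][x-1])
            gy = (p[y+1][x-1] + 2*p[y+1][x] + p[y+1][x+1]
                  - p[y-1][x-1] - 2*p[y-1][x] - p[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def combine_edge_maps(maps):
    # Pixel-wise max over all per-mask edge maps, then linear [0, 1] scaling.
    h, w = len(maps[0]), len(maps[0][0])
    out = [[max(m[y][x] for m in maps) for x in range(w)] for y in range(h)]
    peak = max(max(row) for row in out)
    if peak > 0:
        out = [[v / peak for v in row] for row in out]
    return out
```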

Visualizations.

In Fig. 15, we show additional examples of zero-shot edge predictions from SAM. These qualitative examples further illustrate how SAM tends to output sensible edge maps, despite not being trained for edge detection, and we see that the edges can align well with the human annotations. However, since SAM is not trained for edge detection, it does not learn the biases of the BSDS500 dataset and often outputs more edges than are present in the ground truth annotations.

D.3 Zero-Shot Object Proposals

Dataset and metrics.

We report the standard average recall (AR) metric for masks at 1000 proposals on the LVIS v1 validation set [44]. Since LVIS has high-quality masks for 1203 object classes, it provides a challenging test for object proposal generation. We focus on AR@1000 due to the open-world nature of our model, which will likely produce many valid masks outside even the 1203 classes in LVIS. To measure performance on frequent, common, and rare categories, we use AR@1000 but measured against a ground truth set containing just the corresponding LVIS categories.

Baseline.

We use cascade ViTDet-H as a baseline, the strongest model from [62] by AP on LVIS. As noted in the main text, an object detector trained in-domain can “game” AR [16] and is expected to be a stronger baseline than other models that focus on open-world proposals or segmentation [58, 105]. To produce 1000 proposals, we disable score thresholding in the three cascade stages and raise the maximum number of predictions per stage to 1000.

Method.

We use a modified version of SAM’s automatic mask generation pipeline for zero-shot transfer. First, to make inference time comparable to that of ViTDet, we do not process image crops. Second, we remove filtering by predicted IoU and stability. This leaves two tunable parameters to get ∼1000 masks per image: the input point grid and the NMS threshold used for duplicate mask suppression. We choose a 64×64 point grid and an NMS threshold of 0.9, which produces ∼900 masks per image on average. At evaluation, if greater than 1000 masks have been proposed in an image, they are ranked by the average of their confidence and stability scores, then truncated to the top 1000 proposals.
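The ranking-and-truncation step at evaluation can be sketched as follows (the mask records with `confidence` and `stability` fields are a hypothetical representation of ours):

```python
def top_proposals(masks, limit=1000):
    # Each mask is a dict carrying "confidence" and "stability" in [0, 1].
    # Rank by the average of the two scores and keep the best `limit`.
    ranked = sorted(masks,
                    key=lambda m: (m["confidence"] + m["stability"]) / 2,
                    reverse=True)
    return ranked[:limit]
```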

We hypothesize that SAM’s ability to output multiple masks is especially valuable for this task, since recall should benefit from proposals generated at multiple scales from a single input point. To test this, we compare to an ablated version of SAM that only outputs a single mask instead of three (SAM - single-output). Since this model produces fewer masks, we further increase the number of points sampled and the NMS threshold to 128×128 and 0.95, respectively, obtaining ∼950 masks per image on average. Additionally, single-output SAM does not produce the IoU score used to rank masks for NMS in the automatic mask generation pipeline, so masks are instead ranked randomly. Testing suggests this has similar performance to more sophisticated methods of ranking masks, such as using the max logit value of the mask as a proxy for model confidence.

Figure 16: Zero-shot instance segmentation on LVIS v1 (columns: ground truth, ViTDet, SAM). SAM produces higher quality masks than ViTDet. As a zero-shot model, SAM does not have the opportunity to learn specific training data biases; see top-right as an example where SAM makes a modal prediction, whereas the ground truth in LVIS is amodal given that mask annotations in LVIS have no holes.

D.4 Zero-Shot Instance Segmentation

Method.

For zero-shot instance segmentation, we prompt SAM with the boxes output by a fully-supervised ViTDet-H on COCO and LVIS v1 validation splits. We apply an additional mask refinement iteration by feeding the most confident predicted mask, together with the box prompt, back to the mask decoder to produce the final prediction. We show zero-shot instance segmentations predicted on LVIS in Fig. 16. Compared to ViTDet, SAM tends to produce higher quality masks with cleaner boundaries. We confirm this observation with human studies in §7.4. Note that as a zero-shot model, SAM is not able to learn annotation biases in a dataset. For instance, we see that SAM makes a valid modal prediction for the plate, whereas LVIS masks cannot contain holes by design so the plate is annotated amodally.

D.5 Zero-Shot Text-to-Mask

Model and training.

We use the largest publicly available CLIP model [82] (ViT-L/14@336px) to compute text and image embeddings, which we ℓ2-normalize prior to use. To train SAM, we use masks from the first two stages of our data engine. Moreover, we discard all masks with an area smaller than 100² pixels. We train this model with large-scale jitter [40] for 120k iterations with batch size 128. All other training parameters follow our default settings.

Generating training prompts.

To extract an input prompt we first expand the bounding box around each mask by a random factor from 1× to 2×, square-crop the expanded box to maintain its aspect ratio, and resize it to 336×336 pixels. Before feeding the crop to the CLIP image encoder, with 50% probability we zero-out pixels outside the mask. To ensure the embedding focuses on the object, we use masked attention in the last layer to restrict attention from the output token to the image positions inside the mask. Finally, our prompt is the output token embedding. For training we supply the CLIP-based prompt first, followed by additional iterative point prompts to refine the prediction.
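The crop-extraction geometry can be sketched as follows; this is one plausible reading of the “square-crop” step (our own function, which pads the shorter side to a square rather than cropping the longer one), with the 336×336 resize assumed to happen downstream.

```python
import random

def expand_and_square(box, img_w, img_h, rng=None):
    # box = (x0, y0, x1, y1) around a mask. Expand about the box center by
    # a random factor in [1, 2], then grow the shorter side to a square so
    # the later 336x336 resize does not distort the aspect ratio. Padding
    # (rather than cropping) to a square is our assumption.
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    factor = rng.uniform(1.0, 2.0)
    side = max((x1 - x0) * factor, (y1 - y0) * factor)
    half = side / 2
    # Clamp the crop to the image bounds.
    return (max(0.0, cx - half), max(0.0, cy - half),
            min(float(img_w), cx + half), min(float(img_h), cy + half))
```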