
Segment Anything

Alexander Kirillov1,2,4  Eric Mintun2  Nikhila Ravi1,2  Hanzi Mao2  Chloe Rolland3  Laura Gustafson3
Tete Xiao3      Spencer Whitehead      Alexander C. Berg      Wan-Yen Lo      Piotr Dollár4      Ross Girshick4
1project lead   2joint first author   3equal contribution   4directional lead

Meta AI Research, FAIR
Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Figure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range of tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of over 1 billion masks.

1 Introduction

Large language models pre-trained on web-scale datasets are revolutionizing NLP with strong zero-shot and few-shot generalization [10]. These “foundation models” [8] can generalize to tasks and data distributions beyond those seen during training. This capability is often implemented with prompt engineering in which hand-crafted text is used to prompt the language model to generate a valid textual response for the task at hand. When scaled and trained with abundant text corpora from the web, these models’ zero and few-shot performance compares surprisingly well to (even matching in some cases) fine-tuned models [10, 21]. Empirical trends show this behavior improving with model scale, dataset size, and total training compute [56, 10, 21, 51].

Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent illustration aligns paired text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to novel visual concepts and data distributions. Such encoders also compose effectively with other modules to enable downstream tasks, such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision includes a wide range of problems beyond this scope, and for many of these, abundant training data does not exist.

In this work, our goal is to build a foundation model for image segmentation. That is, we seek to develop a promptable model and pre-train it on a broad dataset using a task that enables powerful generalization. With this model, we aim to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

The success of this plan hinges on three components: task, model, and data. To develop them, we address the following questions about image segmentation:

  1. What task will enable zero-shot generalization?

  2. What is the corresponding model architecture?

  3. What data can power this task and model?

These questions are entangled and require a comprehensive solution. We start by defining a promptable segmentation task that is general enough to provide a powerful pre-training objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompting and can output segmentation masks in real-time when prompted to allow for interactive use. To train our model, we need a diverse, large-scale source of data. Unfortunately, there is no web-scale data source for segmentation; to address this, we build a “data engine”, i.e., we iterate between using our efficient model to assist in data collection and using the newly collected data to improve the model. We introduce each interconnected component next, followed by the dataset we created and the experiments that demonstrate the effectiveness of our approach.

Task (§2).

In NLP and more recently computer vision, foundation models are a promising development that can perform zero-shot and few-shot learning for new datasets and tasks often by using “prompting” techniques. Inspired by this line of work, we propose the promptable segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt (see Fig. 1a). A prompt simply specifies what to segment in an image, e.g., a prompt can include spatial or text information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. We use the promptable segmentation task as both a pre-training objective and to solve general downstream segmentation tasks via prompt engineering.

Model (§3).

The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, needs to compute masks in amortized real-time to allow interactive use, and must be ambiguity-aware. Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks. We refer to this model as the Segment Anything Model, or SAM (see Fig. 1b). By separating SAM into an image encoder and a fast prompt encoder / mask decoder, the same image embedding can be reused (and its cost amortized) with different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ∼50ms in a web browser. We focus on point, box, and mask prompts, and also present initial results with free-form text prompts. To make SAM ambiguity-aware, we design it to predict multiple masks for a single prompt, allowing SAM to naturally handle ambiguity, such as the shirt vs. person example.
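The encoder/decoder split can be sketched with stand-in functions (all names hypothetical; the real components are a ViT image encoder and a small Transformer-based decoder). The expensive encoder runs once per image, and its output is reused for every prompt:

```python
def heavy_image_encoder(image):
    # Stand-in for the ViT image encoder: expensive, run once per image.
    return sum(image)

def light_prompt_decode(embedding, prompt):
    # Stand-in for the prompt encoder + mask decoder: cheap, run per prompt.
    return embedding + prompt

def segment_interactively(image, prompts):
    embedding = heavy_image_encoder(image)  # encoder cost paid once
    # The same embedding is reused for every prompt, amortizing that cost
    # across an interactive session.
    return [light_prompt_decode(embedding, p) for p in prompts]
```

This separation is what makes the ∼50ms per-prompt latency possible: only the light half of the model runs inside the interactive loop.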

Data engine (§4).

To achieve strong generalization to new data distributions, we found it necessary to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. While a typical approach for foundation models is to obtain data online [82], masks are not naturally abundant and thus we need an alternative strategy. Our solution is to build a “data engine”, i.e., we co-develop our model with model-in-the-loop dataset annotation (see Fig. 1c). Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM can automatically generate masks for a subset of objects by prompting it with likely object locations and annotators focus on annotating the remaining objects, helping increase mask diversity. In the final stage, we prompt SAM with a regular grid of foreground points, yielding on average ∼100 high-quality masks per image.

Dataset (§5).

Our final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Fig. 2). SA-1B, collected fully automatically using the final stage of our data engine, has 400× more masks than any existing segmentation dataset [66, 44, 117, 60], and as we verify extensively, the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, we hope SA-1B becomes a valuable resource for research aiming to build new foundation models.

Responsible AI (§6).

We study and report on potential fairness concerns and biases when using SA-1B and SAM. Images in SA-1B span a geographically and economically diverse set of countries and we found that SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases. We provide model and dataset cards in the appendix.

Experiments (§7).

We extensively evaluate SAM. First, using a diverse new suite of 23 segmentation datasets, we find that SAM produces high-quality masks from a single foreground point, often only slightly below that of the manually annotated ground truth. Second, we find consistently strong quantitative and qualitative results on a variety of downstream tasks under a zero-shot transfer protocol using prompt engineering, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out-of-the-box with prompt engineering to solve a variety of tasks involving object and image distributions beyond SAM’s training data. Nevertheless, room for improvement remains, as we discuss in §8.

Release.

We are releasing the SA-1B dataset for research purposes and making SAM available under a permissive open license (Apache 2.0) at https://segment-anything.com. We also showcase SAM’s capabilities with an online demo.

[Figure 2 image grid: example SA-1B images with overlaid masks, grouped by masks per image: <50, 50–100, 100–200, 200–300, 300–400, 400–500, >500.]

Figure 2: Example images with overlaid masks from our newly introduced dataset, SA-1B. SA-1B contains 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks. These masks were annotated fully automatically by SAM, and as we verify by human ratings and numerous experiments, are of high quality and diversity. We group images by number of masks per image for visualization (there are ∼100 masks per image on average).

2 Segment Anything Task

We take inspiration from NLP, where the next token prediction task is used for foundation model pre-training and to solve diverse downstream tasks via prompt engineering [10]. To build a foundation model for segmentation, we aim to define a task with analogous capabilities.

Task.

We start by translating the idea of a prompt from NLP to segmentation, where a prompt can be a set of foreground / background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in an image. The promptable segmentation task, then, is to return a valid segmentation mask given any prompt. The requirement of a “valid” mask simply means that even when a prompt is ambiguous and could refer to multiple objects (e.g., recall the shirt vs. person example, and see Fig. 3), the output should be a reasonable mask for at least one of those objects. This requirement is similar to expecting a language model to output a coherent response to an ambiguous prompt. We choose this task because it leads to a natural pre-training algorithm and a general method for zero-shot transfer to downstream segmentation tasks via prompting.

Pre-training.

The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model’s mask predictions against the ground truth. We adapt this method from interactive segmentation [109, 70], although unlike interactive segmentation whose aim is to eventually predict a valid mask after enough user input, our aim is to always predict a valid mask for any prompt even when the prompt is ambiguous. This ensures that a pre-trained model is effective in use cases that involve ambiguity, including automatic annotation as required by our data engine §4. We note that performing well at this task is challenging and requires specialized modeling and training loss choices, which we discuss in §3.
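A toy sketch of this simulation, with 1-D binary “masks” as Python lists and hypothetical `predict` / `loss_fn` stand-ins for the model and its training loss:

```python
import random

def sample_point_prompts(gt_mask, n_rounds):
    # Sample a sequence of foreground point prompts from the ground-truth
    # mask, mimicking a user clicking on the object over several rounds.
    fg = [i for i, v in enumerate(gt_mask) if v == 1]
    return [random.choice(fg) for _ in range(n_rounds)]

def pretraining_step(gt_mask, predict, loss_fn, n_rounds=3):
    # For each simulated prompt, compare the model's predicted mask to the
    # ground truth; in real training these losses drive a parameter update.
    prompts = sample_point_prompts(gt_mask, n_rounds)
    return [loss_fn(predict(p), gt_mask) for p in prompts]
```

The key difference from interactive segmentation is that each simulated prompt must already yield a valid mask, rather than only the final prompt of a refinement sequence.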

Zero-shot transfer.

Intuitively, our pre-training task endows the model with the ability to respond appropriately to any prompt at inference time, and thus downstream tasks can be solved by engineering appropriate prompts. For example, if one has a bounding box detector for cats, cat instance segmentation can be solved by providing the detector’s box output as a prompt to our model. In general, a wide array of practical segmentation tasks can be cast as prompting. In addition to automatic dataset labeling, we explore five diverse example tasks in our experiments in §7.
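The cat-detector example above amounts to simple composition, sketched here with stub functions standing in for a real box detector and a promptable segmenter (interfaces hypothetical):

```python
def instance_segmentation_by_composition(image, box_detector, promptable_segmenter):
    # Each detected box is passed to the promptable model as a box prompt,
    # yielding one instance mask per detection -- no segmentation-specific
    # retraining is needed.
    return [promptable_segmenter(image, box_prompt=b) for b in box_detector(image)]
```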

Figure 3: Each column shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle).

Related tasks.

Segmentation is a broad field: there’s interactive segmentation [57, 109], edge detection [3], super pixelization [85], object proposal generation [2], foreground segmentation [94], semantic segmentation [90], instance segmentation [66], panoptic segmentation [59], etc. The goal of our promptable segmentation task is to produce a broadly capable model that can adapt to many (though not all) existing and new segmentation tasks via prompt engineering. This capability is a form of task generalization [26]. Note that this is different than previous work on multi-task segmentation systems. In a multi-task system, a single model performs a fixed set of tasks, e.g., joint semantic, instance, and panoptic segmentation [114, 19, 54], but the training and test tasks are the same. An important distinction in our work is that a model trained for promptable segmentation can perform a new, different task at inference time by acting as a component in a larger system, e.g., to perform instance segmentation, a promptable segmentation model is combined with an existing object detector.

Discussion.

Prompting and composition are powerful tools that enable a single model to be used in extensible ways, potentially to accomplish tasks unknown at the time of model design. This approach is analogous to how other foundation models are used, e.g., how CLIP [82] is the text-image alignment component of the DALL·E [83] image generation system. We anticipate that composable system design, powered by techniques such as prompt engineering, will enable a wider variety of applications than systems trained specifically for a fixed set of tasks. It’s also interesting to compare promptable and interactive segmentation through the lens of composition: while interactive segmentation models are designed with human users in mind, a model trained for promptable segmentation can also be composed into a larger algorithmic system as we will demonstrate.

Figure 4: Segment Anything Model (SAM) overview. A heavyweight image encoder outputs an image embedding that can then be efficiently queried by a variety of input prompts to produce object masks at amortized real-time speed. For ambiguous prompts corresponding to more than one object, SAM can output multiple valid masks and associated confidence scores.

3 Segment Anything Model

We next describe the Segment Anything Model (SAM) for promptable segmentation. SAM has three components, illustrated in Fig. 4: an image encoder, a flexible prompt encoder, and a fast mask decoder. We build on Transformer vision models [14, 33, 20, 62] with specific tradeoffs for (amortized) real-time performance. We describe these components at a high-level here, with details in §A.

Image encoder.

Motivated by scalability and powerful pre-training methods, we use an MAE [47] pre-trained Vision Transformer (ViT) [33] minimally adapted to process high resolution inputs [62]. The image encoder runs once per image and can be applied prior to prompting the model.

Prompt encoder.

We consider two sets of prompts: sparse (points, boxes, text) and dense (masks). We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82]. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
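A minimal sketch of the sparse point encoding (illustrative only: the paper uses random Fourier positional encodings [95] and learned embeddings; the fixed frequencies and placeholder values below are not SAM's actual parameters):

```python
import math

def fourier_position(x, y, freqs=(1.0, 2.0)):
    # Toy Fourier-feature positional encoding of a 2-D point; the fixed
    # frequencies here stand in for the paper's random Fourier features.
    enc = []
    for f in freqs:
        enc += [math.sin(f * x), math.cos(f * x), math.sin(f * y), math.cos(f * y)]
    return enc

# One learned embedding per prompt type; these values are placeholders for
# learned parameters, not real SAM weights.
TYPE_EMBEDDING = {"foreground_point": [0.1] * 8, "background_point": [-0.1] * 8}

def encode_point(x, y, label):
    # A sparse point prompt = positional encoding summed with its learned
    # type embedding, as described above.
    return [p + t for p, t in zip(fourier_position(x, y), TYPE_EMBEDDING[label])]
```

Box prompts would follow the same recipe with per-corner type embeddings, while dense mask prompts bypass this path entirely and are added element-wise to the image embedding after convolutional downsampling.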

Mask decoder.

The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design, inspired by [14, 20], employs a modification of a Transformer decoder block [103] followed by a dynamic mask prediction head. Our modified decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice-versa) to update all embeddings. After running two blocks, we upsample the image embedding and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
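The final step can be sketched as a dot product between the MLP-transformed output token and each spatial location of the upsampled embedding, with nested lists standing in for tensors (shapes are toy, names hypothetical):

```python
def dynamic_mask_head(upsampled_embedding, token_weights):
    # token_weights: per-channel weights produced by the MLP from the output
    # token. Each spatial cell's foreground logit is its dot product with
    # those weights -- a "dynamic" linear classifier generated per prediction.
    return [
        [sum(w * c for w, c in zip(token_weights, cell)) for cell in row]
        for row in upsampled_embedding
    ]
```

Because the classifier weights come from the token rather than fixed parameters, the same decoder can emit different masks for different prompts over one image embedding.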

Resolving ambiguity.

With one output, the model will average multiple valid masks if given an ambiguous prompt. To address this, we modify the model to predict multiple output masks for a single prompt (see Fig. 3). We found 3 mask outputs is sufficient to address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, we backprop only the minimum loss [15, 45, 64] over masks. To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask.
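The minimum-loss rule can be sketched directly (the `loss_fn` here is a toy L1 distance; the actual supervision is described under “Losses and training”):

```python
def ambiguity_aware_loss(predicted_masks, gt_mask, loss_fn):
    # Score each predicted mask (e.g. whole / part / subpart) against the
    # single ground truth and backprop only the minimum, so the model is not
    # penalized for producing plausible alternative interpretations.
    return min(loss_fn(p, gt_mask) for p in predicted_masks)
```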

Efficiency.

The overall model design is largely motivated by efficiency. Given a precomputed image embedding, the prompt encoder and mask decoder run in a web browser, on CPU, in ∼50ms. This runtime performance enables seamless, real-time interactive prompting of our model.

Losses and training.

We supervise mask prediction with the linear combination of focal loss [65] and dice loss [73] used in [14]. We train for the promptable segmentation task using a mixture of geometric prompts (for text prompts see §7.5). Following [92, 37], we simulate an interactive setup by randomly sampling prompts in 11 rounds per mask, allowing SAM to integrate seamlessly into our data engine.

4 Segment Anything Data Engine

As segmentation masks are not abundant on the internet, we built a data engine to enable the collection of our 1.1B mask dataset, SA-1B. The data engine has three stages: (1) a model-assisted manual annotation stage, (2) a semi-automatic stage with a mix of automatically predicted masks and model-assisted annotation, and (3) a fully automatic stage in which our model generates masks without annotator input. We go into details of each next.

Assisted-manual stage.

In the first stage, resembling classic interactive segmentation, a team of professional annotators labeled masks by clicking foreground / background object points using a browser-based interactive segmentation tool powered by SAM. Masks could be refined using pixel-precise “brush” and “eraser” tools. Our model-assisted annotation runs in real-time directly inside a browser (using precomputed image embeddings) enabling a truly interactive experience. We did not impose semantic constraints for labeling objects, and annotators freely labeled both “stuff” and “things” [1]. We suggested annotators label objects they could name or describe, but did not collect these names or descriptions. Annotators were asked to label objects in order of prominence and were encouraged to proceed to the next image once a mask took over 30 seconds to annotate.

At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation, SAM was retrained using only newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44 masks. Overall, we collected 4.3M masks from 120k images in this stage.

Semi-automatic stage.

In this stage, we aimed to increase the diversity of masks in order to improve our model’s ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. Then we presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first stage masks using a generic “object” category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks) as these objects were more challenging to label. The average number of masks per image went from 44 to 72 masks (including the automatic masks).

Fully automatic stage.

In the final stage, annotation was fully automatic. This was feasible due to two major enhancements to our model. First, at the start of this stage, we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allowed us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, our model will return the subpart, part, and whole object. The IoU prediction module of our model is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding the probability map at 0.5−δ and 0.5+δ results in similar masks). Finally, after selecting the confident and stable masks, we applied non-maximal suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in image crops. For further details of this stage, see §B. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks. We describe and analyze the resulting dataset, SA-1B, next.
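The stability filter can be sketched as follows on a flattened probability map (the δ value below is illustrative; the actual threshold choices are left to §B):

```python
def stability_score(prob_map, delta=0.05):
    # Threshold the predicted probability map at 0.5 - delta and 0.5 + delta;
    # the IoU of the two resulting binary masks measures how little the mask
    # changes with the threshold. Masks scoring near 1 are kept as "stable".
    lo = [1 if p >= 0.5 - delta else 0 for p in prob_map]
    hi = [1 if p >= 0.5 + delta else 0 for p in prob_map]
    inter = sum(a & b for a, b in zip(lo, hi))
    union = sum(a | b for a, b in zip(lo, hi))
    return inter / union if union else 1.0
```

A confidently segmented object has probabilities far from 0.5 everywhere, so both thresholds produce the same mask; borderline pixels near 0.5 pull the score down.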

Figure 5: Image-size normalized mask center distributions.
Figure 6: Dataset mask properties. The legend references the number of images and masks in each dataset. Note that SA-1B has 11× more images and 400× more masks than the largest existing segmentation dataset, Open Images [60].

5 Segment Anything Dataset

Our dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, and privacy protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.

Images.

We licensed a new set of 11M images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we are releasing downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are ∼480×640 pixels). Note that most models today operate on much lower resolution inputs. Faces and vehicle license plates have been blurred in the released images.

Masks.

Our data engine produced 1.1B masks, 99.1% of which were generated fully automatically. Therefore, the quality of the automatic masks is centrally important. We compare them directly to professional annotations and look at how various mask properties compare to prominent segmentation datasets. Our main conclusion, as borne out in the analysis below and the experiments in §7, is that our automatic masks are high quality and effective for training models. Motivated by these findings, SA-1B only includes automatically generated masks.

Mask quality.

To estimate mask quality, we randomly sampled 500 images (∼50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and pixel-precise “brush” and “eraser” editing tools. This procedure resulted in pairs of automatically predicted and professionally corrected masks. We computed IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85-91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets and that training our model on automatic masks is nearly as good as using all masks produced by the data engine.
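The pairwise agreement statistic used above is mechanical to compute; a minimal sketch (our own helper names, with binary masks represented as nested lists):

```python
def mask_iou(a, b):
    # Intersection-over-union of two same-shape binary masks (lists of 0/1).
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0

def agreement_stats(pairs, thresholds=(0.9, 0.75)):
    # pairs: list of (auto_mask, corrected_mask). Returns the fraction of
    # pairs whose IoU exceeds each threshold, as in the 94% @ 0.9 IoU and
    # 97% @ 0.75 IoU figures quoted above.
    ious = [mask_iou(a, b) for a, b in pairs]
    return {t: sum(i > t for i in ious) / len(ious) for t in thresholds}
```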

Figure 7: Estimated geographic distribution of SA-1B images. Most of the world’s countries have more than 1000 images in SA-1B, and the three countries with the most images are from different parts of the world.

Mask properties.

In Fig. 5 we plot the spatial distribution of object centers in SA-1B compared to the largest existing segmentation datasets. Common photographer biases are present in all datasets. We observe that SA-1B has greater coverage of image corners compared to LVIS v1 [44] and ADE20K [117], the two most similarly distributed datasets, while COCO [66] and Open Images V5 [60] have a more prominent center bias. In Fig. 6 (legend) we compare these datasets by size. SA-1B has 11× more images and 400× more masks than the second largest, Open Images. On average, it has 36× more masks per image than Open Images. The closest dataset in this respect, ADE20K, still has 3.5× fewer masks per image. Fig. 6 (left) plots the masks-per-image distribution. Next, we look at image-relative mask size (square root of the mask area divided by image area) in Fig. 6 (middle). As expected, since our dataset has more masks per image, it also tends to include a greater percentage of small and medium relative-size masks. Finally, to analyze shape complexity, we look at mask concavity (1 minus mask area divided by area of mask’s convex hull) in Fig. 6 (right). Since shape complexity is correlated with mask size, we control for the datasets’ mask size distributions by first performing stratified sampling from binned mask sizes. We observe that the concavity distribution of our masks is broadly similar to that of other datasets.
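The concavity measure (1 minus mask area divided by convex hull area) can be computed without any libraries. In this sketch (ours, with pixels treated as unit squares so that a filled rectangle scores exactly 0), the hull comes from Andrew's monotone chain and its area from the shoelace formula:

```python
def convex_hull(points):
    # Andrew's monotone chain; returns hull vertices in order.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(verts):
    # Shoelace formula for a simple polygon.
    n = len(verts)
    s = sum(verts[i][0] * verts[(i + 1) % n][1]
            - verts[(i + 1) % n][0] * verts[i][1] for i in range(n))
    return abs(s) / 2.0

def concavity(mask):
    # mask: 2D list of 0/1. Each foreground pixel (r, c) contributes its four
    # unit-square corners to the hull, so convex shapes score 0 exactly.
    area = sum(sum(row) for row in mask)
    corners = [(c + dc, r + dr) for r, row in enumerate(mask)
               for c, v in enumerate(row) if v
               for dr in (0, 1) for dc in (0, 1)]
    return 1.0 - area / polygon_area(convex_hull(corners))
```

For example, a filled 2×2 square has concavity 0, while an L-shape of three pixels has hull area 3.5 and concavity 1 − 3/3.5 ≈ 0.14.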

6 Segment Anything RAI Analysis

We next perform a Responsible AI (RAI) analysis of our work by investigating potential fairness concerns and biases when using SA-1B and SAM. We focus on the geographic and income distribution of SA-1B and fairness of SAM across protected attributes of people. We also provide dataset, data annotation, and model cards in §F.

                                    SA-1B                      % images
                         # countries   #imgs   #masks    SA-1B    COCO    O.I.
Africa                        54        300k     28M      2.8%    3.0%    1.7%
Asia & Oceania                70        3.9M    423M     36.2%   11.4%   14.3%
Europe                        47        5.4M    540M     49.8%   34.2%   36.2%
Latin America & Carib.        42        380k     36M      3.5%    3.1%    5.0%
North America                  4        830k     80M      7.7%   48.3%   42.8%
high income countries         81        5.8M    598M     54.0%   89.1%   87.5%
middle income countries      108        4.9M    499M     45.0%   10.5%   12.0%
low income countries          28        100k    9.4M      0.9%    0.4%    0.5%
Table 1: Comparison of geographic and income representation. SA-1B has higher representation in Europe and Asia & Oceania as well as middle income countries. Images from Africa, Latin America & Caribbean, as well as low income countries, are underrepresented in all datasets.

Geographic and income representation.

We infer the country images were photographed in using standard methods (see §C). In Fig. 7 we visualize the per-country image counts in SA-1B (left) and the 50 countries with the most images (right). We note that the top-three countries are from different parts of the world. Next, in Table 1 we compare the geographic and income representation of SA-1B, COCO [66], and Open Images [60]. SA-1B has a substantially higher percentage of images in Europe and Asia & Oceania as well as in middle income countries. All datasets underrepresent Africa as well as low income countries. We note that in SA-1B, all regions, including Africa, have at least 28 million masks, 10× more than the total number of masks of any previous dataset. Finally, we observe that the average number of masks per image (not shown) is fairly consistent across region and income (94-108 per image).

Fairness in segmenting people.

We investigate potential fairness concerns across perceived gender presentation, perceived age group, and perceived skin tone by measuring the performance discrepancy of SAM between groups. We use the More Inclusive Annotations for People (MIAP) [87] dataset for gender presentation and age and a proprietary dataset for skin tone (see §C). Our evaluation uses simulated interactive segmentation with random sampling of 1 and 3 points (see §D). Table 2 (top left) shows results for perceived gender presentation. We note that females have been shown to be underrepresented in detection and segmentation datasets [115], but observe that SAM performs similarly across groups. We repeat the analysis for perceived age in Table 2 (bottom left), noting that those who are perceived to be younger and older have been shown to be underrepresented in large-scale datasets [110]. SAM performs best on those who are perceived older (although the confidence interval is large). Finally, we repeat the analysis for perceived skin tone in Table 2 (right), noting that those with lighter apparent skin tones have been shown to be overrepresented and those with darker skin tones underrepresented in large-scale datasets [110]. As MIAP does not contain perceived skin tone annotations, we use a proprietary dataset that contains annotations for the perceived Fitzpatrick skin type [36], which ranges from 1 (lightest skin tone) to 6 (darkest skin tone). While the means vary somewhat, we do not find a significant difference across groups. We believe our findings stem from the nature of the task, and acknowledge biases may arise when SAM is used as a component in larger systems. Finally, in §C we extend the analysis to segmenting clothing where we find an indication of bias across perceived gender presentation.

                                      mIoU at
                                1 point      3 points
perceived gender presentation
  feminine                     54.4±1.7     90.4±0.6
  masculine                    55.7±1.7     90.1±0.6
perceived age group
  older                        62.9±6.7     92.6±1.3
  middle                       54.5±1.3     90.2±0.5
  young                        54.2±2.2     91.2±0.7

                                      mIoU at
                                1 point      3 points
perceived skin tone
  1                            52.9±2.2     91.0±0.9
  2                            51.5±1.4     91.1±0.5
  3                            52.2±1.9     91.4±0.7
  4                            51.5±2.7     91.7±1.0
  5                            52.4±4.2     92.5±1.4
  6                            56.7±6.3     91.2±2.4
Table 2: SAM’s performance segmenting people across perceived gender presentation, age group, and skin tone. 95% confidence intervals are shown. Within each grouping, all confidence intervals overlap except older vs. middle.

7 Zero-Shot Transfer Experiments

In this section, we present zero-shot transfer experiments with SAM, the Segment Anything Model. We consider five tasks, four of which differ significantly from the promptable segmentation task used to train SAM. These experiments evaluate SAM on datasets and tasks that were not seen during training (our usage of “zero-shot transfer” follows its usage in CLIP [82]). The datasets may include novel image distributions, such as underwater or ego-centric images (e.g. Fig. 8) that, to our knowledge, do not appear in SA-1B.

Our experiments begin by testing the core goal of promptable segmentation: producing a valid mask from any prompt. We emphasize the challenging scenario of a single foreground point prompt, since it is more likely to be ambiguous than other more specific prompts. Next, we present a sequence of experiments that traverse low, mid, and high-level image understanding and roughly parallel the historical development of the field. Specifically, we prompt SAM to (1) perform edge detection, (2) segment everything, i.e. object proposal generation, (3) segment detected objects, i.e. instance segmentation, and (4), as a proof-of-concept, to segment objects from free-form text. These four tasks differ significantly from the promptable segmentation task that SAM was trained on and are implemented via prompt engineering. Our experiments conclude with an ablation study.

Implementation.

Unless otherwise specified: (1) SAM uses an MAE [47] pre-trained ViT-H [33] image encoder and (2) SAM was trained on SA-1B, noting that this dataset includes only automatically generated masks from the final stage of our data engine. For all other model and training details, such as hyperparameters, refer to §A.

7.1 Zero-Shot Single Point Valid Mask Evaluation

Task.

We evaluate segmenting an object from a single foreground point. This task is ill-posed as one point can refer to multiple objects. Ground truth masks in most datasets do not enumerate all possible masks, which can make automatic metrics unreliable. Therefore, we supplement the standard mIoU metric (i.e., the mean of all IoUs between predicted and ground truth masks) with a human study in which annotators rate mask quality from 1 (nonsense) to 10 (pixel-perfect). See §D.1, §E, and §G for additional details.

By default, we sample points from the “center” of ground truth masks (at a maximal value of the mask’s interior distance transform), following the standard evaluation protocol in interactive segmentation [92]. Since SAM is capable of predicting multiple masks, we evaluate only the model’s most confident mask by default. The baselines are all single-mask methods. We compare mainly to RITM [92], a strong interactive segmenter that performs best on our benchmark compared to other strong baselines [67, 18].
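The "center" point used as the default prompt is the maximum of the mask's interior distance transform. A minimal sketch (ours, not the evaluation code) that approximates the transform with a 4-connected BFS seeded from the background and the image border:

```python
from collections import deque

def center_point(mask):
    # mask: 2D list of 0/1. Returns the (row, col) of the foreground pixel
    # farthest, in 4-connected steps, from the background (the area outside
    # the image also counts as background). This BFS distance is an integer
    # approximation of the exact interior distance transform.
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    # Seed all background pixels at distance 0 first...
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 0:
                dist[r][c] = 0
                q.append((r, c))
    # ...then foreground pixels on the image border, one step from "outside".
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1 and dist[r][c] is None and (
                    r in (0, h - 1) or c in (0, w - 1)):
                dist[r][c] = 1
                q.append((r, c))
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                q.append((nr, nc))
    return max(((r, c) for r in range(h) for c in range(w) if mask[r][c]),
               key=lambda rc: dist[rc[0]][rc[1]])
```

For a centered blob the returned point is its middle pixel, matching the intent of the interactive segmentation protocol.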

[Figure 8 grid: one sample image from each of ADE20K [117], BBBC038v1 [12], Cityscapes [25], DOORS [80], DRAM [24], EgoHOS [113], GTEA [34, 63], Hypersim [86], IBD [17], iShape [111], LVIS [44], NDD20 [100], NDISPark [22, 23], OVIS [81], PPDLS [74], Plittersdorf [46], STREETS [91], TimberSeg [38], TrashCan [52], VISOR [28, 27], WoodScape [112], PIDRay [104], ZeroWaste-f [6]]
Figure 8: Samples from the 23 diverse segmentation datasets used to evaluate SAM’s zero-shot transfer capabilities.

Datasets.

We use a newly compiled suite of 23 datasets with diverse image distributions. Fig. 8 lists the datasets and shows a sample from each one (see appendix Table 7 for more details). We use all 23 datasets for mIoU evaluation. For the human study, we use the subset listed in Fig. 9b (due to the resource requirements of such studies). This subset includes both datasets for which SAM outperforms and underperforms RITM according to automatic metrics.

[Bar chart, IoU delta at 1 center point per dataset: GTEA [34, 63] -21.4, TrashCan [52] -15.0, DRAM [24] -6.5, PIDRay [104] -5.8, Cityscapes [25] -2.0, WoodScape [112] -0.6, IBD [17] -0.3, EgoHOS [113] +0.8, Plittersdorf [46] +1.5, VISOR [28, 27] +1.8, NDISPark [22, 23] +2.7, Hypersim [86] +6.1, OVIS [81] +7.0, ADE20K [117] +7.8, iShape [111] +8.8, ZeroWaste-f [6] +9.1, STREETS [91] +17.3, LVIS [44] +18.5, NDD20 [100] +21.1, TimberSeg [38] +28.9, DOORS [80] +41.1, BBBC038v1 [12] +44.7, PPDLS [74] +46.9]
(a) SAM vs. RITM [92] on 23 datasets
(b) Mask quality ratings by human annotators
(c) Center points (default)
(d) Random points
Figure 9: Point to mask evaluation on 23 datasets. (a) Mean IoU of SAM and the strongest single point segmenter, RITM [92]. Due to ambiguity, a single mask may not match ground truth; circles show “oracle” results of the most relevant of SAM’s 3 predictions. (b) Per-dataset comparison of mask quality ratings by annotators from 1 (worst) to 10 (best). All methods use the ground truth mask center as the prompt. (c, d) mIoU with varying number of points. SAM significantly outperforms prior interactive segmenters with 1 point and is on par with more points. Low absolute mIoU at 1 point is the result of ambiguity.

Results.

First, we look at automatic evaluation on the full suite of 23 datasets using mIoU. We compare per-dataset results in Fig. 9a against RITM. SAM yields higher results on 16 of the 23 datasets, by as much as ∼47 IoU. We also present an “oracle” result, in which the most relevant of SAM’s 3 masks is selected by comparing them to the ground truth, rather than selecting the most confident mask. This reveals the impact of ambiguity on automatic evaluation. In particular, with the oracle to perform ambiguity resolution, SAM outperforms RITM on all datasets.

Results of the human study are presented in Fig. 9b. Error bars are 95% confidence intervals for mean mask ratings (all differences are significant; see §E for details). We observe that the annotators consistently rate the quality of SAM’s masks substantially higher than the strongest baseline, RITM. An ablated, “ambiguity-unaware” version of SAM with a single output mask has consistently lower ratings, though still higher than RITM. SAM’s mean ratings fall between 7 and 9, which corresponds to the qualitative rating guideline: “A high score (7-9): The object is identifiable and errors are small and rare (e.g., missing a small, heavily obscured disconnected component, …).” These results indicate that SAM has learned to segment valid masks from a single point. Note that for datasets like DRAM and IBD, where SAM is worse on automatic metrics, it receives consistently higher ratings in the human study.

Fig. 9c shows additional baselines, SimpleClick [67] and FocalClick [18], which obtain lower single point performance than RITM and SAM. As the number of points increases from 1 to 9, we observe that the gap between methods decreases. This is expected as the task becomes easier; also, SAM is not optimized for the very high IoU regime. Finally, in Fig. 9d we replace the default center point sampling with random point sampling. We observe that the gap between SAM and the baselines grows and SAM is able to achieve comparable results under either sampling method.

7.2 Zero-Shot Edge Detection

Approach.

We evaluate SAM on the classic low-level task of edge detection using BSDS500 [72, 3]. We use a simplified version of our automatic mask generation pipeline. Specifically, we prompt SAM with a 16×16 regular grid of foreground points resulting in 768 predicted masks (3 per point). Redundant masks are removed by NMS. Then, edge maps are computed using Sobel filtering of unthresholded mask probability maps and standard lightweight postprocessing, including edge NMS (see §D.2 for details).
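The Sobel step of this pipeline is straightforward to sketch. The following illustrative code (ours, omitting the edge-NMS postprocessing) computes gradient magnitudes of an unthresholded probability map:

```python
def sobel_edges(prob):
    # prob: 2D list of mask probabilities in [0, 1]. Returns the Sobel
    # gradient magnitude at each interior pixel (zeros on the 1-pixel border).
    kx = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
    ky = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel
    h, w = len(prob), len(prob[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(kx[i][j] * prob[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * prob[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out
```

A vertical step in the probability map produces a column of strong responses, while a constant map produces none; in the full pipeline these magnitudes are then thinned with edge NMS.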

[Figure 10 panels, two rows: image | ground truth | SAM]
Figure 10: Zero-shot edge prediction on BSDS500. SAM was not trained to predict edge maps nor did it have access to BSDS images or annotations during training.
method            year    ODS    OIS     AP    R50
HED [108]         2015   .788   .808   .840   .923
EDETR [79]        2022   .840   .858   .896   .930
zero-shot transfer methods:
Sobel filter      1968   .539     -      -      -
Canny [13]        1986   .600   .640   .580     -
Felz-Hutt [35]    2004   .610   .640   .560     -
SAM               2023   .768   .786   .794   .928
Table 3: Zero-shot transfer to edge detection on BSDS500.

Results.

We visualize representative edge maps in Fig. 10 (see Fig. 15 for more). Qualitatively, we observe that even though SAM was not trained for edge detection, it produces reasonable edge maps. Compared to the ground truth, SAM predicts more edges, including sensible ones that are not annotated in BSDS500. This bias is reflected quantitatively in Table 3: recall at 50% precision (R50) is high, at the cost of precision. SAM naturally lags behind state-of-the-art methods that learn the biases of BSDS500, i.e., which edges to suppress. Nevertheless, SAM performs well compared to pioneering deep learning methods such as HED [108] (also trained on BSDS500) and significantly better than prior, though admittedly outdated, zero-shot transfer methods.

7.3 Zero-Shot Object Proposals

Approach.

Next, we evaluate SAM on the mid-level task of object proposal generation [2, 102]. This task has played an important role in object detection research, serving as an intermediate step in pioneering systems (e.g., [102, 41, 84]). To generate object proposals, we run a slightly modified version of our automatic mask generation pipeline and output the masks as proposals (see §D.3 for details).

We compute the standard average recall (AR) metric on LVIS v1 [44]. We focus on LVIS because its large number of categories presents a challenging test. We compare to a strong baseline implemented as a ViTDet [62] detector (with cascade Mask R-CNN [48, 11] ViT-H). We note that this “baseline” corresponds to the “Detector Masquerading as Proposal generator” (DMP) method [16] that was shown to game AR, making it a truly demanding comparison.
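Average recall itself is simple to compute once each ground-truth mask's best-matching proposal IoU is known. A sketch (ours, not the LVIS toolkit) using the usual 0.5 to 0.95 threshold range in steps of 0.05:

```python
def average_recall(gt_to_best_iou, thresholds=None):
    # gt_to_best_iou: for each ground-truth mask, the max IoU achieved by any
    # of the (up to 1000) proposals. AR averages recall over IoU thresholds
    # 0.5, 0.55, ..., 0.95, in the COCO/LVIS style.
    if thresholds is None:
        thresholds = [0.5 + 0.05 * i for i in range(10)]
    n = len(gt_to_best_iou)
    recalls = [sum(iou >= t for iou in gt_to_best_iou) / n for t in thresholds]
    return sum(recalls) / len(recalls)
```

A ground-truth object matched at IoU 0.72, for instance, counts as recalled at the five thresholds up to 0.7 but missed above, so imperfect matches still contribute partial credit.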

                                    mask AR@1000
method                all   small    med.   large   freq.    com.   rare
ViTDet-H [62]        63.0    51.7    80.8    87.0    63.1    63.3   58.3
zero-shot transfer methods:
SAM – single out.    54.9    42.8    76.7    74.4    54.7    59.8   62.0
SAM                  59.3    45.5    81.6    86.9    59.1    63.9   65.8
Table 4: Object proposal generation on LVIS v1. SAM is applied zero-shot, i.e. it was not trained for object proposal generation nor did it access LVIS images or annotations.

Results.

In Table 4 we see unsurprisingly that using the detections from ViTDet-H as object proposals (i.e., the DMP method [16] that games AR) performs the best overall. However, SAM does remarkably well on several metrics. Notably, it outperforms ViTDet-H on medium and large objects, as well as rare and common objects. In fact, SAM only underperforms ViTDet-H on small objects and frequent objects, where ViTDet-H can easily learn LVIS-specific annotation biases since it was trained on LVIS, unlike SAM. We also compare against an ablated ambiguity-unaware version of SAM (“single out.”), which performs significantly worse than SAM on all AR metrics.

7.4 Zero-Shot Instance Segmentation

Approach.

Moving to higher-level vision, we use SAM as the segmentation module of an instance segmenter. The implementation is simple: we run an object detector (the ViTDet used before) and prompt SAM with its output boxes. This illustrates composing SAM in a larger system.

Results.

We compare the masks predicted by SAM and ViTDet on COCO and LVIS in Table 5. Looking at the mask AP metric we observe gaps on both datasets, where SAM is reasonably close, though certainly behind ViTDet. By visualizing outputs, we observed that SAM masks are often qualitatively better than those of ViTDet, with crisper boundaries (see §D.4 and Fig. 16). To investigate this observation, we conducted an additional human study asking annotators to rate the ViTDet masks and SAM masks on the 1 to 10 quality scale used before. In Fig. 11 we observe that SAM consistently outperforms ViTDet in the human study.

                            COCO [66]                     LVIS v1 [44]
method                AP   AP_S   AP_M   AP_L        AP   AP_S   AP_M   AP_L
ViTDet-H [62]       51.0   32.0   54.3   68.9      46.6   35.0   58.0   66.3
zero-shot transfer methods (segmentation module only):
SAM                 46.5   30.8   51.0   61.7      44.7   32.5   57.6   65.5
Table 5: Instance segmentation results. SAM is prompted with ViTDet boxes to do zero-shot segmentation. The fully-supervised ViTDet outperforms SAM, but the gap shrinks on the higher-quality LVIS masks. Interestingly, SAM outperforms ViTDet according to human ratings (see Fig. 11).
Figure 11: Mask quality rating distribution from our human study for ViTDet and SAM, both applied to LVIS ground truth boxes. We also report LVIS and COCO ground truth quality. The legend shows rating means and 95% confidence intervals. Despite its lower AP (Table 5), SAM has higher ratings than ViTDet, suggesting that ViTDet exploits biases in the COCO and LVIS training data.

We hypothesize that on COCO, where the mask AP gap is larger and the ground truth quality is relatively low (as borne out by the human study), ViTDet learns the specific biases of COCO masks. SAM, being a zero-shot method, is unable to exploit these (generally undesirable) biases. The LVIS dataset has higher quality ground truth, but there are still specific idiosyncrasies (e.g., masks do not contain holes, they are simple polygons by construction) and biases for modal vs. amodal masks. Again, SAM is not trained to learn these biases, while ViTDet can exploit them.

7.5 Zero-Shot Text-to-Mask

Approach.

Finally, we consider an even higher-level task: segmenting objects from free-form text. This experiment is a proof-of-concept of SAM’s ability to process text prompts. While we used the exact same SAM in all prior experiments, for this one SAM’s training procedure is modified to make it text-aware, but in a way that does not require new text annotations. Specifically, for each manually collected mask with area larger than 100² we extract the CLIP image embedding. Then, during training, we prompt SAM with the extracted CLIP image embeddings as its first interaction. The key observation here is that because CLIP’s image embeddings are trained to align with its text embeddings, we can train with image embeddings, but use text embeddings for inference. That is, at inference time we run text through CLIP’s text encoder and then give the resulting text embedding as a prompt to SAM (see §D.5 for details).
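The train-with-images, prompt-with-text trick rests entirely on CLIP's two towers sharing one embedding space. The toy below (all vectors invented for illustration; this is not CLIP or SAM code) shows why a prompt pathway fit to image embeddings also responds correctly to the aligned text embeddings:

```python
def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# "Training-time" prompts: stand-ins for CLIP *image* embeddings of mask
# crops (real CLIP embeddings are high-dimensional; these are toy 3-vectors).
train_prompts = {
    "wheel_mask": [0.9, 0.1, 0.0],
    "grille_mask": [0.1, 0.2, 0.95],
}

def nearest_mask(prompt_emb):
    # Hypothetical stand-in for the trained prompt pathway: returns the mask
    # whose training-time image embedding the prompt most resembles.
    return max(train_prompts, key=lambda k: cosine(train_prompts[k], prompt_emb))

# "Inference-time" prompts: stand-ins for CLIP *text* embeddings. Because
# CLIP aligns the two modalities, each text vector lands near its image
# counterpart, so the image-trained pathway handles it unchanged.
text_wheel = [0.88, 0.12, 0.05]    # "a wheel"
text_grille = [0.05, 0.25, 0.9]    # "beaver tooth grille"
```

Here `nearest_mask(text_wheel)` selects the wheel mask even though only image embeddings were ever seen at "training" time; the actual system exploits the same alignment with SAM's learned prompt encoder in place of this lookup.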

Figure 12: Zero-shot text-to-mask. SAM can work with simple and nuanced text prompts. When SAM fails to make a correct prediction, an additional point prompt can help.

Results. 结果

We show qualitative results in Fig. 12. SAM can segment objects based on simple text prompts like “a wheel” as well as phrases like “beaver tooth grille”. When SAM fails to pick the right object from a text prompt only, an additional point often fixes the prediction, similar to [31].
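The text-then-point fallback described above can be sketched as an accumulating prompt list. This is an illustrative sketch, not the paper's code; `sam` and the `(kind, value)` prompt representation are assumptions.

```python
def refine_with_point(sam, image, text_embedding, point=None):
    """Predict from a text prompt, optionally refined by one point click.

    Prompts are accumulated rather than replaced: if the text-only
    prediction picks the wrong object, the caller re-runs SAM with the
    same text embedding plus a user-supplied point, mirroring the
    interactive fix-up shown in Fig. 12.
    """
    prompts = [("text", text_embedding)]
    if point is not None:
        prompts.append(("point", point))
    return sam(image, prompts)
```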

Figure 13: Ablation studies of our data engine stages, image encoder scaling, and training data scaling. (Left) Each data engine stage leads to improvements on our 23 dataset suite, and training with only the automatic data (our default) yields similar results to using data from all three stages. (Middle) SAM trained with ∼10% of SA-1B and full SA-1B is comparable. We train with all 11M images by default, but using 1M images is a reasonable practical setting. (Right) Scaling SAM’s image encoder shows meaningful, yet saturating gains. Nevertheless, smaller image encoders may be preferred in certain settings.

7.6 Ablations

We perform several ablations on our 23 dataset suite with the single center point prompt protocol. Recall that a single point may be ambiguous and that ambiguity may not be represented in the ground truth, which contains only a single mask per point. Since SAM is operating in a zero-shot transfer setting there can be systematic biases between SAM’s top-ranked mask vs. the masks resulting from data annotation guidelines. We therefore additionally report the best mask with respect to the ground truth (“oracle”).
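The "oracle" protocol can be made concrete with a small sketch (masks are represented as sets of pixel coordinates for brevity; these helper names are illustrative, not the paper's evaluation code):

```python
def iou(pred, gt):
    """Intersection-over-union of two binary masks given as pixel-coordinate sets."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def point_prompt_scores(ranked_masks, gt_mask):
    """Return (default, oracle) IoU for one single-point prompt.

    `ranked_masks` is SAM's ranked list of candidate masks. The default
    score uses the top-ranked mask; the oracle score picks the candidate
    closest to the ground truth, which factors out the ambiguity of a
    single point mapping to several valid masks.
    """
    default = iou(ranked_masks[0], gt_mask)
    oracle = max(iou(m, gt_mask) for m in ranked_masks)
    return default, oracle
```

Reporting both numbers separates SAM's ranking quality (default) from the quality of its best candidate (oracle).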

Fig. 13 (left) plots SAM’s performance when trained on cumulative data from the data engine stages. We observe that each stage increases mIoU. When training with all three stages, the automatic masks vastly outnumber the manual and semi-automatic masks. To address this, we found that oversampling the manual and semi-automatic masks during training by 10× gave best results. This setup complicates training. We therefore tested a fourth setup that uses only the automatically generated masks. With this data, SAM performs only marginally lower than using all data (∼0.5 mIoU). Therefore, by default we use only the automatically generated masks to simplify the training setup.
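The 10× oversampling setup can be sketched as follows (the stage labels and the flat epoch-list representation are illustrative, not the actual training loader):

```python
OVERSAMPLE = 10  # manual and semi-automatic masks are repeated 10x per epoch

def build_epoch(masks):
    """Build one epoch's sample list with stage-dependent oversampling.

    `masks` is a list of (mask_id, stage) pairs, with stage in
    {"manual", "semi", "auto"}. Automatic masks vastly outnumber the
    other two stages, so manual and semi-automatic masks are duplicated
    to keep them from being drowned out during training.
    """
    epoch = []
    for mask_id, stage in masks:
        repeats = OVERSAMPLE if stage in ("manual", "semi") else 1
        epoch.extend([mask_id] * repeats)
    return epoch
```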

In Fig. 13 (middle) we look at the impact of data volume. The full SA-1B contains 11M images, which we uniformly subsample to 1M and 0.1M for this ablation. At 0.1M images, we observe a large mIoU decline under all settings. However, with 1M images, about 10% of the full dataset, we observe results comparable to using the full dataset. This data regime, which still includes approximately 100M masks, may be a practical setting for many use cases.

Finally, Fig. 13 (right) shows results with ViT-B, ViT-L, and ViT-H image encoders. ViT-H improves substantially over ViT-B, but has only marginal gains over ViT-L. Further image encoder scaling does not appear fruitful at this time.

8 Discussion

Foundation models.

Pre-trained models have been adapted to downstream tasks since the early days of machine learning [99]. This paradigm has become increasingly important in recent years with a growing emphasis on scale, and such models have recently been (re-)branded as “foundation models”: i.e. models that are “trained on broad data at scale and are adaptable to a wide range of downstream tasks” [8]. Our work correlates well with this definition, though we note that a foundation model for image segmentation is an inherently limited scope, since it represents an important, yet fractional, subset of computer vision. We also contrast one aspect of our approach with [8], which emphasizes the role of self-supervised learning in foundation models. While our model is initialized with a self-supervised technique (MAE [47]), the vast majority of its capabilities come from large-scale supervised training. In cases where data engines can scale available annotations, like ours, supervised training provides an effective solution.

Compositionality.

Pre-trained models can power new capabilities even beyond ones imagined at the moment of training. One prominent example is how CLIP [82] is used as a component in larger systems, such as DALL·E [83]. Our goal is to make this kind of composition straightforward with SAM. We aim to achieve this by requiring SAM to predict a valid mask for a wide range of segmentation prompts. The effect is to create a reliable interface between SAM and other components. For example, MCC [106] can easily use SAM to segment an object of interest and achieve strong generalization to unseen objects for 3D reconstruction from a single RGB-D image. In another example, SAM can be prompted with gaze points detected by a wearable device, enabling new applications. Thanks to SAM’s ability to generalize to new domains like ego-centric images, such systems work without need for additional training.
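The interface this compositionality relies on can be sketched generically; `sam` and the prompt sources here are placeholders for whatever components are being composed, not a real API.

```python
def segment_with(sam, prompt_source, image):
    """Compose SAM with any upstream component that emits a prompt.

    `prompt_source` is a callable that maps an image to a (kind, value)
    prompt, e.g. ("point", (x, y)) from a gaze tracker or ("box", xyxy)
    from a detector. Because SAM is trained to return a valid mask for
    any such prompt, the upstream component needs no knowledge of SAM's
    internals, and SAM needs no retraining for the new source.
    """
    return sam(image, prompt_source(image))
```

The design choice is that the contract lives entirely in the prompt: swapping the gaze tracker for a detector changes nothing downstream.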

Limitations.

While SAM performs well in general, it is not perfect. It can miss fine structures, can hallucinate small disconnected components at times, and does not produce boundaries as crisply as more computationally intensive methods that “zoom-in”, e.g. [18]. In general, we expect dedicated interactive segmentation methods to outperform SAM when many points are provided, e.g. [67]. Unlike these methods, SAM is designed for generality and breadth of use rather than high-IoU interactive segmentation. Moreover, SAM can process prompts in real time, but its overall performance is nevertheless not real-time when using a heavy image encoder. Our foray into the text-to-mask task is exploratory and not entirely robust, although we believe it can be improved with more effort. While SAM can perform many tasks, it is unclear how to design simple prompts that implement semantic and panoptic segmentation. Finally, there are domain-specific tools, such as [7], that we expect to outperform SAM in their respective domains.

Conclusion.

The Segment Anything project is an attempt to lift image segmentation into the era of foundation models. Our principal contributions are a new task (promptable segmentation), model (SAM), and dataset (SA-1B) that make this leap possible. Whether SAM achieves the status of a foundation model remains to be seen by how it is used in the community, but regardless we expect the perspective of this work, the release of over 1B masks, and our promptable segmentation model will help pave the path ahead.

Acknowledgments.

We would like to thank Aaron Adcock and Jitendra Malik for helpful discussion. We thank Vaibhav Aggarwal and Yanghao Li for help with scaling the model. We thank Cheng-Yang Fu, Jiabo Hu, and Robert Kuo for help with data annotation platform. We thank Allen Goodman and Bram Wasti for help in optimizing web-version of our model. Finally, we thank Morteza Behrooz, Ashley Gabriel, Ahuva Goldstand, Sumanth Gurram, Somya Jain, Devansh Kukreja, Joshua Lane, Lilian Luong, Mallika Malhotra, William Ngan, Omkar Parkhi, Nikhil Raina, Dirk Rowe, Neil Sejoor, Vanessa Stark, Bala Varadarajan, and Zachary Winstrom for their help in making the demo, dataset viewer, and other assets and tooling.

References

  • [1] Edward H Adelson. On seeing stuff: the perception of materials by humans and machines. Human vision and electronic imaging VI, 2001.
  • [2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? CVPR, 2010.
  • [3] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. TPAMI, 2010.
  • [4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016.
  • [5] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021.
  • [6] Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable object segmentation in cluttered scenes. CVPR, 2022.
  • [7] Stuart Berg, Dominik Kutra, Thorben Kroeger, Christoph N. Straehle, Bernhard X. Kausler, Carsten Haubold, Martin Schiegg, Janez Ales, Thorsten Beier, Markus Rudy, Kemal Eren, Jaime I. Cervantes, Buote Xu, Fynn Beuttenmueller, Adrian Wolny, Chong Zhang, Ullrich Koethe, Fred A. Hamprecht, and Anna Kreshuk. ilastik: interactive machine learning for (bio)image analysis. Nature Methods, 2019.
  • [8] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021.
  • [9] Gustav Bredell, Christine Tanner, and Ender Konukoglu. Iterative interaction training for segmentation editing networks. MICCAI, 2018.
  • [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. NeurIPS, 2020.
  • [11] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. CVPR, 2018.
  • [12] Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter. Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods, 2019.
  • [13] John Canny. A computational approach to edge detection. TPAMI, 1986.
  • [14] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with Transformers. ECCV, 2020.
  • [15] Guillaume Charpiat, Matthias Hofmann, and Bernhard Schölkopf. Automatic image colorization via multimodal predictions. ECCV, 2008.
  • [16] Neelima Chavali, Harsh Agrawal, Aroma Mahendru, and Dhruv Batra. Object-proposal evaluation protocol is’ gameable’. CVPR, 2016.
  • [17] Jiazhou Chen, Yanghui Xu, Shufang Lu, Ronghua Liang, and Liangliang Nan. 3D instance segmentation of MVS buildings. IEEE Transactions on Geoscience and Remote Sensing, 2022.
  • [18] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. FocalClick: towards practical interactive image segmentation. CVPR, 2022.
  • [19] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. CVPR, 2022.
  • [20] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.
  • [21] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
  • [22] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Domain adaptation for traffic density estimation. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2021.
  • [23] Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Night and day instance segmented park (NDISPark) dataset: a collection of images taken by day and by night for vehicle detection, segmentation and counting in parking areas. Zenodo, 2022.
  • [24] Nadav Cohen, Yael Newman, and Ariel Shamir. Semantic segmentation in art paintings. Computer Graphics Forum, 2022.
  • [25] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. CVPR, 2016.
  • [26] Bruno da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. ICML, 2012.
  • [27] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 2022.
  • [28] Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. NeurIPS, 2022.
  • [29] Terrance De Vries, Ishan Misra, Changhan Wang, and Laurens Van der Maaten. Does object recognition work for everyone? CVPR workshops, 2019.
  • [30] Mark Díaz, Ian Kivlichan, Rachel Rosen, Dylan Baker, Razvan Amironesei, Vinodkumar Prabhakaran, and Emily Denton. CrowdWorkSheets: Accounting for individual and collective identities underlying crowdsourced dataset annotation. ACM Conference on Fairness, Accountability, and Transparency, 2022.
  • [31] Henghui Ding, Scott Cohen, Brian Price, and Xudong Jiang. PhraseClick: toward achieving flexible interactive segmentation by phrase and click. ECCV, 2020.
  • [32] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. TPAMI, 2014.
  • [33] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  • [34] Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to recognize objects in egocentric activities. CVPR, 2011.
  • [35] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
  • [36] Thomas B. Fitzpatrick. The validity and practicality of sun-reactive skin types i through vi. Archives of Dermatology, 1988.
  • [37] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and François Pitié. Getting to 99% accuracy in interactive segmentation. arXiv:2003.07932, 2020.
  • [38] Jean-Michel Fortin, Olivier Gamache, Vincent Grondin, François Pomerleau, and Philippe Giguère. Instance segmentation for autonomous log grasping in forestry operations. IROS, 2022.
  • [39] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021.
  • [40] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. CVPR, 2021.
  • [41] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
  • [42] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
  • [43] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR, 2022.
  • [44] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. CVPR, 2019.
  • [45] Abner Guzman-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. NeurIPS, 2012.
  • [46] Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage. SOCRATES: Introducing depth in visual wildlife monitoring using stereo vision. Sensors, 2022.
  • [47] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. CVPR, 2022.
  • [48] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. ICCV, 2017.
  • [49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [50] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.
  • [51] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022.
  • [52] Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual detection of marine debris. arXiv:2007.08097, 2020.
  • [53] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. ECCV, 2016.
  • [54] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. Oneformer: One transformer to rule universal image segmentation. arXiv:2211.06220, 2022.
  • [55] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. ICML, 2021.
  • [56] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  • [57] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1988.
  • [58] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 2022.
  • [59] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. CVPR, 2019.
  • [60] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
  • [61] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv:1910.09700, 2019.
  • [62] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. ECCV, 2022.
  • [63] Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric actions. CVPR, 2015.
  • [64] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. CVPR, 2018.
  • [65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, 2017.
  • [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
  • [67] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. SimpleClick: Interactive image segmentation with simple vision transformers. arXiv:2210.11006, 2022.
  • [68] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
  • [69] Cathy H Lucas, Daniel OB Jones, Catherine J Hollyhead, Robert H Condon, Carlos M Duarte, William M Graham, Kelly L Robinson, Kylie A Pitt, Mark Schildhauer, and Jim Regetz. Gelatinous zooplankton biomass in the global oceans: geographic variation and environmental drivers. Global Ecology and Biogeography, 2014.
  • [70] Sabarinath Mahadevan, Paul Voigtlaender, and Bastian Leibe. Iteratively trained interactive segmentation. BMVC, 2018.
  • [71] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. CVPR, 2018.
  • [72] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV, 2001.
  • [73] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 3DV, 2016.
  • [74] Massimo Minervini, Andreas Fischbach, Hanno Scharr, and Sotirios A. Tsaftaris. Finely-grained annotated datasets for image-based plant phenotyping. Pattern Recognition Letters, 2016.
  • [75] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. Proceedings of the conference on fairness, accountability, and transparency, 2019.
  • [76] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. ICCV, 2017.
  • [77] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv:2104.10350, 2021.
  • [78] Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
  • [79] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling. EDTER: Edge detection with transformer. CVPR, 2022.
  • [80] Mattia Pugliatti and Francesco Topputo. DOORS: Dataset fOr bOuldeRs Segmentation. Zenodo, 2022.
  • [81] Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip Torr, and Song Bai. Occluded video instance segmentation: A benchmark. ICCV, 2022.
  • [82] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021.
  • [83] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ICML, 2021.
  • [84] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS, 2015.
  • [85] Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. ICCV, 2003.
  • [86] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. ICCV, 2021.
  • [87] Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step toward more inclusive people annotations for fairness. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021.
  • [88] Sefik Ilkin Serengil and Alper Ozpinar. LightFace: A hybrid deep face recognition framework. ASYU, 2020.
  • [89] Sefik Ilkin Serengil and Alper Ozpinar. HyperExtended LightFace: A facial attribute analysis framework. ICEET, 2021.
  • [90] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, 2006.
  • [91] Corey Snyder and Minh Do. STREETS: A novel camera network dataset for traffic flow. NeurIPS, 2019.
  • [92] Konstantin Sofiiuk, Ilya A Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation. ICIP, 2022.
  • [93] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.
  • [94] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. CVPR, 1999.
  • [95] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
  • [96] Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Action recognition in RGB-D egocentric videos. ICIP, 2017.
  • [97] Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Multi-stream deep neural networks for RGB-D egocentric action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • [98] The World Bank. The world by income and regions, 2022. https://datatopics.worldbank.org/world-development-indicators/the-world-by-income-and-region.html.
  • [99] Sebastian Thrun. Is learning the n-th thing any easier than learning the first? NeurIPS, 1995.
  • [100] Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben Burville, and Per Berggren. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation. arXiv:2005.13359, 2020.
  • [101] United States Environmental Protection Agency. Greenhouse Gas Equivalencies Calculator. https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator, 2022.
  • [102] Koen EA van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. ICCV, 2011.
  • [103] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • [104] Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yanjun Wu. Towards real-world prohibited item detection: A large-scale x-ray benchmark. CVPR, 2021.
  • [105] Weiyao Wang, Matt Feiszli, Heng Wang, Jitendra Malik, and Du Tran. Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. CVPR, 2022.
  • [106] Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3D reconstruction. CVPR, 2023.
  • [107] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. CVPR, 2010.
  • [108] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. ICCV, 2015.
  • [109] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. CVPR, 2016.
  • [110] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. Proceedings of the 2020 conference on fairness, accountability, and transparency, 2020.
  • [111] Lei Yang, Yan Zi Wei, Yisheng HE, Wei Sun, Zhenhang Huang, Haibin Huang, and Haoqiang Fan. iShape: A first step towards irregular shape instance segmentation. arXiv:2109.15068, 2021.
  • [112] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. ICCV, 2019.
  • [113] Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation: Dataset, model, and applications. ECCV, 2022.
  • [114] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. NeurIPS, 2021.
  • [115] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv:1707.09457, 2017.
  • [116] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
  • [117] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.

Appendix

Table of contents:

  • §A: Segment Anything Model and Task Details
  • §B: Automatic Mask Generation Details
  • §C: RAI Additional Details
  • §D: Experiment Implementation Details
  • §E: Human Study Experimental Design
  • §F: Dataset, Annotation, and Model Cards
  • §G: Annotation Guidelines

Appendix A Segment Anything Model and Task Details

Image encoder.

In general, the image encoder can be any network that outputs a C×H×W image embedding. Motivated by scalability and access to strong pre-training, we use an MAE [47] pre-trained Vision Transformer (ViT) [33] with minimal adaptations to process high resolution inputs, specifically a ViT-H/16 with 14×14 windowed attention and four equally-spaced global attention blocks, following [62]. The image encoder’s output is a 16× downscaled embedding of the input image. Since our runtime goal is to process each prompt in real-time, we can afford a high number of image encoder FLOPs because they are computed only once per image, not per prompt.

Following standard practices (e.g., [40]), we use an input resolution of 1024×1024 obtained by rescaling the image and padding the shorter side. The image embedding is therefore 64×64. To reduce the channel dimension, following [62], we use a 1×1 convolution to get to 256 channels, followed by a 3×3 convolution also with 256 channels. Each convolution is followed by a layer normalization [4].
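The shape arithmetic above can be sketched in a few lines of numpy. This is a hypothetical illustration only: the weights are zero placeholders rather than SAM's parameters, the layer normalizations are omitted, and the ViT-H channel width of 1280 is assumed.

```python
import numpy as np

# Sketch of the encoder neck geometry: a 1024x1024 input patchified at
# stride 16 gives a 64x64 token grid; a 1x1 then a 3x3 convolution
# (implemented naively here) reduce the assumed 1280 ViT-H channels to 256.

def neck(vit_features: np.ndarray, w1: np.ndarray, w3: np.ndarray) -> np.ndarray:
    """vit_features: (H, W, C_in); w1: (C_in, 256); w3: (3, 3, 256, 256)."""
    x = vit_features @ w1                      # 1x1 conv == per-pixel matmul
    x = np.pad(x, ((1, 1), (1, 1), (0, 0)))    # 'same' padding for the 3x3 conv
    h, w = vit_features.shape[:2]
    out = np.zeros((h, w, 256))
    for i in range(3):                          # naive 3x3 convolution
        for j in range(3):
            out += x[i:i + h, j:j + w] @ w3[i, j]
    return out

img_side, patch = 1024, 16
grid = img_side // patch                        # 64x64 embedding
feats = np.zeros((grid, grid, 1280))            # placeholder ViT features
emb = neck(feats, np.zeros((1280, 256)), np.zeros((3, 3, 256, 256)))
print(emb.shape)  # (64, 64, 256)
```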

Prompt encoder.

Sparse prompts are mapped to 256-dimensional vectorial embeddings as follows. A point is represented as the sum of a positional encoding [95] of the point’s location and one of two learned embeddings that indicate if the point is either in the foreground or background. A box is represented by an embedding pair: (1) the positional encoding of its top-left corner summed with a learned embedding representing “top-left corner” and (2) the same structure but using a learned embedding indicating “bottom-right corner”. Finally, to represent free-form text we use the text encoder from CLIP [82] (any text encoder is possible in general). We focus on geometric prompts for the remainder of this section and discuss text prompts in depth in §D.5.
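As a rough illustration, a point prompt's embedding can be sketched as a random-Fourier positional encoding of its location [95] plus one of two learned type embeddings. The basis matrix and embedding values below are random placeholders, not SAM's learned weights.

```python
import numpy as np

# Hedged sketch of a 256-dimensional point-prompt embedding:
# Fourier positional features of (x, y) plus a foreground/background
# type embedding. All parameters here are placeholders.

rng = np.random.default_rng(0)
DIM = 256
B = rng.normal(size=(2, DIM // 2))      # random Fourier basis (placeholder)
type_embed = rng.normal(size=(2, DIM))  # [background, foreground] (placeholder)

def encode_point(xy: np.ndarray, is_foreground: bool) -> np.ndarray:
    """xy: point location normalized to [0, 1]^2."""
    xy = 2 * xy - 1                      # map to [-1, 1]
    proj = 2 * np.pi * (xy @ B)
    pe = np.concatenate([np.sin(proj), np.cos(proj)])
    return pe + type_embed[int(is_foreground)]

vec = encode_point(np.array([0.25, 0.75]), is_foreground=True)
print(vec.shape)  # (256,)
```

A box would use the same positional encoding applied to its two corners, each summed with its own corner-type embedding.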

Dense prompts (i.e., masks) have a spatial correspondence with the image. We input masks at a 4× lower resolution than the input image, then downscale an additional 4× using two 2×2, stride-2 convolutions with output channels 4 and 16, respectively. A final 1×1 convolution maps the channel dimension to 256. Each layer is separated by GELU activations [50] and layer normalization. The mask and image embedding are then added element-wise. If there is no mask prompt, a learned embedding representing “no mask” is added to each image embedding location.

Figure 14: Details of the lightweight mask decoder. A two-layer decoder updates both the image embedding and prompt tokens via cross-attention. Then the image embedding is upscaled, from which the updated output tokens are used to dynamically predict masks. (Not illustrated for figure clarity: At every attention layer, positional encodings are added to the image embedding, and the entire original prompt token (including position encoding) is re-added to the token queries and keys.)

Lightweight mask decoder.

This module efficiently maps the image embedding and a set of prompt embeddings to an output mask. To combine these inputs, we take inspiration from Transformer segmentation models [14, 20] and modify a standard Transformer decoder [103]. Before applying our decoder, we first insert into the set of prompt embeddings a learned output token embedding that will be used at the decoder’s output, analogous to the [class] token in [33]. For simplicity, we refer to these embeddings (not including the image embedding) collectively as “tokens”.

Our decoder design is shown in Fig. 14. Each decoder layer performs 4 steps: (1) self-attention on the tokens, (2) cross-attention from tokens (as queries) to the image embedding, (3) a point-wise MLP updates each token, and (4) cross-attention from the image embedding (as queries) to tokens. This last step updates the image embedding with prompt information. During cross-attention, the image embedding is treated as a set of 64² 256-dimensional vectors. Each self/cross-attention and MLP has a residual connection [49], layer normalization, and a dropout [93] of 0.1 at training. The next decoder layer takes the updated tokens and the updated image embedding from the previous layer. We use a two-layer decoder.
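The four steps can be sketched as follows. This is a heavily simplified, hypothetical illustration: single-head attention, no residual connections, layer normalization, dropout, or positional re-injection, and random placeholder weights rather than SAM's.

```python
import numpy as np

# Minimal sketch of one decoder layer's four steps on placeholder data.

def attn(q, k, v):
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def decoder_layer(tokens, img_emb, w_mlp):
    tokens = attn(tokens, tokens, tokens)              # (1) self-attention on tokens
    tokens = attn(tokens, img_emb, img_emb)            # (2) tokens -> image cross-attn
    tokens = np.maximum(tokens @ w_mlp, 0) @ w_mlp.T   # (3) point-wise MLP (dim 2048)
    img_emb = attn(img_emb, tokens, tokens)            # (4) image -> tokens cross-attn
    return tokens, img_emb

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 256))         # prompt tokens + learned output tokens
img_emb = rng.normal(size=(64 * 64, 256))  # 64x64 embedding as 4096 vectors
w_mlp = rng.normal(size=(256, 2048)) * 0.01
tokens, img_emb = decoder_layer(tokens, img_emb, w_mlp)
print(tokens.shape, img_emb.shape)  # (8, 256) (4096, 256)
```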

To ensure the decoder has access to critical geometric information the positional encodings are added to the image embedding whenever they participate in an attention layer. Additionally, the entire original prompt tokens (including their positional encodings) are re-added to the updated tokens whenever they participate in an attention layer. This allows for a strong dependence on both the prompt token’s geometric location and type.

After running the decoder, we upsample the updated image embedding by 4× with two transposed convolutional layers (now it’s downscaled 4× relative to the input image). Then, the tokens attend once more to the image embedding and we pass the updated output token embedding to a small 3-layer MLP that outputs a vector matching the channel dimension of the upscaled image embedding. Finally, we predict a mask with a spatially point-wise product between the upscaled image embedding and the MLP’s output.
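The final prediction step reduces to a per-pixel dot product. A small sketch with random placeholder values, assuming the 32-channel upscaled embedding at a 4× downscale of a 1024×1024 input:

```python
import numpy as np

# Sketch of the spatially point-wise product: the per-mask vector from the
# output-token MLP is dotted against every spatial location of the upscaled
# image embedding, giving one mask logit per pixel.

rng = np.random.default_rng(0)
C, H, W = 32, 256, 256
upscaled = rng.normal(size=(C, H, W))   # placeholder upscaled image embedding
token_vec = rng.normal(size=(C,))       # placeholder output of the 3-layer MLP

mask_logits = np.einsum("c,chw->hw", token_vec, upscaled)
print(mask_logits.shape)  # (256, 256)
```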

The transformer uses an embedding dimension of 256. The transformer MLP blocks have a large internal dimension of 2048, but the MLP is applied only to the prompt tokens, for which there are relatively few (rarely greater than 20). However, in cross-attention layers where we have a 64×64 image embedding, we reduce the channel dimension of the queries, keys, and values by 2× to 128 for computational efficiency. All attention layers use 8 heads.

The transposed convolutions used to upscale the output image embedding are 2×2, stride 2, with output channel dimensions of 64 and 32, and have GELU activations. They are separated by layer normalization.

Making the model ambiguity-aware.

As described, a single input prompt may be ambiguous in the sense that it corresponds to multiple valid masks, and the model will learn to average over these masks. We eliminate this problem with a simple modification: instead of predicting a single mask, we use a small number of output tokens and predict multiple masks simultaneously. By default we predict three masks, since we observe that three layers (whole, part, and subpart) are often enough to describe nested masks. During training, we compute the loss (described shortly) between the ground truth and each of the predicted masks, but only backpropagate from the lowest loss. This is a common technique used for models with multiple outputs [15, 45, 64]. For use in applications, we’d like to rank predicted masks, so we add a small head (operating on an additional output token) that estimates the IoU between each predicted mask and the object it covers.
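The minimum-loss selection can be sketched as follows; a plain IoU-style loss stands in here for the actual focal+dice training loss.

```python
import numpy as np

# Sketch of ambiguity-aware supervision: compute the loss between the ground
# truth and each of the three predicted masks, but backpropagate only the
# minimum, so a single prompt can commit to one of several valid masks.

def mask_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 - inter / max(union, 1)

gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
preds = [np.zeros((8, 8), bool), gt.copy(), np.ones((8, 8), bool)]
losses = [mask_loss(p, gt) for p in preds]
best = int(np.argmin(losses))   # only this mask's loss receives a gradient
print(best, losses[best])  # 1 0.0
```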

Ambiguity is much rarer with multiple prompts and the three output masks will usually become similar. To minimize computation of degenerate losses at training and ensure the single unambiguous mask receives a regular gradient signal, we only predict a single mask when more than one prompt is given. This is accomplished by adding a fourth output token for an additional mask prediction. This fourth mask is never returned for a single prompt and is the only mask returned for multiple prompts.

Losses.

We supervise mask prediction with a linear combination of focal loss [65] and dice loss [73] in a 20:1 ratio of focal loss to dice loss, following [20, 14]. Unlike [20, 14], we observe that auxiliary deep supervision after each decoder layer is unhelpful. The IoU prediction head is trained with mean-square-error loss between the IoU prediction and the predicted mask’s IoU with the ground truth mask. It is added to the mask loss with a constant scaling factor of 1.0.
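A hedged numpy sketch of the combined 20:1 focal [65] + dice [73] loss; the focal-loss γ and α values below are the common defaults from the focal loss paper, not confirmed SAM hyperparameters.

```python
import numpy as np

# Sketch of the mask training loss: 20 * focal + 1 * dice, on predicted
# foreground probabilities p and a binary ground truth gt.

def focal_loss(p, gt, gamma=2.0, alpha=0.25):   # gamma/alpha: assumed defaults
    p_t = np.where(gt == 1, p, 1 - p)
    alpha_t = np.where(gt == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-8))

def dice_loss(p, gt, eps=1.0):
    inter = (p * gt).sum()
    return 1 - (2 * inter + eps) / (p.sum() + gt.sum() + eps)

def mask_loss(p, gt):
    return 20.0 * focal_loss(p, gt) + 1.0 * dice_loss(p, gt)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1
perfect = gt * 0.999 + (1 - gt) * 0.001   # near-perfect prediction
bad = 1 - perfect                         # inverted prediction
print(mask_loss(perfect, gt) < mask_loss(bad, gt))  # True
```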

Training algorithm.

Following recent approaches [92, 37], we simulate an interactive segmentation setup during training. First, with equal probability either a foreground point or bounding box is selected randomly for the target mask. Points are sampled uniformly from the ground truth mask. Boxes are taken as the ground truth mask’s bounding box, with random noise added in each coordinate with standard deviation equal to 10% of the box sidelength, to a maximum of 20 pixels. This noise profile is a reasonable compromise between applications like instance segmentation, which produce a tight box around the target object, and interactive segmentation, where a user may draw a loose box.
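The box perturbation can be sketched directly from the description: per-coordinate Gaussian noise with standard deviation equal to 10% of the box side length, capped at 20 pixels.

```python
import numpy as np

# Sketch of simulating a loose user-drawn box from a ground-truth box.

def noisy_box(box, rng):
    """box = (x0, y0, x1, y1), the ground-truth mask's bounding box."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    std = np.array([0.1 * w, 0.1 * h, 0.1 * w, 0.1 * h])
    noise = np.clip(rng.normal(scale=std), -20, 20)   # cap noise at 20 px
    return np.array(box, float) + noise

rng = np.random.default_rng(0)
box = noisy_box((100, 100, 300, 200), rng)
print(box.shape)  # (4,)
```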

After making a prediction from this first prompt, subsequent points are selected uniformly from the error region between the previous mask prediction and the ground truth mask. Each new point is foreground or background if the error region is a false negative or false positive, respectively. We also supply the mask prediction from the previous iteration as an additional prompt to our model. To provide the next iteration with maximal information, we supply the unthresholded mask logits instead of the binarized mask. When multiple masks are returned, the mask passed to the next iteration and used to sample the next point is the one with the highest predicted IoU.
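The error-region sampling rule can be sketched as:

```python
import numpy as np

# Sketch of iterative point sampling: pick a point uniformly from the error
# region between the previous prediction and the ground truth; label it
# foreground for a false negative, background for a false positive.

def sample_correction_point(pred, gt, rng):
    error = pred != gt
    ys, xs = np.nonzero(error)
    i = rng.integers(len(ys))
    y, x = ys[i], xs[i]
    is_foreground = bool(gt[y, x])   # false negative -> foreground click
    return (y, x), is_foreground

rng = np.random.default_rng(0)
gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), bool)        # the model missed the object entirely
point, fg = sample_correction_point(pred, gt, rng)
print(fg)  # True
```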

We find diminishing returns after 8 iteratively sampled points (we have tested up to 16). Additionally, to encourage the model to benefit from the supplied mask, we also use two more iterations where no additional points are sampled. One of these iterations is randomly inserted among the 8 iteratively sampled points, and the other is always at the end. This gives 11 total iterations: one sampled initial input prompt, 8 iteratively sampled points, and two iterations where no new external information is supplied to the model so it can learn to refine its own mask predictions. We note that using a relatively large number of iterations is possible because our lightweight mask decoder requires less than 1% of the image encoder’s compute and, therefore, each iteration adds only a small overhead. This is unlike previous interactive methods that perform only one or a few interactive steps per optimizer update [70, 9, 37, 92].

Training recipe.

We use the AdamW [68] optimizer (β₁ = 0.9, β₂ = 0.999) and a linear learning rate warmup [42] for 250 iterations and a step-wise learning rate decay schedule. The initial learning rate (lr), after warmup, is 8e-4. We train for 90k iterations (~2 SA-1B epochs) and decrease the lr by a factor of 10 at 60k iterations and again at 86666 iterations. The batch size is 256 images. To regularize SAM, we set weight decay (wd) to 0.1 and apply drop path [53] (dp) with a rate of 0.4. We use a layer-wise learning rate decay [5] (ld) of 0.8. No data augmentation is applied. We initialize SAM from an MAE [47] pre-trained ViT-H. We distribute training across 256 GPUs, due to the large image encoder and 1024×1024 input size. To limit GPU memory usage, we train with up to 64 randomly sampled masks per GPU. Additionally, we find that lightly filtering SA-1B masks to discard any that cover more than 90% of the image qualitatively improves results.
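The learning rate schedule above can be written out explicitly as a sketch (250-iteration linear warmup to 8e-4, then 10× step decays at 60k and 86666 iterations):

```python
# Sketch of the step-wise learning rate schedule described above.

def lr_at(step: int, base_lr: float = 8e-4, warmup: int = 250) -> float:
    if step < warmup:
        return base_lr * (step + 1) / warmup   # linear warmup
    lr = base_lr
    for milestone in (60_000, 86_666):
        if step >= milestone:
            lr /= 10                            # step decay by 10x
    return lr

print(lr_at(1_000), lr_at(70_000), lr_at(89_000))
```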

For ablations and other variations on training (e.g., text-to-mask §D.5), we deviate from the default recipe above as follows. When training with data from the first and second data engine stages only, we augment the input with large-scale jitter [40] with a scale range of [0.1, 2.0]. Intuitively, data augmentation may be helpful when training data is more limited. To train ViT-B and ViT-L, we use 180k iterations with batch size 128 distributed across 128 GPUs. We set lr = 8e-4/4e-4, ld = 0.6/0.8, wd = 0.1, and dp = 0.6/0.4 for ViT-B/L, respectively.

Appendix B Automatic Mask Generation Details

Here we discuss details of the data engine’s fully automatic stage that was used to generate the released SA-1B.

Cropping.

Masks were generated from a regular grid of 32×32 points on the full image and 20 additional zoomed-in image crops arising from 2×2 and 4×4 partially overlapping windows using 16×16 and 8×8 regular point grids, respectively. The original high-resolution images were used for cropping (this was the only time we used them). We removed masks that touch the inner boundaries of the crops. We applied standard greedy box-based NMS (boxes were used for efficiency) in two phases: first within each crop and second across crops. When applying NMS within a crop, we used the model’s predicted IoU to rank masks. When applying NMS across crops, we ranked masks from most zoomed-in (i.e., from a 4×4 crop) to least zoomed-in (i.e., the original image), based on their source crop. In both cases, we used an NMS threshold of 0.7.
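The two-phase suppression can be illustrated with a plain greedy box-based NMS. This is a minimal sketch, not the released code (the names `box_iou` and `greedy_nms` are ours); the same routine would be applied first within each crop with scores given by predicted IoU, then across crops with scores given by zoom level.

```python
def box_iou(a, b):
    # Boxes are (x0, y0, x1, y1).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.7):
    # Standard greedy NMS: visit boxes in descending score order and keep
    # a box only if it does not overlap an already-kept box above threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

With the 0.7 threshold from the text, a box overlapping a higher-scoring box at IoU ≥ 0.7 is suppressed.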

Filtering.

We used three filters to increase mask quality. First, to keep only confident masks we filtered by the model’s predicted IoU score at a threshold of 88.0. Second, to keep only stable masks we compared two binary masks resulting from the same underlying soft mask by thresholding it at different values. We kept the prediction (i.e., the binary mask resulting from thresholding logits at 0) only if the IoU between its pair of -1 and +1 thresholded masks was equal to or greater than 95.0. Third, we noticed that occasionally an automatic mask would cover the entire image. These masks were generally uninteresting, and we filtered them by removing masks that covered 95% or more of an image. All filtering thresholds were selected to achieve both a large number of masks and high mask quality as judged by professional annotators using the method described in §5.
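The stability filter can be sketched as follows. This is our own minimal reimplementation (the name `stability_score` is ours): it measures the IoU between the same logit map thresholded at +1 and at −1; a mask would be kept when the score is at least 0.95, matching the 95.0 threshold in the text.

```python
def stability_score(logits, offset=1.0):
    # IoU between the binary mask thresholded at +offset and at -offset.
    # Since the +offset mask is a subset of the -offset mask, this reduces
    # to |logits > +offset| / |logits > -offset|.
    inter = union = 0
    for row in logits:
        for v in row:
            hi, lo = v > offset, v > -offset
            inter += hi and lo
            union += hi or lo
    return inter / union if union else 0.0
```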

Postprocessing.

We observed two error types that are easily mitigated with postprocessing. First, an estimated 4% of masks include small, spurious components. To address these, we removed connected components with area less than 100 pixels (including removing entire masks if the largest component is below this threshold). Second, another estimated 4% of masks include small, spurious holes. To address these, we filled holes with area less than 100 pixels. Holes were identified as components of inverted masks.
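This postprocessing can be sketched with a plain flood fill (our own reimplementation; the released pipeline operates on full-resolution masks): small foreground components are removed, then small components of the inverted mask, i.e. holes, are filled. Treating every small background component as a hole is an assumption about border handling on our part.

```python
from collections import deque

def _components(grid, value):
    # 4-connected components of cells equal to `value`.
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == value and (y, x) not in seen:
                comp, q = [], deque([(y, x)])
                seen.add((y, x))
                while q:
                    cy, cx = q.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] == value and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def clean_mask(mask, min_area=100):
    # First drop foreground components, then fill background holes,
    # whenever the component area is below min_area (100 px in the text).
    for value, fill in ((1, 0), (0, 1)):
        for comp in _components(mask, value):
            if len(comp) < min_area:
                for y, x in comp:
                    mask[y][x] = fill
    return mask
```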

Automatic mask generation model.

We trained a special version of SAM for fully automatic mask generation that sacrifices some inference speed for improved mask generation properties. It differs from our default SAM in the following ways: it was trained on manual and semi-automatic data only; it was trained for longer (177656 iterations instead of 90k) with large-scale jitter data augmentation [40]; simulated interactive training used only point and mask prompts (no boxes); only 4 points were sampled per mask during training (reducing from our default of 9 to 4 sped up training iterations and had no impact on 1-point performance, though it would harm mIoU if evaluating with more points); and finally, the mask decoder used 3 layers instead of 2.

SA-1B examples.

We show SA-1B samples in Fig. 2. For more examples, please see our dataset explorer.

Appendix C RAI Additional Details

Inferring geographic information for SA-1B.

While the images in SA-1B are not geo-tagged, each image has a caption describing its contents and where it was taken. We infer approximate image geo-locations from these captions using an Elmo-based named entity recognition model [78]. Each extracted location entity is mapped to every matching country, province, and city. Captions are mapped to a single country by first considering the matching countries, then provinces, and finally cities. We note that there are ambiguities and potential for biases with this method (e.g., “Georgia” may refer to the country or the US state). As such, we use the extracted locations to analyze the dataset as a whole, but do not release the inferred locations. The captions will not be released publicly as required by the image provider.

Inferring geographic information for COCO and Open Images.

The COCO [66] and Open Images [60] datasets do not provide geo-locations. Following [29], we retrieve geographic metadata using the Flickr API. We retrieved locations for 24% of the COCO training set (19,562 images) and for Open Images we retrieved 18% of the training set (493,517 images, after only considering images with masks). We note that the geographic information is approximate, and the sample of images with this information may not fully match the full dataset distribution.

Inferring income information.

We use each image’s inferred country to look up its income level using the levels defined by The World Bank [98]. We collapse the upper-middle and lower-middle levels into a single middle level.

Fairness in segmenting people.

To investigate SAM’s fairness at segmenting people we use the More Inclusive Annotations for People (MIAP) [87] test set annotations for Open Images [60], which allows us to compare SAM’s performance across perceived gender presentation and perceived age group. MIAP provides box annotations, while we need ground truth masks for this analysis. To get ground truth masks, we select each person-category mask from Open Images if its corresponding bounding box is within a 1% margin (based on relative box side lengths) of an annotated bounding box in MIAP, resulting in 3.9k masks.
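The box-matching step admits more than one reading of “within a 1% margin”; one plausible sketch (our interpretation, with our own function name) compares each box coordinate against a tolerance of 1% of the corresponding side length of the MIAP box.

```python
def boxes_match(a, b, margin=0.01):
    # Boxes are (x0, y0, x1, y1). Box `a` (from Open Images) matches box `b`
    # (from MIAP) when every coordinate differs by at most `margin` of the
    # corresponding side length of `b`. This per-coordinate reading is an
    # assumption; the paper only states "1% margin (based on relative box
    # side lengths)".
    w, h = b[2] - b[0], b[3] - b[1]
    tol = (margin * w, margin * h, margin * w, margin * h)
    return all(abs(ac - bc) <= t for ac, bc, t in zip(a, b, tol))
```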

| perceived gender presentation | mIoU at 1 point | mIoU at 3 points |
|---|---|---|
| feminine | 76.3±1.1 | 90.7±0.5 |
| masculine | 81.0±1.2 | 92.3±0.4 |

| perceived age group | mIoU at 1 point | mIoU at 3 points |
|---|---|---|
| older | 81.9±3.8 | 92.8±1.6 |
| middle | 78.2±0.8 | 91.3±0.3 |
| young | 77.3±2.7 | 91.5±0.9 |

Table 6: SAM’s performance segmenting clothing across perceived gender presentation and age group. The intervals for perceived gender are disjoint, with mIoU for masculine being higher. Confidence intervals for age group overlap.

Fairness in segmenting clothing.

We extend our analysis from §6 to clothing segmentation. We look at SAM’s performance on clothing relative to the attributes of those wearing the clothes. We use all 6.5k ground truth masks from Open Images that have a category under the clothing superclass and reside within a person box from MIAP. In Table 6 we compare performance across perceived gender presentation and age group. We find that SAM is better at segmenting clothing on those who present predominantly masculine, with disjoint 95% confidence intervals. The gap closes when moving from 1 to 3 point evaluation. Differences for perceived age group are not significant. Our results indicate there is a bias when segmenting clothing across perceived gender presentation with a one point prompt, and we encourage users of SAM to be mindful of this limitation.

Appendix D Experiment Implementation Details

D.1 Zero-Shot Single Point Valid Mask Evaluation

| dataset | abbreviation | image type | description | mask type | source split | # images sampled | # masks sampled |
|---|---|---|---|---|---|---|---|
| Plant Phenotyping Datasets Leaf Segmentation [74] | PPDLS | Plants | Leaf segmentation for images of tobacco and ara plants. | Instance | N/A | 182 | 2347 |
| BBBC038v1 from Broad Bioimage Benchmark Collection [12] | BBBC038v1 | Microscopy | Biological images of cells in a variety of settings testing robustness in nuclei segmentation. | Instance | Train | 227 | 10506 |
| Dataset fOr bOuldeRs Segmentation [80] | DOORS | Boulders | Segmentation masks of single boulders positioned on the surface of a spherical mesh. | Instance | DS1 | 10000 | 10000 |
| TimberSeg 1.0 [38] | TimberSeg | Logs | Segmentation masks of individual logs in piles of timber in various environments and conditions. Images are taken from an operator’s point-of-view. | Instance | N/A | 220 | 2487 |
| Northumberland Dolphin Dataset 2020 [100] | NDD20 | Underwater | Segmentation masks of two different dolphin species in images taken above and under water. | Instance | N/A | 4402 | 6100 |
| Large Vocabulary Instance Segmentation [44] | LVIS | Scenes | Additional annotations for the COCO [66] dataset to enable the study of long-tailed object detection and segmentation. | Instance | Validation (v0.5) | 945 | 9642 |
| STREETS [91] | STREETS | Traffic camera | Segmentation masks of cars in traffic camera footage. | Instance | N/A | 819 | 9854 |
| ZeroWaste-f [6] | ZeroWaste-f | Recycling | Segmentation masks in cluttered scenes of deformed recycling waste. | Instance | Train | 2947 | 6155 |
| iShape [111] | iShape | Irregular shapes | Segmentation masks of irregular shapes like antennas, logs, fences, and hangers. | Instance | Validation | 754 | 9742 |
| ADE20K [117] | ADE20K | Scenes | Object and part segmentation masks for images from SUN [107] and Places [116] datasets. | Instance | Validation | 302 | 10128 |
| Occluded Video Instance Segmentation [81] | OVIS | Occlusions | Instance segmentation masks in videos, focusing on objects that are occluded. | Instance | Train | 2044 | 10011 |
| Hypersim [86] | Hypersim | Simulation | Photorealistic synthetic dataset of indoor scenes with instance masks. | Instance | Evermotion archinteriors volumes 1-55 excluding 20,25,40,49 | 338 | 9445 |
| Night and Day Instance Segmented Park [22, 23] | NDISPark | Parking lots | Images of parking lots from video footage taken at day and night during different weather conditions and camera angles for vehicle segmentation. | Instance | Train | 111 | 2577 |
| EPIC-KITCHENS VISOR [28, 27] | VISOR | Egocentric | Segmentation masks for hands and active objects in ego-centric video from the cooking dataset EPIC-KITCHENS [27]. | Instance | Validation | 1864 | 10141 |
| Plittersdorf dataset [46] | Plittersdorf | Stereo images | Segmentation masks of wildlife in images taken with the SOCRATES stereo camera trap. | Instance | Train, validation, test | 187 | 546 |
| Egocentric Hand-Object Segmentation [113] | EgoHOS | Egocentric | Fine-grained egocentric hand-object segmentation dataset. Dataset contains mask annotations for existing datasets. | Instance | Train (including only Ego4D [43] and THU-READ [97, 96]) | 2940 | 9961 |
| InstanceBuilding 2D [17] | IBD | Drones | High-resolution drone UAV images annotated with roof instance segmentation masks. | Instance | Train (2D annotations) | 467 | 11953 |
| WoodScape [112] | WoodScape | Fisheye driving | Fisheye driving dataset with segmentation masks. Images are taken from four surround-view cameras. | Instance | Set 1 | 107 | 10266 |
| Cityscapes [25] | Cityscapes | Driving | Stereo video of street scenes with segmentation masks. | Panoptic | Validation | 293 | 9973 |
| PIDray [104] | PIDRay | X-ray | Segmentation masks of prohibited items in X-ray images of baggage. | Instance | Test (hard) | 3733 | 8892 |
| Diverse Realism in Art Movements [24] | DRAM | Paintings | Domain adaptation dataset for semantic segmentation of art paintings. | Semantic | Test | 718 | 1179 |
| TrashCan [52] | TrashCan | Underwater | Segmentation masks of trash in images taken by underwater ROVs. Images are sourced from the J-EDI [69] dataset. | Instance | Train (instance task) | 5936 | 9540 |
| Georgia Tech Egocentric Activity Datasets [34, 63] | GTEA | Egocentric | Videos are composed of four different subjects performing seven types of daily activities with segmentation masks of hands. | Instance | Train (segmenting hands task) | 652 | 1208 |

Table 7: Segmentation datasets used to evaluate zero-shot segmentation with point prompts. The 23 datasets cover a broad range of domains; see column “image type”. To make our evaluation efficient, we subsample datasets that have more than 15k masks. Specifically, we randomly sampled images so that the total number of masks in the images is ∼10k.

Datasets.

We built a new segmentation benchmark to evaluate the zero-shot transfer capabilities of our model using a suite of 23 diverse segmentation datasets from prior work. A description of each dataset is given in Table 7. For examples, see main text Fig. 8. This suite covers a range of domains including egocentric [34, 28, 113], microscopy [12], X-ray [104], underwater [52, 100], aerial [17], simulation [86], driving [25], and painting [24] images. For efficient evaluation we subsampled datasets with more than 15k masks. Specifically, we randomly picked images so that the total number of masks in the sampled images was ∼10k. We blurred faces of people in all the datasets.
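The subsampling procedure can be sketched as follows; this is a hypothetical helper (the name, the budget value, and the seed are our own choices): images are drawn in random order until the running mask total reaches the ∼10k budget.

```python
import random

def subsample_to_mask_budget(mask_counts, budget=10000, seed=0):
    # mask_counts: {image_id: number of masks in that image}.
    # Randomly order the images, then take them until the cumulative
    # mask count reaches the budget.
    ids = list(mask_counts)
    random.Random(seed).shuffle(ids)
    chosen, total = [], 0
    for img in ids:
        if total >= budget:
            break
        chosen.append(img)
        total += mask_counts[img]
    return chosen, total
```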

Point sampling.

Our default point sampling follows standard practice in interactive segmentation [109, 64, 92]. The first point is chosen deterministically as the point farthest from the object boundary. Each subsequent point is the farthest from the boundary of the error region between ground truth and the previous prediction. Some experiments (where specified) use a more challenging sampling strategy in which the first point is a random point, rather than a deterministically selected “center” point. Each subsequent point is selected as described above. This setting better reflects use cases in which the first point is not reliably near the center of the mask, such as prompting from eye gaze.
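The deterministic first-point choice can be sketched with a multi-source BFS distance transform. This is our own reimplementation (interactive segmentation toolkits typically use a Euclidean distance transform instead): distances are propagated from background pixels, and the foreground pixel farthest from the boundary is returned. Subsequent points would apply the same selection to the error region (the XOR of prediction and ground truth).

```python
from collections import deque

def farthest_from_boundary(mask):
    # Multi-source BFS from all background pixels; returns the foreground
    # pixel with the largest distance, or None if the mask has no background.
    h, w = len(mask), len(mask[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0:
                dist[y][x] = 0
                q.append((y, x))
    while q:
        y, x = q.popleft()
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                q.append((ny, nx))
    best, best_d = None, -1
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and dist[y][x] is not None and dist[y][x] > best_d:
                best, best_d = (y, x), dist[y][x]
    return best
```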

Evaluation.

We measure IoU between a prediction after N point prompts and a ground truth mask, where N = {1, 2, 3, 5, 9} and points are sampled iteratively with either of the strategies described above. The per-dataset mIoU is the per-mask IoU averaged across all objects in the dataset. Finally, we report the top-line metric by averaging the per-dataset mIoUs across all 23 datasets. Our evaluation differs from the standard interactive segmentation evaluation protocol, which measures the average number of points needed to achieve X% IoU, with up to 20 points. We focus on predictions after just one, or possibly a few points, since many of our use cases involve a single or very few prompts. Given our application focus, which requires real-time prompt processing, we expect the best interactive segmentation models to outperform SAM when using a large number of points.
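The metric aggregation is straightforward to sketch (the helper names below are ours): per-mask IoUs are averaged within each dataset, and the per-dataset mIoUs are averaged to give the top-line number.

```python
def mask_iou(pred, gt):
    # Binary masks as 2D lists of 0/1.
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 1.0

def benchmark_miou(per_dataset_ious):
    # per_dataset_ious: {dataset name: [IoU for each evaluated mask]}.
    # Top-line metric: mean over datasets of the per-dataset mean IoU.
    per_dataset = {d: sum(v) / len(v) for d, v in per_dataset_ious.items()}
    return sum(per_dataset.values()) / len(per_dataset)
```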

Baselines.

We use three recent strong interactive baselines: RITM [92], FocalClick [18], and SimpleClick [67]. For each, we use the largest models trained on the broadest datasets publicly released by the authors. For RITM, we use HRNet32 IT-M trained on the combination of COCO [66] and LVIS [44] introduced by the authors. For FocalClick, we use SegFormerB3-S2 trained on a “combined dataset” that includes 8 different segmentation datasets [18]. For SimpleClick, we use ViT-H448 trained on a combination of COCO and LVIS. We follow the suggested default strategies for data pre-processing (i.e., data augmentations or image resizing) and do not change or adapt any parameters for our evaluation. In our experiments, we observe that RITM outperforms other baselines on our 23 dataset suite with 1 point evaluation. Therefore, we use RITM as the default baseline. When evaluating with more points we report results for all baselines.

Single point ambiguity and oracle evaluation.

In addition to IoU after N point prompts, we report SAM’s “oracle” performance at 1 point by evaluating the predicted mask that best matches ground truth from amongst SAM’s three predictions (rather than using the one that SAM itself ranks first, as we do by default). This protocol addresses possible single point prompt ambiguity by relaxing the requirement to guess the one right mask among several valid objects.

Figure 15: Additional visualizations of zero-shot edge predictions on BSDS500 (columns: image, ground truth, SAM). Recall that SAM was not trained to predict edge maps and did not have access to BSDS images and annotations during training.

D.2 Zero-Shot Edge Detection

Dataset and metrics.

We perform zero-shot edge detection experiments on BSDS500 [72, 3]. The ground truth for each image comes from the manual annotations of five different subjects. We report results on the 200 image test subset using the four standard metrics for edge detection [3, 32]: optimal dataset scale (ODS), optimal image scale (OIS), average precision (AP), and recall at 50% precision (R50).

Method.

For zero-shot transfer, we use a simplified version of our automatic mask generation pipeline. We prompt SAM with a 16×16 regular grid of foreground points, which yields 768 predicted masks (three per point). We do not filter by predicted IoU or stability. Redundant masks are removed by NMS. Then we apply a Sobel filter to the remaining masks’ unthresholded probability maps and set values to zero if they do not intersect with the outer boundary pixels of a mask. Finally, we take a pixel-wise max over all the predictions, linearly normalize the result to [0, 1], and apply edge NMS [13] to thin the edges.
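The Sobel and max/normalize steps can be sketched as follows. This is our own reimplementation with a plain 3×3 Sobel pair; it omits the boundary-intersection zeroing and the final edge NMS described above.

```python
def sobel_magnitude(p):
    # Gradient magnitude of a 2D probability map via 3x3 Sobel kernels.
    # Border pixels are left at zero for simplicity.
    h, w = len(p), len(p[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (p[y-1][x+1] + 2*p[y][x+1] + p[y+1][x+1]
                  - p[y-1][x-1] - 2*p[y][x-1] - p[y+1][x-1])
            gy = (p[y+1][x-1] + 2*p[y+1][x] + p[y+1][x+1]
                  - p[y-1][x-1] - 2*p[y-1][x] - p[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def combine_edge_maps(maps):
    # Pixel-wise max over all per-mask edge maps, then linear [0, 1] scaling.
    h, w = len(maps[0]), len(maps[0][0])
    out = [[max(m[y][x] for m in maps) for x in range(w)] for y in range(h)]
    peak = max(max(row) for row in out)
    if peak > 0:
        out = [[v / peak for v in row] for row in out]
    return out
```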

Visualizations.

In Fig. 15, we show additional examples of zero-shot edge predictions from SAM. These qualitative examples further illustrate how SAM tends to output sensible edge maps, despite not being trained for edge detection, and we see that the edges can align well with the human annotations. However, since SAM is not trained for edge detection, it does not learn the biases of the BSDS500 dataset and often outputs more edges than are present in the ground truth annotations.

D.3 Zero-Shot Object Proposals

Dataset and metrics.

We report the standard average recall (AR) metric for masks at 1000 proposals on the LVIS v1 validation set [44]. Since LVIS has high-quality masks for 1203 object classes, it provides a challenging test for object proposal generation. We focus on AR@1000 due to the open-world nature of our model, which will likely produce many valid masks outside even the 1203 classes in LVIS. To measure performance on frequent, common, and rare categories, we use AR@1000 but measured against a ground truth set containing just the corresponding LVIS categories.

Baseline.

We use cascade ViTDet-H as a baseline, the strongest model from [62] by AP on LVIS. As noted in the main text, an object detector trained in-domain can “game” AR [16] and is expected to be a stronger baseline than other models that focus on open-world proposals or segmentation [58, 105]. To produce 1000 proposals, we disable score thresholding in the three cascade stages and raise the maximum number of predictions per stage to 1000.

Method.

We use a modified version of SAM’s automatic mask generation pipeline for zero-shot transfer. First, to make inference time comparable to that of ViTDet, we do not process image crops. Second, we remove filtering by predicted IoU and stability. This leaves two tunable parameters to get ∼1000 masks per image: the input point grid and the NMS threshold used for duplicate mask suppression. We choose a 64×64 point grid and an NMS threshold of 0.9, which produces ∼900 masks per image on average. At evaluation, if greater than 1000 masks have been proposed in an image, they are ranked by the average of their confidence and stability scores, then truncated to the top 1000 proposals.
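The ranking-and-truncation step at evaluation can be sketched as follows (the mask records with `confidence` and `stability` fields are a hypothetical representation of ours):

```python
def top_proposals(masks, limit=1000):
    # Each mask is a dict carrying "confidence" and "stability" in [0, 1].
    # Rank by the average of the two scores and keep the best `limit`.
    ranked = sorted(masks,
                    key=lambda m: (m["confidence"] + m["stability"]) / 2,
                    reverse=True)
    return ranked[:limit]
```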

We hypothesize that SAM’s ability to output multiple masks is especially valuable for this task, since recall should benefit from proposals generated at multiple scales from a single input point. To test this, we compare to an ablated version of SAM that only outputs a single mask instead of three (SAM - single-output). Since this model produces fewer masks, we further increase the number of points sampled and the NMS threshold to 128×128 and 0.95, respectively, obtaining ∼950 masks per image on average. Additionally, single-output SAM does not produce the IoU score used to rank masks for NMS in the automatic mask generation pipeline, so masks are instead ranked randomly. Testing suggests this has similar performance to more sophisticated methods of ranking masks, such as using the max logit value of the mask as a proxy for model confidence.

Figure 16: Zero-shot instance segmentation on LVIS v1 (columns: ground truth, ViTDet, SAM). SAM produces higher quality masks than ViTDet. As a zero-shot model, SAM does not have the opportunity to learn specific training data biases; see top-right as an example where SAM makes a modal prediction, whereas the ground truth in LVIS is amodal given that mask annotations in LVIS have no holes.

D.4 Zero-Shot Instance Segmentation

Method.

For zero-shot instance segmentation, we prompt SAM with the boxes output by a fully-supervised ViTDet-H on COCO and LVIS v1 validation splits. We apply an additional mask refinement iteration by feeding the most confident predicted mask, together with the box prompt, back to the mask decoder to produce the final prediction. We show zero-shot instance segmentations predicted on LVIS in Fig. 16. Compared to ViTDet, SAM tends to produce higher quality masks with cleaner boundaries. We confirm this observation with human studies in §7.4. Note that as a zero-shot model, SAM is not able to learn annotation biases in a dataset. For instance, we see that SAM makes a valid modal prediction for the plate, whereas LVIS masks cannot contain holes by design so the plate is annotated amodally.

D.5 Zero-Shot Text-to-Mask

Model and training.

We use the largest publicly available CLIP model [82] (ViT-L/14@336px) to compute text and image embeddings, which we ℓ2-normalize prior to use. To train SAM, we use masks from the first two stages of our data engine. Moreover, we discard all masks with an area smaller than 100² pixels. We train this model with large-scale jitter [40] for 120k iterations with batch size 128. All other training parameters follow our default settings.

Generating training prompts.

To extract an input prompt we first expand the bounding box around each mask by a random factor from 1× to 2×, square-crop the expanded box to maintain its aspect ratio, and resize it to 336×336 pixels. Before feeding the crop to the CLIP image encoder, with 50% probability we zero-out pixels outside the mask. To ensure the embedding focuses on the object, we use masked attention in the last layer to restrict attention from the output token to the image positions inside the mask. Finally, our prompt is the output token embedding. For training we supply the CLIP-based prompt first, followed by additional iterative point prompts to refine the prediction.
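The crop-extraction geometry can be sketched as follows; this is one plausible reading of the “square-crop” step (our own function, which pads the shorter side to a square rather than cropping the longer one), with the 336×336 resize assumed to happen downstream.

```python
import random

def expand_and_square(box, img_w, img_h, rng=None):
    # box = (x0, y0, x1, y1) around a mask. Expand about the box center by
    # a random factor in [1, 2], then grow the shorter side to a square so
    # the later 336x336 resize does not distort the aspect ratio. Padding
    # (rather than cropping) to a square is our assumption.
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    factor = rng.uniform(1.0, 2.0)
    side = max((x1 - x0) * factor, (y1 - y0) * factor)
    half = side / 2
    # Clamp the crop to the image bounds.
    return (max(0.0, cx - half), max(0.0, cy - half),
            min(float(img_w), cx + half), min(float(img_h), cy + half))
```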