
License: arXiv.org perpetual non-exclusive license
arXiv:2401.14159v1 [cs.CV] 25 Jan 2024

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

International Digital Economy Academy (IDEA) & Community

Code & Demo: https://github.com/IDEA-Research/Grounded-Segment-Anything
Abstract

We introduce Grounded SAM, which uses Grounding DINO [38] as an open-set object detector in combination with the segment anything model (SAM) [25]. This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig. 1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP [31] and Recognize Anything [83]. Additionally, incorporating Stable-Diffusion [52] allows for controllable image editing, while the integration of OSX [33] facilitates promptable 3D human motion analysis. Grounded SAM also shows superior performance on open-vocabulary benchmarks, achieving 48.7 mean AP on the SegInW (Segmentation in the Wild) zero-shot benchmark with the combination of Grounding DINO-Base and SAM-Huge models.

Figure 1: Grounded SAM can simultaneously detect and segment corresponding regions within images based on arbitrary text inputs provided by users. It can also seamlessly integrate with other Open-World models to accomplish more intricate visual tasks.

1 Introduction

Visual perception and understanding tasks in open-world scenarios are crucial for the advancement of applications such as autonomous driving, robotic navigation, and intelligent security surveillance. These applications demand robust and versatile visual perception models capable of interpreting and interacting with open-world environments.

Currently, there are three primary methodologies to address the challenges in open-world visual perception. First, the Unified Model approach involves training models like UNINEXT [66] and OFA [59] on multiple datasets to support various vision tasks. This method also includes training large language models on different visual question-answering datasets to unify tasks, like LLaVA [34], InstructBLIP [12], Qwen-VL [3] and other MLLMs [60, 40, 80]. However, a significant limitation of such an approach is its limited scope in data, especially in complex tasks like open-set segmentation. Second, the LLM as Controller method attempts to bridge vision experts with language models. Examples include HuggingGPT [55], Visual ChatGPT [62], and LLaVA-Plus [35]. These approaches leverage the linguistic comprehension capabilities of large language models to direct various visual tasks. However, this method is heavily dependent on the capabilities, and constrained by the limitations, of large language models. Third, the Ensemble Foundation Models approach seeks to accomplish open-world tasks in complex scenarios by collaboratively integrating expert models designed for specific contexts. This approach offers flexibility by combining the strengths of various specialized models.

Although these methodologies have advanced open-world tasks, a robust pipeline capable of supporting complex and fundamental open-world tasks such as open-set segmentation is still lacking. Grounded SAM takes an innovative Ensemble Foundation Models approach, pioneering the integration of open-set detector models, such as Grounding DINO [38], and promptable segmentation models like SAM [25]. It effectively tackles the open-set segmentation challenge by dividing it into two main components: open-set detection and promptable segmentation. Based on this approach, Grounded SAM offers a powerful and comprehensive platform that further facilitates an efficient fusion of different expert models to tackle more intricate open-world tasks.

Building upon Grounded SAM as a foundation and leveraging its robust open-set segmentation capabilities, we can easily incorporate additional open-world models. For instance, when combined with Recognize Anything (RAM) [83], the RAM-Grounded-SAM model can automatically identify and segment things or objects within images without the need for any textual input, thus facilitating automatic image annotation tasks. Similar automatic image annotation capabilities can also be achieved through integration with BLIP [31]. Furthermore, Grounded SAM, when coupled with the inpainting capability of Stable Diffusion, as exemplified by the Grounded-SAM-SD model, can execute highly precise image editing tasks. We will provide a more detailed discussion of Grounded SAM and its augmented capabilities through the incorporation of additional open-world models in Section 3.

2 Related Work

2.1 Task-specific Vision Models

In the field of computer vision, significant advancements have been made across a variety of tasks, including image recognition [47, 31, 18, 83, 17], generic object detection [49, 87, 43, 27, 77, 36, 19, 38, 51, 50, 20, 30], generic image segmentation [9, 8, 26, 29, 78, 88, 79, 25, 79, 16, 28], referring object detection and segmentation [41, 37, 86], object tracking [67, 84], image generation [75, 54, 48, 45, 23, 14, 52, 22, 82, 44], image editing [42, 1, 2, 53, 21], human-centric perception and understanding [72, 71, 73, 70, 69, 4, 33, 74], and human-centric motion generation [39, 6, 32, 61, 5]. However, despite these advancements, current models are mostly task-specific and usually fall short in addressing a broader range of tasks.

2.2 Unified Models

Unified models have been developed to address multiple tasks. In the language field, large language models (LLMs) such as GPT-3 [13], LaMDA [57], and PaLM [11] are examples of general-purpose unified models, which handle language tasks through an auto-regressive and generative approach. Unlike language tasks that rely on a unified and structured token representation, vision tasks encompass many data formats, including pixel, spatial (e.g., box, key point), temporal, and others. Recent works have attempted to develop unified vision models from two perspectives to accommodate these diverse modalities. First, some models aim to unify various vision modalities into a single one. For instance, Pix2Seq [7] and OFA [59] attempt to merge spatial modalities such as box coordinates into language. Second, some models seek a unified model compatible with different modality outputs. UNINEXT [66] is an example that supports different instance-level task outputs. Although these unified vision models are advancing the progress of general intelligence, existing models can only handle a limited number of tasks and often fall short of task-specific models in performance.

2.3 Model Assembly with a Controller System

Orthogonal to our work, Visual ChatGPT [62] and HuggingGPT [55] propose to leverage LLMs to control different AI models for solving different tasks. Compared with these models, the foundation model assembling method does not employ an LLM as the controller, which makes the whole pipeline more efficient and flexible. We show that complex tasks can be decoupled, and step-by-step visual reasoning can be accomplished in a training-free model assembly manner.

3 Grounded SAM Playground

In this section, using Grounded SAM as a foundation, we demonstrate how expert models from various domains can be assembled to accomplish more comprehensive visual tasks.

3.1 Preliminary

We discuss the basic components of Grounded SAM and other domain expert models here.

Segment Anything Model (SAM) [25] is an open-world segmentation model that can "cut out" any object in any image given proper prompts, such as points, boxes, or text. It has been trained on over 11 million images and 1.1 billion masks. Despite its strong zero-shot performance, the model cannot identify the masked objects based on an arbitrary text input and normally requires point or box prompts to run.
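
To make SAM's prompt interface concrete, here is a minimal sketch, assuming the segment_anything Python package and a locally downloaded ViT-H checkpoint (file paths and the example point are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H SAM model from a local checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image; set_image computes the image embedding once.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt at pixel (x, y); label 1 means "foreground".
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # returns three candidate masks at different granularities
)
best_mask = masks[np.argmax(scores)]  # boolean H x W array
```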

Grounding DINO [38] is an open-set object detector that can detect any objects given an arbitrary free-form text prompt. The model was trained on over 10 million images, including detection data, visual grounding data, and image-text pairs, and has strong zero-shot detection performance. However, the model needs text as input and can only detect boxes with corresponding phrases.

OSX [33] is the state-of-the-art model for expressive whole-body mesh recovery, which aims to jointly estimate 3D human body poses, hand gestures, and facial expressions from monocular images. It first needs to detect human boxes, crop and resize them, and then conduct single-person mesh recovery.

BLIP [31] is a vision-language model that unifies vision-language understanding and generation tasks. We use the image caption model of BLIP in our experiments. The caption model can generate descriptions given any image. However, the model cannot perform object-level tasks, like detecting or segmenting objects.

Recognize Anything Model (RAM) [83] is a strong image tagging model that can recognize any common category in an input image with high accuracy. However, RAM can only generate tags and cannot produce precise boxes and masks for the recognized categories.

Stable Diffusion [52] is an image generation model that samples images from the learned distribution of the training data. Its most widely used application is generating images from text prompts. We use its inpainting variant in our experiments. Despite its impressive generation results, the model cannot perform perception or understanding tasks.

ChatGPT & GPT-4 [15, 46] are large language models developed using the GPT (Generative Pre-trained Transformer) architecture, which is used for building conversational AI agents. It is trained on massive amounts of text data and can generate human-like responses to user input. The model can understand the context of the conversation and generate appropriate responses that are often indistinguishable from those of a human.
ChatGPT 和 GPT-4 [ 15, 46] 是使用 GPT(生成预训练转换器)架构开发的大型语言模型,用于构建会话式人工智能代理。它在海量文本数据的基础上进行训练,能对用户输入生成类似人类的反应。该模型可以理解对话的上下文,并生成适当的回复,这些回复通常与人类的回复无异。

Figure 2: Grounded-SAM effectively detects and segments objects according to various user inputs. Its effectiveness is not limited to common cases but also extends to long-tail object categories (e.g., "Zale Horrida" and "Gazania Linearis"). Some of the demo images were sampled from the V3Det [58] dataset. We greatly appreciate their excellent work.

3.2 Grounded SAM: Open-Vocabulary Detection and Segmentation

Determining the masks in an image that correspond to regions mentioned in arbitrary user-provided text, and thereby enabling finer-grained image understanding tasks like open-set segmentation, is highly challenging. This is primarily due to the limited availability of high-quality data for segmentation-in-the-wild tasks, which makes it difficult for a model to achieve precise open-set segmentation under such data scarcity. In contrast, open-set detection tasks are more tractable, primarily for two reasons. First, the annotation cost of detection data is relatively low compared to segmentation, enabling the collection of more high-quality annotated data. Second, open-set detection only requires identifying the corresponding object coordinates in the image based on the given text, without the need for precise pixel-level object masks. Similarly, predicting the corresponding object mask conditioned on a box, which benefits from the prior knowledge of the box's location, is more efficient than directly predicting a region mask from text. This approach has been validated in previous works such as OpenSeeD [79], and the substantial issue of data scarcity can be largely addressed by utilizing the SA-1B dataset developed in SAM [25].

Consequently, inspired by prior successful works such as Grounded Pre-training [81, 38] and SAM [25], we aim to address complex segmentation-in-the-wild tasks by combining strong open-set foundation models. Given an input image and a text prompt, we first employ Grounding DINO to generate precise boxes for objects or regions within the image, conditioned on the textual information. Subsequently, the boxes obtained through Grounding DINO serve as the box prompts for SAM to generate precise mask annotations. By leveraging the capabilities of these two robust expert models, the open-set detection and segmentation tasks can be accomplished more effortlessly. As illustrated in Fig. 2, Grounded SAM can accurately detect and segment objects based on user-provided text in both conventional and long-tail scenarios.
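
The following is a minimal sketch of this two-stage pipeline, assuming the groundingdino and segment_anything packages from the linked repository; checkpoint/config paths, thresholds, and the text prompt are placeholders, and helper names may differ slightly across repository versions:

```python
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Stage 1: open-set detection with Grounding DINO (text prompt -> boxes + phrases).
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("example.jpg")  # (RGB numpy array, preprocessed tensor)
boxes, logits, phrases = predict(
    model=dino, image=image, caption="horse. clouds. grass.",
    box_threshold=0.35, text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; convert to absolute xyxy for SAM.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy"
).numpy()

# Stage 2: promptable segmentation with SAM (box prompts -> masks).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks = [
    predictor.predict(box=box, multimask_output=False)[0][0]  # one mask per detected box
    for box in boxes_xyxy
]
```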

3.3 RAM-Grounded-SAM: Automatic Dense Image Annotation

An automatic image annotation system has numerous practical applications, such as improving the efficiency of manual data annotation, reducing its cost, or providing real-time scene annotation and understanding in autonomous driving to enhance driving safety. Grounded SAM leverages the capabilities of Grounding DINO: users can input arbitrary categories or captions, which are then automatically matched with entities within the images. Building upon this foundation, we can employ either an image-caption model (like BLIP [31] and Tag2Text [18]) or an image tagging model (like RAM [83]), using their output results (captions or tags) as inputs to Grounded SAM and generating precise boxes and masks for each instance. This enables the automatic labeling of an entire image, achieving an automated labeling system. As depicted in Fig. 3, RAM-Grounded-SAM can automatically perform category prediction and provide dense annotations for input images across various scenarios. This significantly reduces the annotation cost and greatly enhances the flexibility of image annotation.
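
A sketch of the automatic-annotation glue logic; get_image_tags (standing in for RAM/BLIP inference) and grounded_sam (the Grounding DINO + SAM pipeline from Section 3.2) are hypothetical wrappers introduced only for illustration:

```python
def auto_annotate(image_path: str) -> dict:
    """Tag an image, then ground and segment every tagged category."""
    # Hypothetical tagging wrapper: returns e.g. ["dog", "frisbee", "grass"].
    tags = get_image_tags(image_path)

    # Grounding DINO expects categories joined with periods as its text prompt.
    prompt = ". ".join(tags) + "."

    # Hypothetical wrapper around the two-stage pipeline sketched in Section 3.2:
    # returns per-instance (phrase, xyxy box, binary mask) triples.
    instances = grounded_sam(image_path, prompt)

    return {
        "image": image_path,
        "annotations": [
            {"label": phrase, "box": [float(v) for v in box], "mask_area": int(mask.sum())}
            for phrase, box, mask in instances
        ],
    }
```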

Figure 3: RAM-Grounded-SAM combines the robust tagging capabilities of the RAM [83] with the open-set detection and segmentation abilities of Grounded SAM, which enables automatic dense image annotation with only image input (the demo images are sampled from the SA-1B [25] dataset).

3.4 Grounded-SAM-SD: Highly Accurate and Controllable Image Editing

By integrating the powerful text-to-image capability of image generation models with Grounded SAM, we can establish a comprehensive framework that enables the creation of a robust data synthesis factory, supporting fine-grained operations at the part-level, instance-level, and semantic-level. As shown in Fig. 4, users can obtain precise masks through interactive methods such as clicking or drawing bounding boxes within this pipeline. Moreover, users can leverage the capability of grounding, combined with text prompts, to automatically locate corresponding regions of interest. Building upon this foundation, with the additional capability of an image generation model, we can achieve highly precise and controlled image manipulation, including modifying the image representation, replacing objects, removing the corresponding regions, etc. In downstream scenarios where data scarcity arises, our system can generate new data, addressing the data requirements for the training of models.
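
A minimal sketch of the editing step, assuming the region mask has already been produced by Grounded SAM for a text query and using the Hugging Face diffusers inpainting pipeline (the model id, prompt, and the pre-computed image/mask variables are assumptions for illustration):

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# `image` (RGB numpy array) and `mask` (boolean H x W array) are assumed to come
# from the Grounded SAM pipeline, e.g. for the query "dog".
image_pil = Image.fromarray(image).resize((512, 512))
mask_pil = Image.fromarray((mask * 255).astype(np.uint8)).resize((512, 512))  # white = repaint

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

# Replace only the masked object; the rest of the image stays untouched.
edited = pipe(
    prompt="a corgi wearing sunglasses",
    image=image_pil,
    mask_image=mask_pil,
).images[0]
edited.save("edited.jpg")
```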

Figure 4: Grounded-SAM-SD combines the open-set ability of Grounded SAM with inpainting.

3.5 Grounded-SAM-OSX: Promptable Human Motion Analysis

Previous expressive whole-body mesh recovery methods first detect all (instance-agnostic) human boxes and then conduct single-person mesh recovery. In many real-world applications, we need to specify the target person to be detected and analyzed. However, existing human detectors cannot distinguish between different instances (e.g., they cannot be asked to analyze "a person with pink clothes"), making fine-grained human motion analysis challenging. As shown in Fig. 5, we can integrate the Grounded SAM and OSX [33] models to achieve novel promptable (instance-specific) whole-body human detection and mesh recovery, thereby realizing a promptable human motion analysis system. Specifically, given an image and a prompt referring to a specific person, we first use Grounded SAM to generate a precise box for that person. Then, we use OSX to estimate an instance-specific human mesh to complete the process.
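
A short sketch of the promptable flow; grounded_sam is the pipeline from Section 3.2 and recover_mesh_osx stands in for OSX's single-person mesh-recovery call (both are hypothetical wrappers, since the exact interfaces are not specified here):

```python
def promptable_motion_analysis(image, person_prompt: str):
    """Locate the referred person with Grounded SAM, then run OSX on the crop."""
    # Hypothetical wrapper: returns (phrase, xyxy box, mask) triples sorted by confidence.
    instances = grounded_sam(image, person_prompt)  # e.g. "a person with pink clothes"
    if not instances:
        return None

    # Crop the highest-confidence human box for single-person mesh recovery.
    _, (x0, y0, x1, y1), _ = instances[0]
    crop = image[int(y0):int(y1), int(x0):int(x1)]

    # Hypothetical OSX call: expressive whole-body mesh recovery on the crop,
    # returning SMPL-X-style body pose, hand, and face parameters.
    return recover_mesh_osx(crop)
```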

Figure 5: Grounded-SAM-OSX merges the text-promptable capability of Grounded SAM with the whole body mesh recovery ability of OSX [33], facilitating a precise human motion analysis system.

3.6 More Extensions for Grounded SAM

Table 1: Zero-shot benchmarking results of Grounded-SAM in SGinW. The best and second-best results are highlighted in bold and underlined, respectively. * means the results were tested by the SAM-HQ [24] team. We are immensely thankful for their assistance in conducting these tests and highly appreciate their work.
Method | mean SGinW | Elephants | Hand-Metal | Watermelon | House-Parts | HouseHold-Items | Strawberry | Fruits | Nutterfly-Squireel | Hand | Garbage | Chicken | Rail | Airplane-Parts | Brain-Tumor | Poles | Electric-Shaver | Bottles | Toolkits | Trash | Salmon-Fillet | Puppies | Tablets | Phones | Cows | Ginger-Garlic
X-Decoder-T [88] | 22.6 | 65.6 | 22.4 | 16.2 | 5.5 | 50.6 | 41.6 | 66.5 | 62.1 | 0.6 | 28.7 | 12.0 | 0.7 | 10.5 | 1.1 | 3.6 | 1.2 | 19.0 | 9.5 | 19.3 | 15.0 | 48.9 | 15.2 | 29.9 | 12.0 | 7.9
X-Decoder-L-IN22K | 26.6 | 63.9 | 20.3 | 13.5 | 4.9 | 50.5 | 74.4 | 79.1 | 58.8 | 0.0 | 24.3 | 3.5 | 1.3 | 12.3 | 0.5 | 13.4 | 18.8 | 43.2 | 14.6 | 20.1 | 12.3 | 57.3 | 6.9 | 43.4 | 12.3 | 15.6
X-Decoder-B | 27.7 | 68.0 | 18.5 | 13.0 | 6.7 | 51.7 | 81.6 | 76.7 | 53.1 | 20.6 | 30.2 | 13.6 | 0.8 | 13.0 | 0.3 | 5.6 | 4.2 | 45.9 | 13.9 | 27.3 | 18.2 | 55.4 | 8.0 | 8.9 | 36.8 | 19.4
X-Decoder-L | 32.2 | 66.0 | 42.1 | 13.8 | 7.0 | 53.0 | 67.1 | 79.2 | 68.4 | 75.9 | 33.0 | 8.6 | 2.3 | 13.1 | 2.2 | 20.1 | 7.5 | 42.1 | 9.9 | 22.3 | 19.0 | 59.0 | 22.5 | 15.6 | 44.9 | 11.6
OpenSeeD-L [79] | 36.7 | 72.9 | 38.7 | 52.3 | 1.8 | 50.0 | 82.8 | 76.4 | 40.0 | 92.7 | 16.9 | 82.9 | 1.8 | 13.0 | 2.1 | 4.6 | 4.7 | 39.7 | 15.4 | 15.3 | 15.0 | 74.6 | 47.4 | 7.6 | 40.9 | 13.6
ODISE-L [64] | 38.7 | 74.9 | 51.4 | 37.5 | 9.3 | 60.4 | 79.9 | 81.3 | 71.9 | 41.4 | 39.8 | 84.1 | 2.8 | 15.8 | 2.9 | 0.4 | 18.3 | 37.7 | 15.0 | 28.6 | 30.2 | 65.4 | 9.1 | 43.8 | 41.6 | 23.0
SAN-CLIP-ViT-L [65] | 41.4 | 67.4 | 62.9 | 43.5 | 9.0 | 60.1 | 81.8 | 77.4 | 82.2 | 88.8 | 46.5 | 69.2 | 2.9 | 13.2 | 2.6 | 1.8 | 11.4 | 48.8 | 31.2 | 41.4 | 20.0 | 60.1 | 35.1 | 10.4 | 44.0 | 23.3
UNINEXT-H [66] | 42.1 | 72.1 | 57.0 | 56.3 | 0.0 | 54.0 | 80.7 | 81.1 | 84.1 | 93.7 | 16.9 | 75.2 | 0.0 | 15.1 | 2.6 | 13.4 | 71.2 | 46.1 | 10.1 | 10.8 | 44.4 | 64.6 | 21.0 | 6.1 | 52.7 | 23.7
Grounded-HQ-SAM (B+H)* [24] | 49.6 | 77.5 | 81.2 | 65.6 | 8.5 | 60.1 | 85.6 | 82.3 | 77.1 | 74.8 | 25.0 | 84.5 | 7.7 | 37.6 | 12.0 | 20.1 | 72.1 | 66.3 | 21.8 | 30.0 | 42.2 | 50.1 | 29.7 | 35.3 | 47.8 | 45.6
Grounded-SAM (B+H)* | 48.7 | 77.9 | 81.2 | 64.2 | 8.4 | 60.1 | 83.5 | 82.3 | 71.3 | 70.0 | 24.0 | 84.5 | 8.7 | 37.2 | 11.9 | 23.3 | 71.7 | 65.4 | 20.8 | 30.0 | 32.9 | 50.1 | 29.8 | 35.4 | 47.5 | 45.8
Grounded-SAM (L+H) | 46.0 | 78.6 | 75.2 | 61.5 | 7.2 | 35.0 | 82.5 | 86.9 | 70.9 | 90.7 | 28.2 | 84.6 | 7.2 | 38.4 | 10.2 | 17.4 | 59.7 | 43.7 | 26.9 | 22.4 | 27.1 | 63.2 | 38.6 | 3.4 | 49.4 | 40.0

In addition to the aforementioned primary applications, Grounded SAM can further expand its scope of utilization by integrating more models. For instance, in the data labeling process, Grounded SAM can work with faster SAM variants, such as FastSAM [85], MobileSAM [76], Light-HQ-SAM [24], and EfficientSAM [63]. This can significantly reduce the overall inference time and expedite the labeling workflow. Grounded SAM can also leverage the HQ-SAM [24] model, which is capable of generating higher-quality masks, to enhance the quality of annotations. In the realm of image editing, Grounded SAM can also synergize with newly proposed generative models such as Stable-Diffusion-XL [52] to achieve higher-quality image editing. Furthermore, it can be integrated with models like LaMa [56] and PaintByExample [68] to achieve precise image erasure and customized image editing. Grounded SAM can also integrate with tracking models such as DEVA [10] to perform object tracking based on specific text prompts.

4 Effectiveness of Grounded SAM

To validate the effectiveness of Grounded SAM, we evaluate its performance on the Segmentation in the Wild (SGinW) zero-shot benchmark, which comprises 25 in-the-wild datasets. As demonstrated in Table 1, combining the Grounding DINO Base and Large models with SAM-Huge yields significant performance improvements in the zero-shot setting of SGinW compared to previous unified open-set segmentation models such as UNINEXT [66] and OpenSeeD [79]. By incorporating HQ-SAM [24], which is capable of generating masks of higher quality than SAM, Grounded-HQ-SAM achieves a further performance improvement on SGinW.

5 Conclusion and Prospects

The strengths of our proposed Grounded SAM and its extensions, which assemble diverse expert models to accomplish various vision tasks, can be summarized as follows. First, the capability boundaries of the models can be seamlessly expanded by assembling various expert models. Previously, we could do n tasks with n models. Now, considering all possible model combinations, we can do up to 2^n - 1 tasks with n expert models (one task for each non-empty subset of the n models). We can decouple a complex task into several sub-tasks that are solved by currently available expert models. Second, the model assembling pipeline is more explainable by decomposing a task into several sub-tasks. We can observe the output of each step to obtain the reasoning process of the final results. Finally, by combining various expert models, we can investigate new areas of research and applications, potentially leading to innovative results and technological advances.

Prospects: A significant prospect of our methodology entails establishing a closed loop between annotation data and model training. Through the combination of expert models, substantial annotation costs can be saved. Moreover, the inclusion of human annotators at different stages facilitates the filtering or fine-tuning of inaccurate model predictions, thereby enhancing the quality of model annotations. The annotated data is then continually utilized to further train and improve the models. Another potential application of our method is to combine it with Large Language Models (LLMs). Given that our assembled models can perform almost any computer vision (CV) task with various input and output modalities, especially language, it becomes straightforward for LLMs to invoke our API via language prompts to effectively execute CV tasks. Last but not least, the method can be used to generate new datasets bridging any pair of modalities, especially when combined with generation models.

6 Contributions and Acknowledgments

We would like to express our deepest gratitude to multiple persons from the research community for their substantial support in the Grounded SAM project. We have listed the main participating roles in the Grounded SAM Project below. Within each role, contributions are equal and are listed in a randomized order. Ordering within each role does not indicate the ordering of the contributions.

Leads

Tianhe Ren, Co-Lead, Grounded SAM & Grounded-SAM-SD pipeline.

Shilong Liu, Co-Lead, Grounded SAM pipeline and online demo.

Ailing Zeng, Co-Lead, Grounded-SAM-OSX pipeline and demo.

Jin Ling, Co-Lead, Grounded-SAM-OSX pipeline and demo.

He Cao, Co-Lead, Grounded-SAM-SD pipeline and Interactive SAM Editing pipeline.

Kunchang Li, Co-Lead, BLIP-Grounded-SAM pipeline and ChatBot.

Jiayu Chen, Co-Lead, Grounded SAM modelscope demo support and code optimization.

Xinyu Huang, Co-Lead, RAM-Grounded-SAM demo support.

Feng Yan, Co-Lead, Grounded SAM with VISAM tracking demo.

Yukang Chen, Co-Lead, 3D-Box via Segment Anything.

Core Contributors

Zhaoyang Zeng

Hao Zhang

Feng Li

Jie Yang

Hongyang Li

Qing Jiang

Chenxi Whitehouse

Zhenxuan Wang

Overall Technical Leads

Lei Zhang

References

  • [1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended Latent Diffusion, Jun 2022.
  • [2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended Diffusion for Text-driven Editing of Natural Images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Sep 2022.
  • [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023.
  • [4] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, et al. SMPLer-X: Scaling up expressive human pose and shape estimation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • [5] Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, and Qifeng Chen. Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and gesture generation. arXiv preprint arXiv:2401.04747, 2024.
  • [6] Ling-Hao Chen, Jiawei Zhang, Yewen Li, Yiren Pang, Xiaobo Xia, and Tongliang Liu. HumanMAC: Masked Motion Completion for Human Motion Prediction. 2023.
  • [7] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A Language Modeling Framework for Object Detection. arXiv preprint arXiv:2109.10852, 2021.
  • [8] Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, and Alexander G. Schwing. Mask2Former for Video Instance Segmentation. 2022.
  • [9] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. 2021.
  • [10] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking Anything with Decoupled Video Segmentation. In ICCV, 2023.
  • [11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023.
  • [13] Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
  • [14] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
  • [15] Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. ChatGPT is not all you need. A State of the Art Review of large Generative AI models. arXiv preprint arXiv:2301.04655, 2023.
  • [16] Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and Liujuan Cao. You Only Segment Once: Towards Real-Time Panoptic Segmentation, 2023.
  • [17] Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, and Lei Zhang. Open-Set Image Tagging with Multi-Grained Text Supervision, 2023.
  • [18] Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2Text: Guiding Vision-Language Model via Image Tagging, 2023.
  • [19] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. DETRs with Hybrid Matching. arXiv preprint arXiv:2207.13080, 2022.
  • [20] Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, and Lei Zhang. T-Rex: Counting by Visual Prompting, 2023.
  • [21] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023.
  • [22] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. HumanSD: A native skeleton-guided diffusion model for human image generation. 2023.
  • [23] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for Text-to-Image Synthesis.
  • [24] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment Anything in High Quality. arXiv:2306.01567, 2023.
  • [25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
  • [26] Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, and Jianfeng Gao. Visual In-Context Prompting, 2023.
  • [27] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Computer Vision and Pattern Recognition (CVPR), 2022.
  • [28] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and Recognize Anything at Any Granularity, 2023.
  • [29] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2023.
  • [30] Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6684–6693, October 2023.
  • [31] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [32] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • [33] Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [35] Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, and Chunyuan Li. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents, 2023.
  • [36] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In International Conference on Learning Representations, 2022.
  • [37] Shilong Liu, Yaoyuan Liang, Feng Li, Shijia Huang, Hao Zhang, Hang Su, Jun Zhu, and Lei Zhang. DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, 2022.
  • [38] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
  • [39] Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, and Heung-Yeung Shum. HumanTOMATO: Text-aligned Whole-body Motion Generation. arXiv preprint arXiv:2310.12978, 2023.
  • [40] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models, 2023.
  • [41] Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Chenglin Wu, Cheng Deng, and Rongrong Ji. Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation, 2020.
  • [42] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image Synthesis and Editing with Stochastic Differential Equations, Aug 2021.
  • [43] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for Fast Training Convergence. arXiv preprint arXiv:2108.06152, 2021.
  • [44] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, Feb 2023.
  • [45] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.
  • [46] OpenAI. GPT-4 Technical Report, 2023.
  • [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • [48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents.
  • [49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
  • [50] Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, et al. detrex: Benchmarking Detection Transformers. arXiv preprint arXiv:2306.07265, 2023.
  • [51] Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, and Lei Zhang. A Strong and Reproducible Object Detector with Only Public Datasets, 2023.
  • [52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Aug 2022.
  • [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.
  • [55] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023.
  • [56] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust Large Mask Inpainting with Fourier Convolutions. arXiv preprint arXiv:2109.07161, 2021.
  • [57] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • [58] Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3Det: Vast Vocabulary Visual Detection Dataset. arXiv preprint arXiv:2304.03752, 2023.
  • [59] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In ICML, 2022.
  • [60] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual Expert for Pretrained Language Models, 2023.
  • [61] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
  • [62] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671, 2023.
  • [63] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, et al. EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything. arXiv preprint arXiv:2312.00863, 2023.
  • [64] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. arXiv preprint arXiv:2303.04803, 2023.
  • [65] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side Adapter Network for Open-Vocabulary Semantic Segmentation, 2023.
  • [66] Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Universal Instance Perception as Object Discovery and Retrieval. In CVPR, 2023.
  • [67] Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking, 2023.
  • [68] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
  • [69] Jie Yang, Bingliang Li, Fengyu Yang, Ailing Zeng, Lei Zhang, and Ruimao Zhang. Boosting human-object interaction detection with text-to-image diffusion model. arXiv preprint arXiv:2305.12252, 2023.
  • [70] Jie Yang, Chaoqun Wang, Zhen Li, Junle Wang, and Ruimao Zhang. Semantic human parsing via scalable semantic transfer over multiple label domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19424–19433, 2023.
  • [71] Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, and Lei Zhang. Neural Interactive Keypoint Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15122–15132, 2023.
  • [72] Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, and Lei Zhang. Explicit box detection unifies end-to-end multi-person pose estimation. In International Conference on Learning Representations, 2023.
  • [73] Jie Yang, Ailing Zeng, Ruimao Zhang, and Lei Zhang. Unipose: Detecting any keypoints. arXiv preprint arXiv:2310.08530, 2023.
  • [74] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effective whole-body pose estimation with two-stages distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4210–4220, 2023.
  • [75] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-Augmented Multimodal Language Modeling.
  • [76] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. arXiv preprint arXiv:2306.14289, 2023.
  • [77] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, 2022.
  • [78] Hao Zhang, Feng Li, Huaizhe Xu, Shijia Huang, Shilong Liu, Lionel M Ni, and Lei Zhang. MP-Former: Mask-Piloted Transformer for Image Segmentation. arXiv preprint arXiv:2303.07336, 2023.
  • [79] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A Simple Framework for Open-Vocabulary Segmentation and Detection. arXiv preprint arXiv:2303.08131, 2023.
  • [80] Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, and Jianwei Yang. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models, 2023.
  • [81] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying Localization and Vision-Language Understanding. arXiv preprint arXiv:2206.05836, 2022.
  • [82] Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models.
  • [83] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize Anything: A Strong Image Tagging Model. arXiv preprint arXiv:2306.03514, 2023.
  • [84] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. 2022.
  • [85] Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast Segment Anything, 2023.
  • [86] Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. SeqTR: A Simple Yet Universal Network for Visual Grounding, page 598–615. Springer Nature Switzerland, 2022.
  • [87] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR 2021: The Ninth International Conference on Learning Representations, 2021.
  • [88] Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee*, and Jianfeng Gao*. Generalized Decoding for Pixel, Image and Language. 2022.