
Gear: Enable Efficient Container Storage and Deployment with a New Image Format

Hao Fan*, Shengwei Bian*, Song Wu*, Song Jiang†, Shadi Ibrahim§, and Hai Jin*
*National Engineering Research Center for Big Data Technology and System
*Services Computing Technology and System Lab, Cluster and Grid Computing Lab
*School of Computer Science and Technology, Huazhong University of Science and Technology, China
†Department of Computer Science and Engineering, University of Texas at Arlington, U.S.
§Inria, Univ. Rennes, CNRS, IRISA, France
{fanh, swbian, wusong, hjin}@hust.edu.cn, song.jiang@uta.edu, shadi.ibrahim@inria.fr

Abstract

Containers have been widely used in various cloud platforms as they enable agile and elastic application deployment through their process-based virtualization and layered image system. However, different layers of a container image may contain substantial duplicate and unnecessary data, which slows down deployment due to long image download times and increases the burden on the image registry. To accelerate deployment and reduce the size of the registry, we propose a new image format, named Gear image, that consists of two parts: a Gear index describing the structure of the image's file system, and a set of files that are required when running an application. The Gear index is represented as a single-layer image compatible with the existing deployment framework. Containers can be launched by pulling a Gear index and retrieving, on demand, the files pointed to by the index. Furthermore, the Gear image enables a file-level sharing mechanism, which helps remove duplicate data in the registry and avoids repeated downloading of identical files by a client. We implement a prototype of the container framework, named Gear, supporting the new image format. Evaluation shows that Gear saves 54% of storage capacity in the registry, speeds up container startup by up to 5×, and reduces bandwidth demands by 84%.

Index Terms: container, image format, deployment time, registry

I. Introduction

Container-based virtualization has been widely used to enable isolated execution environments for applications [1-3]. This is due to its advantages of fast startup and low resource consumption [4, 5]. Take the most popular container framework, Docker [6], as an example. Docker packages an application and its dependencies into a self-contained file system named an image. Users can upload (push) their images to a registry, such as Docker Hub [7], for storage and sharing, or download (pull) required images from the registry to start container instances. In this way, a user can conveniently store and deploy containers. However, the current image format does not support efficient image deployment and storage. On the one hand, long cold-start latency, which is a critical challenge for emerging computing paradigms such as serverless computing [8], is mainly caused by the image downloading process [9]. On the other hand, the surge in the number of images puts high pressure on the registry in terms of bandwidth and storage capacity.
There are two reasons why the image format cannot support efficient image storage and deployment. (1) Regarding container storage, there is substantial redundant data between different layers of images, which results in a large waste of storage space [10, 11]. Although layered images allow layer-level deduplication to reduce storage footprint, there is still substantial data redundancy that cannot be detected and removed at this coarse deduplication granularity. (2) Regarding container deployment, an image often contains substantial unused data when launching an instance, which results in slow deployment. In addition, images cannot share files among each other and identical files are downloaded repeatedly.
Many works have focused on accelerating container deployment and saving space in the registry. A general approach to accelerating container deployment is to read data on demand through remote images [12-15]. However, these works lack either compatibility or flexibility, because they need to adhere to a specific file system [12, 13] or build a fixed-size virtual block device for each container [14] to enable on-demand downloading. Besides, they require significant modifications to the container I/O stack, which prevents these methods from being widely applied. As for the registry, previous works [10, 16, 17] mainly use deduplication to save space. However, these works neither reduce bandwidth demands nor accelerate the deployment of a container, because an entire image has to be reconstructed and downloaded. As the image format dictates how containers are stored and deployed, we design a new image format that is compatible with the current Docker framework to speed up container deployment under the existing container I/O stack and save space in the registry. On the one hand, only required data are downloaded to deploy containers in the new format [9, 18-20]. On the other hand, the new image allows file-level sharing among images [11, 21] to eliminate duplicate data in the registry and to further reduce the data downloaded by clients. To the best of our knowledge, this is the first work on improving both the storage and deployment efficiency of containers while remaining compatible with the current Docker framework.
In particular, the proposed image format, named Gear image, introduces an index structure to support on-demand downloading of selected image data. A Gear image consists of two components: a Gear index and a set of Gear files. The Gear index retains the directory structure of the corresponding Docker image. The actual files in the Docker image are taken out and stored separately as Gear files. At the place in the index where an entry for a regular file would be stored, we record the file's MD5 hash value instead. The regular file is then identified by its hash value. This decoupling of the index structure from regular files enables file-level sharing to save storage space. Starting a new container instance can be very fast, because only a very small Gear index (usually less than 1 MB) needs to be retrieved before starting the container, and the required Gear files are retrieved on demand.
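To make the split between index and files concrete, the following is a minimal sketch, not the Gear implementation. It assumes an image is given as an in-memory mapping from POSIX-style paths to file contents. The function name `build_gear_index` and the tuple layout of index entries are hypothetical choices for illustration. Real images also carry metadata such as permissions and symbolic links, which this sketch ignores.

```python
import hashlib
import posixpath


def build_gear_index(files):
    """Split an image into a Gear-style index and a content-addressed file store.

    `files` maps POSIX-style relative paths to file contents (bytes).
    Returns (index, store): `index` maps each path to ("dir", None) or
    ("file", md5_hex); `store` maps md5_hex to the file contents.
    """
    index = {}
    store = {}
    for path, data in sorted(files.items()):
        # Record every parent directory so the index keeps the tree structure.
        parent = posixpath.dirname(path)
        while parent:
            index[parent] = ("dir", None)
            parent = posixpath.dirname(parent)
        # The index entry holds only the MD5 hash; the content lives in the store.
        digest = hashlib.md5(data).hexdigest()
        store.setdefault(digest, data)  # identical files are stored once
        index[path] = ("file", digest)
    return index, store
```

Two files with identical contents at different paths produce two index entries that point to the same stored object, which is the basis of file-level sharing.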
In summary, we make the following contributions:
  • We analyze the impact of image format on container deployment and the effectiveness of file-level sharing in both the registry and the client.
  • We propose a new image format (Gear image) that can run on top of the existing container I/O stack, and design the Gear framework to store and deploy Gear images. Gear ensures that files are shared among images and that required files not found in local images are downloaded on demand.
  • We integrate Gear into Docker, implement a system that supports the storage and deployment of Gear images compatible with the Docker framework, and evaluate the efficiency of container storage and deployment. Experimental results show that Gear reduces the storage footprint of the registry by 54%, reduces bandwidth demands by 84%, and speeds up container startup by 1.6× and 5× under high and low bandwidths, respectively.
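The on-demand, shared download behavior named in the second contribution can be illustrated with a small sketch. It is an assumption-laden illustration rather than Gear's actual client: the class `GearClient`, its `fetch` callback (standing in for a registry request), and the index entry layout are all hypothetical.

```python
class GearClient:
    """Toy client that fetches files lazily and shares them across images."""

    def __init__(self, fetch):
        self.fetch = fetch    # callable: digest -> bytes (stands in for the registry)
        self.cache = {}       # digest -> bytes, shared by all images on this client
        self.downloads = 0    # number of files actually fetched

    def read(self, index, path):
        """Return the contents of `path`, downloading it only if not cached."""
        kind, digest = index[path]
        if kind != "file":
            raise IsADirectoryError(path)
        if digest not in self.cache:
            self.cache[digest] = self.fetch(digest)
            self.downloads += 1
        return self.cache[digest]
```

Because the cache is keyed by content hash rather than by image, a file already fetched for one image is served locally when a second image references the same content.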

The rest of this paper is organized as follows. Section II describes how Docker stores and deploys containers, and why the Docker image format is not efficient. Section III describes the design of Gear. Section IV describes the implementation of Gear. Section V presents an extensive evaluation of a Gear prototype implementation. Section VI discusses related work. Section VII concludes our work.

II. Background and Motivation

In this section, we describe the Docker image format (§II-A), introduce how Docker stores (§II-B) and deploys (§II-C) containers, and explain why the Docker image format is inefficient and what the consequences are.

A. Docker Images

A Docker image is a read-only template for creating a container. It is composed of a series of layers that are stacked together. Each layer is identified by its digest, the SHA256 hash value of the layer's content. When launching a container, a writable layer is created at the top of the image layer stack, and all changes (e.g., creation or modification of files) are preserved in the writable layer, which is achieved with the Copy-on-Write (COW) mechanism. Using the "commit" command, a writable layer can be turned into a read-only layer to produce a new image from the container instance. Multiple images may have common layers. Figure 1(a) shows a "Debian:buster-slim" image that has only one layer. Another image ("Nginx:1.17") is built on top of the "Debian:buster-slim" image and has two additional read-only layers. Because "Nginx:1.17" has been used to launch a container, it also has a writable layer on top.

Fig. 1. Docker image format and its storage form under Overlay2. (a) Docker image format: two images share the bottom layer. (b) Layers under Overlay2: the layers of an image are stacked with COW, and each image gets its own writable layer when its instance is launched.
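The layer stack and its copy-on-write behavior can be sketched as follows. This is a simplified model under stated assumptions, not Docker's implementation: layers are modeled as dictionaries from paths to contents, file deletion is ignored, and the class name `Container` and its methods are hypothetical.

```python
class Container:
    """Toy model of a container: read-only image layers plus one writable layer."""

    def __init__(self, image_layers):
        # image_layers: list of dicts (path -> bytes), bottom layer first.
        # These layers are shared and never modified.
        self.image_layers = image_layers
        self.writable = {}

    def read(self, path):
        # Search from the writable layer downward; the topmost copy wins.
        for layer in [self.writable] + self.image_layers[::-1]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write: changes land only in the writable layer.
        self.writable[path] = data

    def commit(self):
        """Freeze the writable layer into a new read-only layer of a new image."""
        return self.image_layers + [dict(self.writable)]
```

In this model, two images built on the same base hold a reference to the same bottom-layer object, mirroring the shared layer in Figure 1(a).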

B. Storage of Docker Images

Docker images are stored in a centralized registry. The registry can be either public (e.g., Docker Hub [7]) or private. In the registry, layers are stored as compressed tarballs to reduce the registry's storage space. In addition to the layers, a JSON file named the manifest is stored in the registry together with the image. It records configuration information about the image, the most important of which is the digests of the image's layers. With the layered image format, different images can be deduplicated at the granularity of a layer. Layer-level deduplication is carried out by comparing the digests of the layers to be stored with the digests of the layers already in the registry. Only unique layers are sent to and stored in the registry. By combining layer-level deduplication and in-layer compression, Docker can reduce the storage footprint of the registry by 3.54× [21].
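Layer-level deduplication by digest comparison can be sketched in a few lines. This is an illustrative model, not the Docker registry API: the class `Registry` and its `push` method are hypothetical, layers are plain byte strings rather than tar archives, and hashing the uncompressed bytes is a simplification made for this sketch.

```python
import gzip
import hashlib


class Registry:
    """Toy registry that stores each unique layer once, in compressed form."""

    def __init__(self):
        self.blobs = {}      # digest -> compressed layer bytes
        self.manifests = {}  # image name -> list of layer digests, bottom first

    def push(self, name, layers):
        """Store an image; return how many layers actually had to be uploaded."""
        digests = []
        uploaded = 0
        for layer in layers:
            digest = "sha256:" + hashlib.sha256(layer).hexdigest()
            if digest not in self.blobs:  # skip layers the registry already has
                self.blobs[digest] = gzip.compress(layer)
                uploaded += 1
            digests.append(digest)
        self.manifests[name] = digests
        return uploaded
```

Pushing a second image that shares its bottom layer with an existing image uploads only the layers that differ. Two layers that differ in a single file, however, are stored in full twice, which is the coarse granularity discussed in the Introduction.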

C. Deployment of Docker Containers

There are two steps for deploying (i.e., launching) a container instance, which are to download the image and then to use the image to start the instance. In the first step, the Docker daemon at a client retrieves the manifest of the target image and then downloads layers that are not yet present at the local storage. The graph driver is responsible for saving the image in the local storage and making image layers locally available for reuse. In the second step, the Docker daemon configures and launches the container instance. The graph driver is responsible for providing a complete and correct root file system for the container.
TABLE I
The workloads
Linux Distro: alpine, amazonlinux, busybox, centos, debian, ubuntu
Language: golang, java, openjdk, php, python, ruby
Database: cassandra, couchbase, crate, elasticsearch, influxdb, mariadb, memcached, mongo, mysql, postgres, redis
Web Component: consul, eclipse-mosquitto, haproxy, httpd, kibana, kong, nginx, node, telegraf, tomcat, traefik
Application Platform: drupal, ghost, jenkins, nextcloud, rabbitmq, solr, sonarqube, wordpress
Others: chronograf, docker, gradle, hello-world, logstash, maven, registry, vault

In particular, the graph driver determines how to store the layers and how to construct the root file system. The graph driver is implemented based on the underlying file system or the format of the image layers. Currently, the most widely used and officially recommended graph driver is Overlay2 [22]. Figure 1(b) shows the storage format of the two images shown in Figure 1(a) and of two containers launched from them using Overlay2. Each layer is represented as a directory, and each directory has two core components: the diff/ directory and the lower file. All data of a layer are stored in the diff/ directory in their original form (e.g., files, directories, symbolic links). The lower file holds the information of all the parent layers of the current layer. The writable layer of the container has the same structure as the normal layers. When launching containers, Overlay2 mounts all read-only and writable layers to a mount point. Through the mount point, containers can access the root file system.
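The way the layer directories are combined at mount time can be sketched by building the option string of a Linux overlay mount. The `lowerdir`, `upperdir`, and `workdir` options belong to the kernel's overlay file system; the function name, the example paths, and the `work` directory name are assumptions for illustration and are not taken from the paper.

```python
import posixpath


def overlay_mount_options(layer_dirs, container_dir):
    """Build overlay mount options for a container's root file system.

    `layer_dirs` lists the read-only image layer directories, bottom layer
    first. `container_dir` is the directory of the writable layer.
    """
    # overlay expects lower directories from the topmost layer to the bottom one.
    lower = ":".join(posixpath.join(d, "diff") for d in reversed(layer_dirs))
    upper = posixpath.join(container_dir, "diff")  # receives all writes
    work = posixpath.join(container_dir, "work")   # scratch space required by overlay
    return f"lowerdir={lower},upperdir={upper},workdir={work}"
```

The resulting string is what would be passed to a command of the form `mount -t overlay overlay -o <options> <mountpoint>`, after which the mount point presents the merged view of all layers.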

D. Motivation

On-demand remote image format. Currently, when launching a Docker container at a server, the entire image has to be downloaded beforehand. The time spent on this potentially lengthy downloading process may be unacceptable for container deployment. For example, in CI/CD [23] and DevOps [24] scenarios, container versions can be updated frequently [25, 26], and old images have to be replaced quickly by new images for security and performance. Accordingly, researchers have proposed remote image formats [12-14] that download only a small portion of the data (about 6.4%-33.3%) on demand. However, existing remote image formats (i.e., file system-based and block-based) show either weak compatibility or poor flexibility. A file system-based remote image usually adheres to a specific file system [12, 13] (i.e., NFS or CIFS) to enable on-demand data access. This approach makes POSIX compliance an unnecessary constraint for non-POSIX workloads, such as serverless applications. Furthermore, some desirable features of popular file systems (e.g., snapshots and compression) are missing from NFS and CIFS [14]. A block-based remote image needs to build a virtual block device for each container to enable layered images. This means that containers cannot share data with each other, potentially reducing efficiency, and that the size of the virtual block device cannot be adjusted according to the actual image size. Furthermore, existing remote image formats require significant modifications to the I/O stack, i.e., designing new storage drivers, adhering to block devices, or specifying file systems. This holds back the wide adoption of these methods. Accordingly, we design a remote image format based on Docker's default and preferred storage driver, Overlay2, which can be built on various file systems (e.g., EXT4, ZFS, BtrFS) for high compatibility and flexibility.
Management granularity of remote image format. We manage remote images at the file level for the simplicity and applicability of our design, because managing an image in chunks requires introducing a virtual block device [14] or a specific file system [12, 13]. Furthermore, the management granularity of a remote image not only affects the downloading efficiency of the image, but also determines the deduplication ratio in the registry. With the popularity of containers, registries have to manage a growing number of images. Docker Hub alone stores two million images with a total size of 1 PB [27]. As a result, organizations spend an increasing share of their storage and networking infrastructure on operating image registries.

TABLE II
Storage usage and number of objects under different deduplication granularities

| | No deduplication | Layer-level | File-level | Chunk-level |
| :--- | :--- | :--- | :--- | :--- |
| Storage Usage | 370 GB | 98 GB | 47 GB | 43 GB |
| Object Number | 971 | 5,670 | 639,585 | 10,478,675 |
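As a quick check on Table II, the relative space saving of each deduplication granularity can be computed from the storage usage row. The snippet below is plain arithmetic on the table's values; the variable and function names are illustrative.

```python
# Storage usage in GB, copied from Table II.
usage_gb = {"layer": 98, "file": 47, "chunk": 43}
baseline_gb = 370  # no deduplication


def saving(before, after):
    """Fraction of space saved when usage drops from `before` to `after`."""
    return 1 - after / before


for level, used in usage_gb.items():
    print(f"{level}-level: {saving(baseline_gb, used):.1%} saved")
```

The savings come out to roughly 73.5%, 87.3%, and 88.4% for layer-level, file-level, and chunk-level deduplication, so moving from files to chunks gains only about one percentage point.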
Deduplication is required to reduce duplicate data among images [28]. To observe the deduplication ratio at different deduplication granularities, we create a private registry by downloading the images listed in Table I. The registry unpacks the layers, removes duplicate data at different granularities (image layer, file, or chunk with a chunk size of 128 KB), and compresses data at the same granularities. Table II shows the results for the various deduplication granularities. Compared to the images without any deduplication, the space demand under layer-level, file-level, and chunk-level deduplication is reduced by 74%, 87%, and 88%, respectively. As the number of images increases, duplicate data grows dramatically in the registry. It is reported that when the total size of unpacked images is 167 TB (unpacked from 47 TB of compressed images), only 3.2% of the files remain after file-level deduplication, occupying only 24 TB in total [11]. While file-level and chunk-level deduplication show similar space savings, chunk-level deduplication can cause a dramatic increase in the number of unique objects, i.e., the number of unique objects of chunk-

Fig. 2. Redundancy between necessary data among different images in a common image series