
Git’s database internals I: packed object store

This blog series will examine Git’s internals to help make your engineering system more efficient. Part I discusses how Git stores its data in packfiles using custom compression techniques.

Developers collaborate using Git. It is the medium that allows us to share code, work independently on our own machines, and then finally combine our efforts into a common understanding. For many, this is done by following some well-worn steps and sticking to that pattern. This works in the vast majority of use cases, but what happens when we need to do something new with Git? Knowing more about Git’s internals helps when exploring those new solutions.

In this five-part blog post series, we will illuminate Git’s internals to help you collaborate via Git, especially at scale.

It might also be interesting because you love data structures and algorithms. That’s what drives me to be interested in and contribute to Git.

Git’s architecture follows patterns that may be familiar to developers, except the patterns come from a different context. Almost all applications use a database to persist and query data. When building software based on an application database system, it’s easy to get started without knowing any of the internals. However, when it’s time to scale your solution, you’ll have to dive into more advanced features like indexes and query plans.

The core idea I want to convey is this:

Git is the distributed database at the core of your engineering system.

Here are some very basic concepts that Git shares with application databases:

  1. Data is persisted to disk.
  2. Queries allow users to request information based on that data.
  3. The data storage is optimized for these queries.
  4. The query algorithms are optimized to take advantage of these structures.
  5. Distributed nodes need to synchronize and agree on some common state.

While these concepts are common to all databases, Git is particularly specialized. Git was built to store plain-text source code files, where most changes are small enough to read in a single sitting, even if the codebase contains millions of lines. People use Git to store many other kinds of data, such as documentation, web pages, or configuration files.

While many application databases use long-running processes with significant amounts of in-memory caching, Git uses short-lived processes and uses the filesystem to persist data between executions. Git’s data types are more restrictive than a typical application database. These aspects lead to very specialized data storage and access patterns.

Today, let’s dig into the basics of what data Git stores and how it accesses that data. Specifically, we will learn about Git’s object store and how it uses packfiles to compress data that would otherwise contain redundant information.

Git’s object store

The most fundamental concepts in Git are Git objects. These are the “atoms” of your Git repository. They combine in interesting ways to create the larger structure. Let’s start with a quick overview of the important Git objects. Feel free to skip ahead if you know this, or you can dig deep into Git’s object model if you’re interested.

In your local Git repositories, your data is stored in the .git directory. Inside, there is a .git/objects directory that contains your Git objects.

$ ls .git/objects/
01  34  9a  df  info    pack

$ ls .git/objects/01/
12010547a8990673acf08117134bdc181bd735

$ ls .git/objects/pack/
multi-pack-index
pack-7017e6ce443801478cf19006fc5499ba1c4d2960.idx
pack-7017e6ce443801478cf19006fc5499ba1c4d2960.pack
pack-9f9258a8ffe4187f08a93bcba47784e07985d999.idx
pack-9f9258a8ffe4187f08a93bcba47784e07985d999.pack

The .git/objects directory is called the object store. It is a content-addressable data store, meaning that we can retrieve the contents of an object by providing a hash of those contents.

In this way, the object store is like a database table with two columns: the object ID and the object content. The object ID is the hash of the object content and acts like a primary key.

Table with columns labeled Object ID and Object Data

Upon first encountering content-addressable data stores, it is natural to ask, “How can we access an object by hash if we don’t already know its content?” We first need to have some starting points to navigate into the object store, and from there we can follow links between objects that exist in the structure of the object data.

First, Git has references that allow you to create named pointers to keys in the object database. The reference store mainly exists in the .git/refs/ directory and has its own advanced way of storing and querying references efficiently. For now, think of the reference store as a two-column table with columns for the reference name and the object ID. In the reference store, the reference name is the primary key.

Image showing how the Object ID table relates to the Object Store

Now that we have a reference store, we can navigate into the object store from some human-readable names. In addition to specifying a reference by its full name, such as refs/tags/v2.37.0, we can sometimes use short names, such as v2.37.0 where appropriate.

In the Git codebase, we can start from the v2.37.0 reference and follow the links to each kind of Git object.

  • The refs/tags/v2.37.0 reference points to an annotated tag object. An annotated tag contains a reference to another object (by object ID) and a plain-text message.
  • That tag’s object references a commit object. A commit is a snapshot of the worktree at a point in time, along with connections to previous versions. It contains links to parent commits, a root tree, as well as metadata, such as commit time and commit message.
  • That commit’s root tree references a tree object. A tree is similar to a directory in that it contains entries that link a path name to an object ID.
  • From that tree, we can follow the entry for README.md to find a blob object. Blobs store file contents. They get their name from the tree that points to them.

Image displaying hops through the object database in response to a user request.

From this example, we navigated from a ref to the contents of the README.md file at that position in the history. This very simple request of “give me the README at this tag” required several hops through the object database, linking an object ID to that object’s contents.
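To make those hops concrete, here is a minimal sketch that models the reference store and the object store as two Python dictionaries and follows the links from a tag name to a file. The IDs ("tag-1", "commit-1", and so on), the messages, and the helper name read_file_at_ref are made-up placeholders for illustration, and the sketch only handles a path in the root tree.

```python
# Toy "tables". The IDs are made-up placeholders, not real hashes.
REFS = {"refs/tags/v2.37.0": "tag-1"}
OBJECTS = {
    "tag-1": {"type": "tag", "object": "commit-1", "message": "release tag"},
    "commit-1": {"type": "commit", "tree": "tree-1", "parents": ["commit-0"],
                 "message": "release commit"},
    "tree-1": {"type": "tree", "entries": {"README.md": "blob-1"}},
    "blob-1": {"type": "blob", "data": b"README contents\n"},
}


def read_file_at_ref(refs, objects, ref_name, path):
    """Follow ref -> tag -> commit -> tree -> blob.

    Returns the blob data plus the list of (object ID, type) hops taken.
    """
    hops = []

    def load(oid):
        obj = objects[oid]          # one lookup by primary key
        hops.append((oid, obj["type"]))
        return obj

    obj = load(refs[ref_name])      # the ref table gives the starting point
    while obj["type"] == "tag":     # peel annotated tags
        obj = load(obj["object"])
    if obj["type"] != "commit":
        raise ValueError("ref does not lead to a commit")
    tree = load(obj["tree"])        # the commit links to its root tree
    blob = load(tree["entries"][path])  # the tree entry links a name to a blob
    return blob["data"], hops


if __name__ == "__main__":
    data, hops = read_file_at_ref(REFS, OBJECTS, "refs/tags/v2.37.0", "README.md")
    for oid, obj_type in hops:
        print(obj_type, oid)
    print(data.decode(), end="")
```

Every call to load is one "object ID to object contents" lookup, which is the operation the rest of this post is about making fast.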

These hops are critical to many interesting Git algorithms. We will explore how the graph structure of the object store is used by Git’s algorithms in parts two through four. For now, let’s focus on the critical operation of linking an object ID to the object contents.

Object store queries

To store and access information in an application database, developers interact with the database using a query language such as SQL. Git has its own type of query language: the command-line interface. Git commands are how we interact with the Git object store. Since Git has its own structure, we do not get the full flexibility of a relational database. However, there are some parallels.

To select object contents by object ID, the git cat-file command will do the object lookup and provide the necessary information. We’ve already been using git cat-file -p to present “pretty” versions of the Git object data by object ID. The raw content is not always fit for human readers, with object IDs stored as raw hashes and not hexadecimal digits, among other things like null bytes. We can also use git cat-file -t to show the type of an object, which is discoverable from the initial few bytes of the object data.

To insert an object into the object store, we can write directly to a blob using git hash-object. This command takes file content and writes it into a blob in the object store. After the input is complete, Git reports the object ID of the written blob.

$ git hash-object -w --stdin
Hello, world!
af5626b4a114abcb82d63db7c8082c3c4756e51b

$ git cat-file -t af5626b4a114abcb82d63db7c8082c3c4756e51b
blob

$ git cat-file -p af5626b4a114abcb82d63db7c8082c3c4756e51b
Hello, world!
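
The object ID reported above is not arbitrary. As a rough sketch, assuming a repository that uses SHA-1 object IDs, the same value can be reproduced by hashing a short header followed by the content. The function name git_blob_id is my own; only hashlib from the Python standard library is used.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID for a blob holding `content`."""
    # The hashed payload is a small header ("blob <size>", then a null byte)
    # followed by the raw file content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Typing "Hello, world!" and pressing Enter sends a trailing newline.
    print(git_blob_id(b"Hello, world!\n"))
```

Because the ID depends only on the type and the bytes, writing the same content twice produces the same key, which is what makes the store content-addressable.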

More commonly, we not only add a file’s contents to the object store, but also prepare to create new commit and tree objects to reference that new content. The git add command hashes new changes in the worktree and stores their blobs in the object store, then writes the list of objects to a staging area known as the Git index. The git commit command takes those staged changes and creates trees pointing to all of the new blobs, then creates a new commit object pointing to the new root tree. Finally, git commit also updates the current branch to point to the new commit.

The figure below shows the process of creating several Git objects and finally updating a reference that happens when running git commit -a -m "Update README.md" when the only local edit is a change to the README.md file.

Image showing the process of creating several Git objects and updating references

We can do slightly more complicated queries based on object data. Using git log --pretty=format:<format-string>, we can make custom queries into the commits by pulling out “columns” such as the object ID and message, and even the committer and author names, emails, and dates. See the git log documentation for a full column list.

There are also some prebuilt formats ready for immediate use. For example, we can get a simple summary of a commit using git log --pretty=reference -1 <ref>. This query parses the commit at <ref> and provides the following information:

  • An abbreviated object ID.
  • The first sentence of the commit message.
  • The commit date in short form.

$ git log --pretty=reference -1 378b51993aa022c432b23b7f1bafd921b7c43835
378b51993aa0 (gc: simplify --cruft description, 2022-06-19)

Now that we’ve explored some of the queries we can make in Git, let’s dig into the actual storage of this data.

Compressed object storage: packfiles

Looking into the .git/objects directory again, we might see several directories with two-digit names. These directories then contain files with long hexadecimal names. These files are called loose objects, and the filename corresponds to the object ID of an object: the first two hexadecimal characters form the directory name while the rest form the filename. While the files themselves are compressed, there is not much interesting about querying these files, since Git relies on filesystem queries to satisfy most of these needs.
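As a small illustration of the loose object layout, the sketch below writes and reads an object in a scratch directory: the ID is split into a two-character directory and a 38-character filename, and the file holds the zlib-compressed header plus content. The function names are hypothetical, it assumes SHA-1 IDs, and it skips details a real implementation handles (such as writing through a temporary file).

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store an object as objects_dir/<first 2 hex chars>/<remaining 38>."""
    payload = b"%s %d\x00" % (obj_type.encode(), len(content)) + content
    oid = hashlib.sha1(payload).hexdigest()
    directory = os.path.join(objects_dir, oid[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, oid[2:]), "wb") as f:
        f.write(zlib.compress(payload))
    return oid


def read_loose_object(objects_dir: str, oid: str):
    """Look an object up by ID; the filesystem path is the whole 'index'."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        payload = zlib.decompress(f.read())
    header, _, content = payload.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content")
    return obj_type.decode(), content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        oid = write_loose_object(scratch, "blob", b"Hello, world!\n")
        print(oid[:2], oid[2:])
        print(read_loose_object(scratch, oid))
```

The lookup is a single path computation followed by a file open, which is why there is little to optimize here beyond what the filesystem already provides.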

However, it does not take many objects before it is infeasible to store an entire Git repository using only loose objects. Not only does it strain the filesystem to have so many files, it is also inefficient when storing many versions of the same text file. Thus, Git’s packed object store in the .git/objects/pack/ directory forms a more efficient way to store Git objects.

Packfiles and pack-indexes

Each *.pack file in .git/objects/pack/ is called a packfile. Packfiles store multiple objects in compressed forms. Not only is each object compressed individually, they can also be compressed against each other to take advantage of common data.

At its simplest, a packfile contains a concatenated list of objects. It only stores the object data, not the object ID. It is possible to read a packfile to find objects by object ID, but it requires decompressing and hashing each object to compare it to the input hash. Instead, each packfile is paired with a pack-index file ending with .idx. The pack-index file stores the list of object IDs in lexicographical order so a quick binary search is sufficient to discover if an object ID is in the packfile, then an offset value points to where the object’s data begins within the packfile. The pack-index operates like a query index that speeds up read queries that rely on the primary key (object ID).

One small optimization is that a fanout table of 256 entries provides boundaries within the full list of object IDs based on their first byte. This reduces the time spent by the binary search, specifically by focusing the search on a smaller number of memory pages. This works particularly well because object IDs are uniformly distributed so the fanout ranges are well-balanced.
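Here is a minimal in-memory sketch of that lookup: a sorted list of object IDs, a parallel list of offsets, and a 256-entry fanout table that narrows the binary search to IDs sharing the same first byte. It models the idea only and does not read the real .idx file format; the function names and the offsets in the demo are assumptions for illustration.

```python
import bisect
import hashlib


def build_fanout(sorted_ids):
    """fanout[b] = number of IDs whose first byte is <= b (256 entries)."""
    counts = [0] * 256
    for oid in sorted_ids:
        counts[oid[0]] += 1
    fanout, running = [], 0
    for count in counts:
        running += count
        fanout.append(running)
    return fanout


def lookup(sorted_ids, offsets, fanout, oid):
    """Return the packfile offset for `oid`, or None if it is not indexed."""
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0   # start of this first-byte range
    hi = fanout[first]                           # end of this first-byte range
    pos = bisect.bisect_left(sorted_ids, oid, lo, hi)
    if pos < hi and sorted_ids[pos] == oid:
        return offsets[pos]
    return None


if __name__ == "__main__":
    # Fake 20-byte IDs and fake offsets, just to exercise the search.
    ids = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
    offsets = [12 + 100 * i for i in range(len(ids))]
    fanout = build_fanout(ids)
    print(lookup(ids, offsets, fanout, ids[42]))
    print(lookup(ids, offsets, fanout, hashlib.sha1(b"missing").digest()))
```

With uniformly distributed IDs, each fanout range holds roughly 1/256 of the list, so the binary search starts about eight steps ahead of a search over the full list.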

If we have a number of packfiles, then we could ask each pack-index in sequence to look up the object. A further enhancement to packfiles is to put several pack-indexes together in a single multi-pack-index, which stores the same offset data plus which packfile the object is in.

Lookups and prefixes work the same as in pack-indexes, except now we can skip the linear scan over many packs. You can read more about the multi-pack-index file and how it helps scale monorepo maintenance at GitHub.
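The idea can be sketched with plain Python data structures: merge several per-pack indexes into one sorted list whose entries also record which pack holds the object, then answer lookups with a single binary search. The pack names, IDs, offsets, and function names below are invented for the example, and the real multi-pack-index file format is not modeled.

```python
import bisect


def build_multi_pack_index(pack_indexes):
    """Merge {pack_name: {object_id: offset}} into one sorted list of
    (object_id, pack_name, offset) tuples."""
    merged = {}
    for pack_name, index in pack_indexes.items():
        for oid, offset in index.items():
            # If an object appears in several packs, keep one location.
            merged.setdefault(oid, (pack_name, offset))
    return sorted((oid, pack, off) for oid, (pack, off) in merged.items())


def midx_lookup(midx, oid):
    """One binary search, regardless of how many packs were merged."""
    pos = bisect.bisect_left(midx, (oid,))
    if pos < len(midx) and midx[pos][0] == oid:
        return midx[pos][1], midx[pos][2]
    return None


if __name__ == "__main__":
    packs = {
        "pack-a": {"0a11": 12, "9f22": 340},
        "pack-b": {"3c33": 12, "9f22": 77},
    }
    midx = build_multi_pack_index(packs)
    print(midx_lookup(midx, "3c33"))
    print(midx_lookup(midx, "ffff"))
```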

Diffable object content

Packfiles also have a hyper-specialized version of row compression called deltification. Since read queries are only indexed by the object ID, we can perform extra compression on the object data part.

Git was built to store source code, which consists of plain-text files that are used as input to a compiler or interpreter to create applications. Git was also built to store many versions of this source code as it is changed by humans. This provides additional context about the kind of data typically stored in Git: diffable files with significant portions in common. If you’ve ever wondered why you shouldn’t store large binary files in Git repositories, this is the reason.

The field of software engineering has made it clear that it is difficult to understand applications in their entirety. Humans can grasp a very high-level view of an architecture and can parse small sections of code, but we cannot store enough information in our brains to grasp huge amounts of concrete code at once. You can read more about this in the excellent book, The Programmer’s Brain by Dr. Felienne Hermans.

Because of the limited size of our working memory, it is best to change code in small, well-documented iterations. This helps the code author, any code reviewers, and future developers looking at the code history. Between iterations, a significant majority of the code remains fixed while only small portions change. This allows Git to use difference algorithms to identify small diffs between the content of blob objects.

There are many ways to compute a difference between two blobs. Git has several difference algorithms implemented which can have drastically different results. Instead of focusing on unstructured differences, I want to focus on differences between structured object data. Specifically, tree objects usually change in small ways that are easy to compress.

Tree diffs

Git’s tree objects can also be compared using a difference algorithm that is aware of the structure of tree entries. Each tree entry stores a mode (think Unix file permissions), an object type, a name, and an object ID. Object IDs are for all intents and purposes random, but most edits will change a file without changing its mode, type, or name. Further, large trees are likely to have only a few entries change at a time.

For example, the tip commit at any major Git release only changes one file: the GIT-VERSION-GEN file. This means also that the root tree only has one entry different from the previous root tree:

$ git diff v2.37.0~1 v2.37.0
diff --git a/GIT-VERSION-GEN b/GIT-VERSION-GEN
index 120af376c1..b210b306b7 100755
--- a/GIT-VERSION-GEN
+++ b/GIT-VERSION-GEN
@@ -1,7 +1,7 @@
 #!/bin/sh

 GVF=GIT-VERSION-FILE
-DEF_VER=v2.37.0-rc2
+DEF_VER=v2.37.0

 LF='
 '

$ git cat-file -p v2.37.0~1^{tree} >old
$ git cat-file -p v2.37.0^{tree} >new

$ diff old new
13c13
< 100755 blob 120af376c147799e6c0069bac1f61709a0286cd6  GIT-VERSION-GEN
---
> 100755 blob b210b306b7554f28dc687d1c503517d2a5f87082  GIT-VERSION-GEN

Once we have an algorithm that can compute diffs for Git objects, the packfile format can take advantage of that.
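A structure-aware tree comparison like the one in this section can be sketched in a few lines. This version parses the text printed by git cat-file -p (mode, type, object ID, name) and compares entries by name; it is an illustration, not Git's own tree-diff code. The README.md entry and its ID of repeated 1s are made up to stand in for an unchanged file.

```python
def parse_tree(listing: str) -> dict:
    """Parse pretty-printed tree lines into {name: (mode, type, object_id)}."""
    entries = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        mode, obj_type, rest = line.split(None, 2)
        oid, name = rest.split(None, 1)
        entries[name] = (mode, obj_type, oid)
    return entries


def diff_trees(old: dict, new: dict) -> list:
    """Return (name, old_entry, new_entry) for every entry that differs."""
    changes = []
    for name in sorted(old.keys() | new.keys()):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes


if __name__ == "__main__":
    unchanged = "100644 blob " + "1" * 40 + "\tREADME.md\n"  # made-up entry
    old = parse_tree(
        "100755 blob 120af376c147799e6c0069bac1f61709a0286cd6\tGIT-VERSION-GEN\n"
        + unchanged
    )
    new = parse_tree(
        "100755 blob b210b306b7554f28dc687d1c503517d2a5f87082\tGIT-VERSION-GEN\n"
        + unchanged
    )
    for name, before, after in diff_trees(old, new):
        print(name, before[2][:10], "->", after[2][:10])
```

Only the entry whose object ID changed is reported, no matter how many entries the tree has.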

Delta compression

The packfile format begins with some simple header information, but then it contains Git object data concatenated together. Each object’s data starts with a type and a length. The type could be the object type, in which case the content in the packfile is the full object content (subject to DEFLATE compression). The object’s type could instead be an offset delta, in which case the data is based on the content of a previous object in the packfile.
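The post does not spell out how the type and length are encoded. The sketch below follows my understanding of Git's pack-format documentation: the first byte carries a continuation bit, a three-bit type, and the low four bits of the size, and each following byte adds seven more size bits. Treat the details as an assumption to verify against that documentation; the function name is hypothetical.

```python
# Type codes as I understand them from Git's pack-format documentation.
PACK_TYPES = {
    1: "commit",
    2: "tree",
    3: "blob",
    4: "tag",
    6: "offset delta",
    7: "reference delta",
}


def parse_entry_header(data: bytes, pos: int = 0):
    """Decode the type and size at the start of a packed object entry.

    Returns (type_code, size, position_after_header).
    """
    byte = data[pos]
    pos += 1
    type_code = (byte >> 4) & 0x07   # bits 6-4: the type
    size = byte & 0x0F               # bits 3-0: lowest four size bits
    shift = 4
    while byte & 0x80:               # bit 7: more size bytes follow
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return type_code, size, pos


if __name__ == "__main__":
    # A hand-encoded header for a tree whose full size is 19388 bytes.
    type_code, size, end = parse_entry_header(bytes([0xAC, 0xBB, 0x09]))
    print(PACK_TYPES[type_code], size, end)
```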

An offset delta begins with an integer offset value pointing to the relative position of a previous object in the packfile. The remaining data specifies a list of instructions which either instruct how to copy data from the base object or to write new data chunks.

Thinking back to our example of the root tree for Git’s v2.37.0 tag, we can store that tree as an offset delta to the previous root tree by copying the tree up until the object ID 120af37..., then write the new object ID b210b30..., and finally copy the rest of the previous root tree.
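That copy, write, copy sequence can be sketched with a simplified instruction list. The real delta encoding is a compact binary format; here each instruction is a Python tuple, and the surrounding tree entries are made-up placeholder text. The example also uses hexadecimal IDs for readability, whereas real tree objects store raw hash bytes.

```python
def apply_delta(base: bytes, instructions) -> bytes:
    """Rebuild an object from a base object and copy/insert instructions."""
    out = bytearray()
    for instruction in instructions:
        if instruction[0] == "copy":
            _, offset, length = instruction
            out += base[offset:offset + length]   # reuse bytes from the base
        elif instruction[0] == "insert":
            out += instruction[1]                 # literal new bytes
        else:
            raise ValueError("unknown instruction: %r" % (instruction[0],))
    return bytes(out)


if __name__ == "__main__":
    old_id = b"120af376c147799e6c0069bac1f61709a0286cd6"
    new_id = b"b210b306b7554f28dc687d1c503517d2a5f87082"
    # Placeholder text standing in for the entries before and after.
    base = (b"...entries before...\n100755 blob " + old_id
            + b"\tGIT-VERSION-GEN\n...entries after...\n")
    start = base.index(old_id)
    end = start + len(old_id)
    delta = [
        ("copy", 0, start),                # everything up to the old ID
        ("insert", new_id),                # the one changed value
        ("copy", end, len(base) - end),    # the rest of the previous tree
    ]
    result = apply_delta(base, delta)
    print(result == base.replace(old_id, new_id))
```

The delta only carries the new ID plus two small copy instructions, which is why it can be so much smaller than the full tree.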

Keep in mind that these instructions are also DEFLATE compressed, so the new data chunks can also be compressed similarly to the base object. For the example above, we can see that the root tree for v2.37.0 is around 19KB uncompressed, 14KB compressed, but can be represented as an offset delta in only 50 bytes.

$ git rev-parse v2.37.0^{tree}
a4a2aa60ab45e767b52a26fc80a0a576aef2a010

$ git cat-file -s v2.37.0^{tree}
19388

$ ls -al .git/objects/a4/a2aa60ab45e767b52a26fc80a0a576aef2a010
-r--r--r--   1 ... ... 13966 Aug  1 13:24 a2aa60ab45e767b52a26fc80a0a576aef2a010

$ git rev-parse v2.37.0^{tree} | git cat-file --batch-check="%(objectsize:disk)"
50

Also, an offset delta can be based on another object that is also an offset delta. This creates a delta chain that requires computing the object data for each object in the list. In fact, we need to traverse the delta links in order to even determine the object type.

For this reason, there is a cost to storing objects efficiently this way. At read time, we need to do a bit of extra work to materialize the raw object content Git needs to parse to satisfy its queries. There are multiple ways that Git tries to optimize this trade-off.

One way Git minimizes the extra work when parsing delta chains is by keeping the delta-chains short. The pack.depth config value specifies an upper limit on how long delta chains can be while creating a packfile. The default limit is 50.
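To see why chain length matters at read time, here is a toy model of resolving a delta chain. Each stored entry is either a full object or a delta against another entry, and the reader has to walk back to a full object before it knows the type or can rebuild the content. The store layout, keys, and function names are assumptions for illustration. In Git, pack.depth limits chains when a packfile is written; the max_depth check here is only a sanity limit for the sketch.

```python
def apply_delta(base: bytes, instructions) -> bytes:
    """Apply simplified ("copy", offset, length) / ("insert", data) steps."""
    out = bytearray()
    for instruction in instructions:
        if instruction[0] == "copy":
            _, offset, length = instruction
            out += base[offset:offset + length]
        else:
            out += instruction[1]
    return bytes(out)


def resolve(store, key, max_depth=50):
    """Return (object_type, content, chain_length) for the entry at `key`.

    Entries look like ("full", type, content) or ("delta", base_key, steps).
    """
    pending = []
    entry = store[key]
    while entry[0] == "delta":
        pending.append(entry[2])
        if len(pending) > max_depth:
            raise ValueError("delta chain longer than %d" % max_depth)
        entry = store[entry[1]]
    _, obj_type, content = entry          # the type comes from the chain's base
    for steps in reversed(pending):       # replay deltas from base outward
        content = apply_delta(content, steps)
    return obj_type, content, len(pending)


if __name__ == "__main__":
    store = {
        "newest": ("full", "blob", b"version 3\n"),
        "middle": ("delta", "newest", [("copy", 0, 8), ("insert", b"2\n")]),
        "oldest": ("delta", "middle", [("copy", 0, 8), ("insert", b"1\n")]),
    }
    for key in store:
        print(key, resolve(store, key))
```

In this toy store the newest version is the full object, so reading it costs nothing extra, while the oldest version needs two delta applications.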

When writing a packfile, Git attempts to use a recent object as the base and order the delta chain in reverse-chronological order. This allows the queries that involve recent objects to have minimum overhead, while the queries that involve older objects have slightly more overhead.

However, while thinking about the overhead of computing object contents from a delta chain, it is important to think about what kind of resources are being used. For example, to compute the diff between v2.37.0 and its parent, we need to load both root trees. If these root trees are in the same delta chain, then that chain’s data on disk is smaller than if they were stored in raw form. Since the packfile also places delta chains in adjacent locations in the packfile, the cost of reading the base object and its delta from disk is almost identical to reading just the base object. The extra overhead of some CPU during the parse is very small compared to the disk read. In this way, reading multiple objects in the same delta chain is faster than reading multiple objects across different chains.

In addition, some Git commands query the object store in such a way that we are very likely to parse multiple objects in the same delta chain. We will cover this more in part III when discussing file history queries.

In addition to persisting data efficiently to disk, the packfile format is also critical to how Git synchronizes Git object data across distributed copies of the repository during git fetch and git push. We will learn more about this in part IV when discussing distributed synchronization.

Packfile maintenance

In order to take advantage of packfiles and their compressed representation of Git objects, Git needs to actually write these packfiles. It is too expensive to create a packfile for every object write, so Git batches the packfile write into certain commands.

You could roll your own packfile using git pack-objects and create a pack-index for it using git index-pack. However, you instead might want to recompute a new packfile containing your entire object store using git repack -a or git gc.

As your repository grows, it becomes more difficult to replace your entire object store with a new packfile. For starters, you need enough space to store two copies of your Git object data. In addition, the computation effort to find good delta compression is very expensive and demanding. An optimal way to do delta compression takes quadratic time over the number of objects, which quickly becomes infeasible. Git uses several heuristics to help with this, but still the cost of repacking everything all at once can be more than we are willing to spend, especially if we are just a client repository and not responsible for serving our Git data to multiple users.

There are two primary ways to update your object store for efficient reads without rewriting the entire object store into a new packfile. One is the geometric repacking option where you can run git repack --geometric to repack only a portion of packfiles until the resulting packfiles form a geometric sequence. That is, each packfile is some fixed multiple smaller than the next largest one. This uses the multi-pack-index to keep logarithmic performance for object lookups, but will occasionally tip over to repack all of the object data. That “tip over” moment only happens when the repository doubles in size, which does not happen very often.
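The selection step of geometric repacking can be sketched as follows. This is a simplified model loosely based on my understanding of the approach, not Git's implementation: pack sizes are plain integers (think object counts), the factor of 2 is an assumed default, and the function name is hypothetical. It returns the sizes of the small packs that would be combined so that the remaining packs form a geometric sequence.

```python
def packs_to_combine(sizes, factor=2):
    """Return the pack sizes that should be rolled into one new pack.

    After combining them, every pack is at least `factor` times as large
    as the next smaller one. An empty list means nothing needs repacking.
    """
    ordered = sorted(sizes)
    split = 0
    # Walk from the largest pack down until the progression breaks.
    for i in range(len(ordered) - 1, 0, -1):
        if ordered[i] < factor * ordered[i - 1]:
            split = i + 1   # packs 0..i must be combined
            break
    if split == 0:
        return []
    # The combined pack may now be too close in size to the next pack up,
    # so keep absorbing larger packs until the progression holds again.
    total = sum(ordered[:split])
    while split < len(ordered) and ordered[split] < factor * total:
        total += ordered[split]
        split += 1
    return ordered[:split]


if __name__ == "__main__":
    print(packs_to_combine([1, 2, 4, 8]))      # already geometric
    print(packs_to_combine([100, 3, 2, 1]))    # only the small packs
    print(packs_to_combine([5, 5, 5, 5]))      # the "tip over" case
```

In the second call, the three small packs are combined into one pack of size 6 while the large pack is left alone. The last call shows the occasional case where everything is repacked.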

Another approach to reducing the amount of work spent repacking is the incremental repack task in the git maintenance command. This task collects packfiles below a fixed size threshold and groups them together, at least until their total size is above that threshold. The default threshold is two gigabytes. This task is used by default when you enable background maintenance with the git maintenance start command. This also uses the multi-pack-index to keep fast lookups, but also will not rewrite the entire object store for large repositories since once a packfile is larger than the threshold it is not considered for repacking. The storage is slightly inefficient here, since objects in newer packfiles could be stored as deltas to objects in those fixed packs, but the simplicity in avoiding expensive repository maintenance is worth that slight overhead.
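As a rough reading of that description, the batch selection might look like the sketch below: consider only packs under the threshold, take the smallest first, and stop once the batch reaches the threshold. The pack names, the ordering choice, and the function name are my own assumptions; this is not the code behind git maintenance.

```python
def select_repack_batch(pack_sizes, threshold):
    """Pick small packs to combine, given {pack_name: size_in_bytes}.

    Packs at or above `threshold` are never selected, so large packs are
    left untouched. Returns an empty list if fewer than two packs qualify.
    """
    small = sorted(
        (size, name) for name, size in pack_sizes.items() if size < threshold
    )
    batch, total = [], 0
    for size, name in small:
        if total >= threshold:
            break               # the batch is already big enough
        batch.append(name)
        total += size
    return batch if len(batch) >= 2 else []


if __name__ == "__main__":
    sizes = {"pack-a": 10, "pack-b": 20, "pack-c": 5000, "pack-d": 30}
    print(select_repack_batch(sizes, threshold=50))
```

Because pack-c is above the threshold, it is never rewritten, which is the property that keeps this task cheap for large repositories.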

If you’re interested in keeping your repositories well maintained, then think about these options. You can always perform a full repack that recomputes all delta chains using git repack -adf at any time you are willing to spend that upfront maintenance cost.

What could Git learn from other databases?

Now that we have some understanding about how Git stores and accesses packed object data, let’s think about features that exist in application database systems that might be helpful here.

One thing to note is that there are no B-trees to be found! Almost every database introduction talks about how B-trees are used to efficiently index data in a database table. Why are they not present here in Git?

The main reason Git does not use B-trees is because it doesn’t do “live updating” of packfiles and pack-indexes. Once a packfile is written, it is static until it is replaced by another packfile containing its objects. That packfile is also not accessed by Git processes until its pack-index is completely written. 

In this world, objects are dynamically added to the object store by adding new loose object files (such as in git add or git commit) or by adding new packfiles (such as in git fetch). If a packfile has fixed content, then we can do the most space-and-time efficient index: a binary search tree. Specifically, performing binary search on the list of object IDs in a pack-index is very efficient. It’s not an exact binary search because there is an initial fan-out table for the first byte of the object ID. It’s kind of like a rooted binary tree, except the root node has 256 children instead of only two. 

B-trees excel when data is being inserted or removed from the tree. Being able to track those modifications with minimal modifications to the overall tree structure is critical for an application database serving many concurrent requests. 

Git does not currently have the capability to update a packfile in real time without shutting down concurrent reads from that file. Such a change could be possible, but it would require updating Git’s storage significantly. I think this is one area where a database expert could contribute to the Git project in really interesting ways.

Another difference between Git and most database systems is that Git runs as short-lived processes. Typically, we think of the database as a process that has data cached in memory. We send queries to the existing process and it returns results and keeps running. Instead, Git starts a new process with every “query” and relies on the filesystem for persisted state. Git also relies on the operating system to cache the disk pages during and between the processes. Expert database systems tell the kernel to stop managing disk pages and instead the database manages the page cache since it knows its usage needs better than a general purpose operating system could predict.

What if Git had a long-running daemon that could satisfy queries on-demand, but also keep that in-memory representation of data instead of needing to parse objects from disk every time? Although the current architecture of Git is not well-suited to this, I believe it is an idea worth exploring in the future.

Come back tomorrow for more!

In the next part of this blog series, we will explore how Git commit history queries use the structure of Git commits to present interesting information to the user. We’ll also explore the commit-graph file and how it acts as a specialized query index for these commands.

I’ll also be speaking at Git Merge 2022 covering all five parts of this blog series, so I look forward to seeing you there!
