这是用户在 2024-5-27 23:16 为 https://arrow.apache.org/docs/format/Columnar.html#struct-layout 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Arrow Columnar Format# Arrow 列格式#

Version: 1.4 版本:1.4

The Arrow columnar format includes a language-agnostic in-memory data structure specification, metadata serialization, and a protocol for serialization and generic data transport.
Arrow 列式格式包括语言无关的内存数据结构规范、元数据序列化以及序列化和通用数据传输的协议。

This document is intended to provide adequate detail to create a new implementation of the columnar format without the aid of an existing implementation. We utilize Google’s Flatbuffers project for metadata serialization, so it will be necessary to refer to the project’s Flatbuffers protocol definition files while reading this document.
本文档旨在提供足够的细节,以便在没有现有实现的帮助下创建新的列式格式实现。我们使用 Google 的 Flatbuffers 项目进行元数据序列化,因此在阅读本文档时,将需要参考该项目的 Flatbuffers 协议定义文件。

The columnar format has some key features:

  • Data adjacency for sequential access (scans)

  • O(1) (constant-time) random access

  • SIMD and vectorization-friendly
    SIMD 和向量化友好

  • Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory

The Arrow columnar format provides analytical performance and data locality guarantees in exchange for comparatively more expensive mutation operations. This document is concerned only with in-memory data representation and serialization details; issues such as coordinating mutation of data structures are left to be handled by implementations.
Arrow 列式格式提供了分析性能和数据局部性保证,以换取相对更昂贵的变更操作。本文档仅关注内存数据表示和序列化细节;诸如协调数据结构变更的问题则留给实现来处理。

Terminology# 术语 #

Since different projects have used different words to describe various concepts, here is a small glossary to help disambiguate.

  • Array or Vector: a sequence of values with known length all having the same type. These terms are used interchangeably in different Arrow implementations, but we use “array” in this document.
    数组或向量:具有已知长度且所有值类型相同的值序列。这些术语在不同的 Arrow 实现中可以互换使用,但在本文档中我们使用“数组”。

  • Slot: a single logical value in an array of some particular data type

  • Buffer or Contiguous memory region: a sequential virtual address space with a given length. Any byte can be reached via a single pointer offset less than the region’s length.

  • Physical Layout: The underlying memory layout for an array without taking into account any value semantics. For example, a 32-bit signed integer array and 32-bit floating point array have the same layout.
    物理布局:不考虑任何值语义的情况下,数组的基础内存布局。例如,32 位有符号整数数组和 32 位浮点数数组具有相同的布局。

  • Parent and child arrays: names to express relationships between physical value arrays in a nested type structure. For example, a List<T>-type parent array has a T-type array as its child (see more on lists below).
    父子数组:用于表示嵌套类型结构中物理值数组之间关系的名称。例如,一个 List<T> 类型的父数组拥有一个 T 类型的子数组(有关列表的更多信息请参见下文)。

  • Primitive type: a data type having no child types. This includes such types as fixed bit-width, variable-size binary, and null types.

  • Nested type: a data type whose full structure depends on one or more other child types. Two fully-specified nested types are equal if and only if their child types are equal. For example, List<U> is distinct from List<V> iff U and V are different types.
    嵌套类型:其完整结构依赖于一个或多个其他子类型的数据类型。如果且仅当它们的子类型相等时,两个完全指定的嵌套类型才相等。例如,当且仅当 U 和 V 是不同的类型时, List<U> 才与 List<V> 不同。

  • Logical type: An application-facing semantic value type that is implemented using some physical layout. For example, Decimal values are stored as 16 bytes in a fixed-size binary layout. Similarly, strings can be stored as List<1-byte>. A timestamp may be stored as 64-bit fixed-size layout.
    逻辑类型:一种使用某种物理布局实现的面向应用的语义值类型。例如,十进制值以固定大小的 16 字节二进制布局存储。同样,字符串可以存储为 List<1-byte> 。时间戳可能以 64 位固定大小布局存储。

Physical Memory Layout# 物理内存布局 #

Arrays are defined by a few pieces of metadata and data:

  • A logical data type. 逻辑数据类型。

  • A sequence of buffers. 一系列的缓冲区。

  • A length as a 64-bit signed integer. Implementations are permitted to be limited to 32-bit lengths, see more on this below.
    长度作为 64 位有符号整数。实现可以限制为 32 位长度,有关此的更多信息请参见下文。

  • A null count as a 64-bit signed integer.
    一个空计数作为一个 64 位有符号整数。

  • An optional dictionary, for dictionary-encoded arrays.

Nested arrays additionally have a sequence of one or more sets of these items, called the child arrays.

Each logical data type has a well-defined physical layout. Here are the different physical layouts defined by Arrow:
每种逻辑数据类型都有明确定义的物理布局。以下是 Arrow 定义的不同物理布局:

  • Primitive (fixed-size): a sequence of values each having the same byte or bit width

  • Variable-size Binary: a sequence of values each having a variable byte length. Two variants of this layout are supported using 32-bit and 64-bit length encoding.
    可变大小二进制:一系列每个具有可变字节长度的值。支持使用 32 位和 64 位长度编码的这种布局的两种变体。

  • View of Variable-size Binary: a sequence of values each having a variable byte length. In contrast to Variable-size Binary, the values of this layout are distributed across potentially multiple buffers instead of densely and sequentially packed in a single buffer.

  • Fixed-size List: a nested layout where each value has the same number of elements taken from a child data type.

  • Variable-size List: a nested layout where each value is a variable-length sequence of values taken from a child data type. Two variants of this layout are supported using 32-bit and 64-bit length encoding.
    可变大小列表:一种嵌套布局,其中每个值都是从子数据类型中取得的可变长度值序列。支持使用 32 位和 64 位长度编码的这种布局的两种变体。

  • View of Variable-size List: a nested layout where each value is a variable-length sequence of values taken from a child data type. This layout differs from Variable-size List by having an additional buffer containing the sizes of each list value. This removes a constraint on the offsets buffer — it does not need to be in order.

  • Struct: a nested layout consisting of a collection of named child fields each having the same length but possibly different types.

  • Sparse and Dense Union: a nested layout representing a sequence of values, each of which can have type chosen from a collection of child array types.

  • Dictionary-Encoded: a layout consisting of a sequence of integers (any bit-width) which represent indexes into a dictionary which could be of any type.

  • Run-End Encoded (REE): a nested layout consisting of two child arrays, one representing values, and one representing the logical index where the run of a corresponding value ends.

  • Null: a sequence of all null values, having null logical type

The Arrow columnar memory layout only applies to data and not metadata. Implementations are free to represent metadata in-memory in whichever form is convenient for them. We handle metadata serialization in an implementation-independent way using Flatbuffers, detailed below.
Arrow 列式内存布局仅适用于数据,而不适用于元数据。实现可以自由地以对它们方便的形式表示内存中的元数据。我们使用 Flatbuffers 以与实现无关的方式处理元数据序列化,详细信息如下。

Buffer Alignment and Padding#
缓冲对齐和填充 #

Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a length that is a multiple of 8 or 64 bytes. When serializing Arrow data for interprocess communication, these alignment and padding requirements are enforced. If possible, we suggest that you prefer using 64-byte alignment and padding. Unless otherwise noted, padded bytes do not need to have a specific value.
建议实现在对齐的地址(8 或 64 字节的倍数)上分配内存,并填充(超额分配)到 8 或 64 字节的倍数长度。在序列化 Arrow 数据进行进程间通信时,将强制执行这些对齐和填充要求。如果可能,我们建议您更倾向于使用 64 字节的对齐和填充。除非另有说明,填充字节不需要具有特定值。

The alignment requirement follows best practices for optimized memory access:

  • Elements in numeric arrays will be guaranteed to be retrieved via aligned access.

  • On some architectures alignment can help limit partially used cache lines.

The recommendation for 64 byte alignment comes from the Intel performance guide that recommends alignment of memory to match SIMD register width. The specific padding length was chosen because it matches the largest SIMD instruction registers available on widely deployed x86 architecture (Intel AVX-512).
对 64 字节对齐的推荐来自英特尔性能指南,该指南建议将内存对齐以匹配 SIMD 寄存器宽度。选择特定的填充长度是因为它与广泛部署的 x86 架构上可用的最大 SIMD 指令寄存器(英特尔 AVX-512)相匹配。

The recommended padding of 64 bytes allows for using SIMD instructions consistently in loops without additional conditional checks. This should allow for simpler, efficient and CPU cache-friendly code. In other words, we can load the entire 64-byte buffer into a 512-bit wide SIMD register and get data-level parallelism on all the columnar values packed into the 64-byte buffer. Guaranteed padding can also allow certain compilers to generate more optimized code directly (e.g. One can safely use Intel’s -qopt-assume-safe-padding).
推荐的 64 字节填充允许在循环中一致使用 SIMD 指令,无需额外的条件检查。这应该可以实现更简单、高效且 CPU 缓存友好的代码。换句话说,我们可以将整个 64 字节缓冲区加载到 512 位宽的 SIMD 寄存器中,并在所有打包到 64 字节缓冲区的列值上获取数据级并行性。保证的填充也可以让某些编译器直接生成更优化的代码(例如,可以安全地使用 Intel 的 -qopt-assume-safe-padding )。

Array lengths# 数组长度 #

Array lengths are represented in the Arrow metadata as a 64-bit signed integer. An implementation of Arrow is considered valid even if it only supports lengths up to the maximum 32-bit signed integer, though. If using Arrow in a multi-language environment, we recommend limiting lengths to 2 31 - 1 elements or less. Larger data sets can be represented using multiple array chunks.
数组长度在 Arrow 元数据中以 64 位有符号整数表示。即使 Arrow 的实现只支持最大的 32 位有符号整数长度,也被认为是有效的。如果在多语言环境中使用 Arrow,我们建议将长度限制为 2 31 - 1 个元素或更少。可以使用多个数组块来表示更大的数据集。

Null count# 空值计数 #

The number of null value slots is a property of the physical array and considered part of the data structure. The null count is represented in the Arrow metadata as a 64-bit signed integer, as it may be as large as the array length.
空值插槽的数量是物理数组的属性,被视为数据结构的一部分。空值计数在 Arrow 元数据中以 64 位有符号整数表示,因为它可能与数组长度一样大。

Validity bitmaps# 有效位图 #

Any value in an array may be semantically null, whether primitive or nested type.

All array types, with the exception of union types (more on these later), utilize a dedicated memory buffer, known as the validity (or “null”) bitmap, to encode the nullness or non-nullness of each value slot. The validity bitmap must be large enough to have at least 1 bit for each array slot.
所有数组类型,除联合类型(稍后将详细介绍)外,都使用一个专用的内存缓冲区,称为有效性(或“空”)位图,来编码每个值槽的空或非空状态。有效性位图必须足够大,至少为每个数组槽提供 1 位。

Whether any array slot is valid (non-null) is encoded in the respective bits of this bitmap. A 1 (set bit) for index j indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be initialized to be all unset at allocation time (this includes padding):
此位图的各个位编码了数组插槽是否有效(非空)。索引 j 的 1(设定位)表示该值不为空,而 0(未设定位)表示该值为空。在分配时间,位图应初始化为全部未设定(包括填充):

is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))

We use least-significant bit (LSB) numbering (also known as bit-endianness). This means that within a group of 8 bits, we read right-to-left:
我们使用最低有效位(LSB)编号(也称为位端序)。这意味着在一组 8 位中,我们从右向左阅读:

values = [0, 1, null, 2, null, 3]

j mod 8   7  6  5  4  3  2  1  0
          0  0  1  0  1  0  1  1

Arrays having a 0 null count may choose to not allocate the validity bitmap; how this is represented depends on the implementation (for example, a C++ implementation may represent such an “absent” validity bitmap using a NULL pointer). Implementations may choose to always allocate a validity bitmap anyway as a matter of convenience. Consumers of Arrow arrays should be ready to handle those two possibilities.
具有 0 空值计数的数组可以选择不分配有效性位图;这如何表示取决于实现(例如,C++实现可能使用 NULL 指针表示这样一个“缺失”的有效性位图)。实现可能会选择始终分配一个有效性位图,以方便为主。Arrow 数组的消费者应准备好处理这两种可能性。

Nested type arrays (except for union types as noted above) have their own top-level validity bitmap and null count, regardless of the null count and valid bits of their child arrays.

Array slots which are null are not required to have a particular value; any “masked” memory can have any value and need not be zeroed, though implementations frequently choose to zero memory for null values.

Fixed-size Primitive Layout#
固定大小的原始布局 #

A primitive value array represents an array of values each having the same physical slot width typically measured in bytes, though the spec also provides for bit-packed types (e.g. boolean values encoded in bits).

Internally, the array contains a contiguous memory buffer whose total size is at least as large as the slot width multiplied by the array length. For bit-packed types, the size is rounded up to the nearest byte.

The associated validity bitmap is contiguously allocated (as described above) but does not need to be adjacent in memory to the values buffer.

Example Layout: Int32 Array
示例布局:Int32 数组

For example a primitive array of int32s:
例如一个原始的 int32s 数组:

[1, null, 2, 4, 8]

Would look like: 看起来像:

* Length: 5, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00011101                 | 0 (padding)           |

* Value Buffer:

  | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
  | 1           | unspecified | 2           | 4           | 8           | unspecified (padding) |

Example Layout: Non-null int32 Array
示例布局:非空 int32 数组

[1, 2, 3, 4, 8] has two possible layouts:
[1, 2, 3, 4, 8] 有两种可能的布局:

* Length: 5, Null count: 0
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00011111                 | 0 (padding)           |

* Value Buffer:

  | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
  | 1           | 2           | 3           | 4           | 8           | unspecified (padding) |

or with the bitmap elided:

* Length 5, Null count: 0
* Validity bitmap buffer: Not required
* Value Buffer:

  | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | bytes 12-15 | bytes 16-19 | Bytes 20-63           |
  | 1           | 2           | 3           | 4           | 8           | unspecified (padding) |

Variable-size Binary Layout#

Each value in this layout consists of 0 or more bytes. While primitive arrays have a single values buffer, variable-size binary have an offsets buffer and data buffer.
此布局中的每个值由 0 个或更多字节组成。虽然原始数组只有一个值缓冲区,但可变大小的二进制有一个偏移缓冲区和数据缓冲区。

The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the logical type), which encode the start position of each slot in the data buffer. The length of the value in each slot is computed using the difference between the offset at that slot’s index and the subsequent offset. For example, the position and length of slot j is computed as:
偏移量缓冲区包含长度+1 的有符号整数(根据逻辑类型,可以是 32 位或 64 位),这些整数编码了数据缓冲区中每个插槽的起始位置。每个插槽中的值的长度是通过计算该插槽索引处的偏移量和后续偏移量之间的差值来计算的。例如,插槽 j 的位置和长度是这样计算的:

slot_position = offsets[j]
slot_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)

It should be noted that a null value may have a positive slot length. That is, a null value may occupy a non-empty memory space in the data buffer. When this is true, the content of the corresponding memory space is undefined.

Offsets must be monotonically increasing, that is offsets[j+1] >= offsets[j] for 0 <= j < length, even for null slots. This property ensures the location for all values is valid and well defined.
偏移量必须单调增加,即使对于空槽, 0 <= j < lengthoffsets[j+1] >= offsets[j] 也是如此。这个属性确保所有值的位置都是有效的和明确的。

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.
通常,偏移数组的第一个插槽为 0,最后一个插槽是值数组的长度。在序列化此布局时,我们建议将偏移量标准化为从 0 开始。

Example Layout: ``VarBinary``

['joe', null, null, 'mark']

will be represented as follows:

* Length: 4, Null count: 2
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00001001                 | 0 (padding)           |

* Offsets buffer:

  | Bytes 0-19     | Bytes 20-63           |
  | 0, 3, 3, 3, 7  | unspecified (padding) |

 * Value buffer:

  | Bytes 0-6      | Bytes 7-63            |
  | joemark        | unspecified (padding) |

Variable-size Binary View Layout#

New in version Arrow: Columnar Format 1.4
在 Arrow 版本中的新功能:列式格式 1.4

Each value in this layout consists of 0 or more bytes. These bytes’ locations are indicated using a views buffer, which may point to one of potentially several data buffers or may contain the characters inline.
此布局中的每个值由 0 个或更多字节组成。这些字节的位置由视图缓冲区指示,该缓冲区可能指向潜在的多个数据缓冲区之一,或者可能包含内联字符。

The views buffer contains length view structures with the following layout:

* Short strings, length <= 12
  | Bytes 0-3  | Bytes 4-15                            |
  | length     | data (padded with 0)                  |

* Long strings, length > 12
  | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
  | length     | prefix     | buf. index | offset      |

In both the long and short string cases, the first four bytes encode the length of the string and can be used to determine how the rest of the view should be interpreted.

In the short string case the string’s bytes are inlined — stored inside the view itself, in the twelve bytes which follow the length. Any remaining bytes after the string itself are padded with 0.
在短字符串的情况下,字符串的字节是内联的——存储在视图本身内部,在长度后面的十二个字节中。字符串本身后面的任何剩余字节都用 0 填充。

In the long string case, a buffer index indicates which data buffer stores the data bytes and an offset indicates where in that buffer the data bytes begin. Buffer index 0 refers to the first data buffer, IE the first buffer after the validity buffer and the views buffer. The half-open range [offset, offset + length) must be entirely contained within the indicated buffer. A copy of the first four bytes of the string is stored inline in the prefix, after the length. This prefix enables a profitable fast path for string comparisons, which are frequently determined within the first four bytes.
在长字符串情况下,缓冲区索引指示哪个数据缓冲区存储数据字节,偏移量指示在该缓冲区中数据字节开始的位置。缓冲区索引 0 指的是第一个数据缓冲区,即有效性缓冲区和视图缓冲区之后的第一个缓冲区。半开放范围 [offset, offset + length) 必须完全包含在指示的缓冲区内。字符串的前四个字节的副本存储在前缀中,紧跟在长度之后。这个前缀使得字符串比较的快速路径变得有利可图,这些比较通常在前四个字节内确定。

All integers (length, buffer index, and offset) are signed.

This layout is adapted from TU Munich’s UmbraDB.
此布局改编自 TU 慕尼黑的 UmbraDB。

Note that this layout uses one additional buffer to store the variadic buffer lengths in the Arrow C data interface.
请注意,此布局使用了一个额外的缓冲区来存储 Arrow C 数据接口中的可变参数缓冲区长度。

Variable-size List Layout#

List is a nested type which is semantically similar to variable-size binary. There are two list layout variations — “list” and “list-view” — and each variation can be delimited by either 32-bit or 64-bit offsets integers.
列表是一种嵌套类型,语义上类似于可变大小的二进制。有两种列表布局变体 - “列表”和“列表视图” - 每种变体都可以由 32 位或 64 位偏移整数分隔。

List Layout# 列表布局#

The List layout is defined by two buffers, a validity bitmap and an offsets buffer, and a child array. The offsets are the same as in the variable-size binary case, and both 32-bit and 64-bit signed integer offsets are supported options for the offsets. Rather than referencing an additional data buffer, instead these offsets reference the child array.
列表布局由两个缓冲区、一个有效性位图和一个偏移缓冲区以及一个子数组定义。偏移量与变量大小二进制情况相同,32 位和 64 位有符号整数偏移量都是偏移量的支持选项。这些偏移量不是引用额外的数据缓冲区,而是引用子数组。

Similar to the layout of variable-size binary, a null value may correspond to a non-empty segment in the child array. When this is true, the content of the corresponding segment can be arbitrary.

A list type is specified like List<T>, where T is any type (primitive or nested). In these examples we use 32-bit offsets where the 64-bit offset version would be denoted by LargeList<T>.
列表类型被指定为 List<T> ,其中 T 可以是任何类型(基本类型或嵌套类型)。在这些示例中,我们使用 32 位偏移量,如果使用 64 位偏移量版本,则会被表示为 LargeList<T>

Example Layout: ``List<Int8>`` Array
示例布局:``List<Int8>`` 数组

We illustrate an example of List<Int8> with length 4 having values:
我们举例说明一个长度为 4,具有值的 List<Int8>

[[12, -7, 25], null, [0, -127, 127, 50], []]

will have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00001101                 | 0 (padding)           |

* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
  | 0          | 3           | 3           | 7           | 7           | unspecified (padding) |

* Values array (Int8Array):
  * Length: 7,  Null count: 0
  * Validity bitmap buffer: Not required
  * Values buffer (int8)

    | Bytes 0-6                    | Bytes 7-63            |
    | 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |

Example Layout: ``List<List<Int8>>``

[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]

will be represented as follows:

* Length 3
* Nulls count: 0
* Validity bitmap buffer: Not required
* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 | Bytes 16-63           |
  | 0          |  2         |  5         |  6          | unspecified (padding) |

* Values array (`List<Int8>`)
  * Length: 6, Null count: 1
  * Validity bitmap buffer:

    | Byte 0 (validity bitmap) | Bytes 1-63  |
    | 00110111                 | 0 (padding) |

  * Offsets buffer (int32)

    | Bytes 0-27           | Bytes 28-63           |
    | 0, 2, 4, 7, 7, 8, 10 | unspecified (padding) |

  * Values array (Int8):
    * Length: 10, Null count: 0
    * Validity bitmap buffer: Not required

      | Bytes 0-9                     | Bytes 10-63           |
      | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |

ListView Layout# ListView 布局#

New in version Arrow: Columnar Format 1.4
Arrow 新版本:列式格式 1.4

The ListView layout is defined by three buffers: a validity bitmap, an offsets buffer, and an additional sizes buffer. Sizes and offsets have the identical bit width and both 32-bit and 64-bit signed integer options are supported.
ListView 布局由三个缓冲区定义:有效性位图、偏移缓冲区和额外的大小缓冲区。大小和偏移具有相同的位宽,支持 32 位和 64 位有符号整数选项。

As in the List layout, the offsets encode the start position of each slot in the child array. In contrast to the List layout, list lengths are stored explicitly in the sizes buffer instead of inferred. This allows offsets to be out of order. Elements of the child array do not have to be stored in the same order they logically appear in the list elements of the parent array.

Every list-view value, including null values, has to guarantee the following invariants:

0 <= offsets[i] <= length of the child array
0 <= offsets[i] + size[i] <= length of the child array

A list-view type is specified like ListView<T>, where T is any type (primitive or nested). In these examples we use 32-bit offsets and sizes where the 64-bit version would be denoted by LargeListView<T>.
列表视图类型被指定为 ListView<T> ,其中 T 可以是任何类型(基本类型或嵌套类型)。在这些示例中,我们使用 32 位偏移和大小,其中 64 位版本将由 LargeListView<T> 表示。

Example Layout: ``ListView<Int8>`` Array
示例布局:``ListView<Int8>`` 数组

We illustrate an example of ListView<Int8> with length 4 having values:
我们举例说明一个长度为 4,具有值的 ListView<Int8>

[[12, -7, 25], null, [0, -127, 127, 50], []]

It may have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00001101                 | 0 (padding)           |

* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
  | 0          | 7           | 3           | 0           | unspecified (padding) |

* Sizes buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
  | 3          | 0           | 4           | 0           | unspecified (padding) |

* Values array (Int8Array):
  * Length: 7,  Null count: 0
  * Validity bitmap buffer: Not required
  * Values buffer (int8)

    | Bytes 0-6                    | Bytes 7-63            |
    | 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |

Example Layout: ``ListView<Int8>`` Array
示例布局:``ListView<Int8>`` 数组

We continue with the ListView<Int8> type, but this instance illustrates out of order offsets and sharing of child array values. It is an array with length 5 having logical values:
我们继续使用 ListView<Int8> 类型,但这个实例说明了顺序错乱的偏移量和子数组值的共享。它是一个长度为 5 的数组,具有逻辑值:

[[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]

It may have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00011101                 | 0 (padding)           |

* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
  | 4          | 7           | 0           | 0           | 3           | unspecified (padding) |

* Sizes buffer (int32)

  | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63           |
  | 3          | 0           | 4           | 0           | 2           | unspecified (padding) |

* Values array (Int8Array):
  * Length: 7,  Null count: 0
  * Validity bitmap buffer: Not required
  * Values buffer (int8)

    | Bytes 0-6                    | Bytes 7-63            |
    | 0, -127, 127, 50, 12, -7, 25 | unspecified (padding) |

Fixed-Size List Layout# 固定大小列表布局#

Fixed-Size List is a nested type in which each array slot contains a fixed-size sequence of values all having the same type.

A fixed size list type is specified like FixedSizeList<T>[N], where T is any type (primitive or nested) and N is a 32-bit signed integer representing the length of the lists.
固定大小的列表类型被指定为 FixedSizeList<T>[N] ,其中 T 可以是任何类型(基本类型或嵌套类型), N 是一个 32 位有符号整数,表示列表的长度。

A fixed size list array is represented by a values array, which is a child array of type T. T may also be a nested type. The value in slot j of a fixed size list array is stored in an N-long slice of the values array, starting at an offset of j * N.
固定大小的列表数组由值数组表示,该值数组是类型为 T 的子数组。 T 也可能是嵌套类型。固定大小列表数组中 j 插槽的值存储在值数组的 N 长切片中,从偏移量 j * N 开始。

Example Layout: ``FixedSizeList<byte>[4]`` Array
示例布局:``FixedSizeList<byte>[4]`` 数组

Here we illustrate FixedSizeList<byte>[4].
在这里,我们展示 FixedSizeList<byte>[4]

For an array of length 4 with respective values:
对于一个长度为 4 的数组,其各自的值为:

[[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]]

will have the following representation:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00001101                 | 0 (padding)           |

* Values array (byte array):
  * Length: 16,  Null count: 0
  * validity bitmap buffer: Not required

    | Bytes 0-3       | Bytes 4-7   | Bytes 8-15                      |
    | 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 |

Struct Layout# 结构布局#

A struct is a nested type parameterized by an ordered sequence of types (which can all be distinct), called its fields. Each field must have a UTF8-encoded name, and these field names are part of the type metadata.
结构体是一种嵌套类型,由一系列有序的类型(可以全部不同)参数化,称为其字段。每个字段必须有一个 UTF8 编码的名称,这些字段名称是类型元数据的一部分。

Physically, a struct array has one child array for each field. The child arrays are independent and need not be adjacent to each other in memory. A struct array also has a validity bitmap to encode top-level validity information.

For example, the struct (field names shown here as strings for illustration purposes):

Struct <
  name: VarBinary
  age: Int32

has two child arrays, one VarBinary array (using variable-size binary layout) and one 4-byte primitive value array having Int32 logical type.
具有两个子数组,一个 VarBinary 数组(使用可变大小的二进制布局)和一个具有 Int32 逻辑类型的 4 字节原始值数组。

Example Layout: ``Struct<VarBinary, Int32>``
示例布局:``Struct<VarBinary, Int32>``

The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}], having child arrays ['joe', null, 'alice', 'mark'] and [1, 2, null, 4] would be:
对于 [{'joe', 1}, {null, 2}, null, {'mark', 4}] 的布局,拥有子数组 ['joe', null, 'alice', 'mark'][1, 2, null, 4] 将会是:

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  | 00001011                 | 0 (padding)           |

* Children arrays:
  * field-0 array (`VarBinary`):
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00001101                 | 0 (padding)           |

    * Offsets buffer:

      | Bytes 0-19     | Bytes 20-63           |
      | 0, 3, 3, 8, 12 | unspecified (padding) |

     * Value buffer:

      | Bytes 0-11     | Bytes 12-63           |
      | joealicemark   | unspecified (padding) |

  * field-1 array (int32 array):
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00001011                 | 0 (padding)           |

    * Value Buffer:

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
      | 1           | 2           | unspecified | 4           | unspecified (padding) |

Struct Validity# 结构有效性 #

A struct array has its own validity bitmap that is independent of its child arrays’ validity bitmaps. The validity bitmap for the struct array might indicate a null when one or more of its child arrays has a non-null value in its corresponding slot; or conversely, a child array might indicate a null in its validity bitmap while the struct array’s validity bitmap shows a non-null value.

Therefore, to know whether a particular child entry is valid, one must take the logical AND of the corresponding bits in the two validity bitmaps (the struct array’s and the child array’s).

This is illustrated in the example above, one of the child arrays has a valid entry 'alice' for the null struct but it is “hidden” by the struct array’s validity bitmap. However, when treated independently, corresponding entries of the children array will be non-null.
如上例所示,子数组中的一个有一个对空结构有效的条目 'alice' ,但它被结构数组的有效性位图“隐藏”了。然而,当独立处理时,子数组的相应条目将是非空的。

Union Layout# 联合布局 #

A union is defined by an ordered sequence of types; each slot in the union can have a value chosen from these types. The types are named like a struct’s fields, and the names are part of the type metadata.

Unlike other data types, unions do not have their own validity bitmap. Instead, the nullness of each slot is determined exclusively by the child arrays which are composed to create the union.

We define two distinct union types, “dense” and “sparse”, that are optimized for different use cases.

Dense Union# 密集联盟 #

Dense union represents a mixed-type array with 5 bytes of overhead for each value. Its physical layout is as follows:
密集联合表示每个值需要 5 字节开销的混合类型数组。其物理布局如下:

  • One child array for each type

  • Types buffer: A buffer of 8-bit signed integers. Each type in the union has a corresponding type id whose values are found in this buffer. A union with more than 127 possible types can be modeled as a union of unions.
    类型缓冲区:一个 8 位有符号整数的缓冲区。联合中的每种类型都有一个相应的类型 id,其值在此缓冲区中找到。一个可能的类型超过 127 的联合可以被建模为联合的联合。

  • Offsets buffer: A buffer of signed Int32 values indicating the relative offset into the respective child array for the type in a given slot. The respective offsets for each child value array must be in order / increasing.
    偏移缓冲区:一个包含有符号 Int32 值的缓冲区,表示给定插槽中类型在各自子数组中的相对偏移量。每个子值数组的各自偏移量必须按顺序/递增排列。

Example Layout: ``DenseUnion<f: Float32, i: Int32>``
示例布局:``DenseUnion<f: Float32, i: Int32>``

For the union array:

[{f=1.2}, null, {f=3.4}, {i=5}]

will have the following layout:

* Length: 4, Null count: 0
* Types buffer:

  | Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63            |
  | 0        | 0           | 0        | 1        | unspecified (padding) |

* Offset buffer:

  | Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63           |
  | 0         | 1           | 2          | 0           | unspecified (padding) |

* Children arrays:
  * Field-0 array (f: Float32):
    * Length: 3, Null count: 1
    * Validity bitmap buffer: 00000101

    * Value Buffer:

      | Bytes 0-11     | Bytes 12-63           |
      | 1.2, null, 3.4 | unspecified (padding) |

  * Field-1 array (i: Int32):
    * Length: 1, Null count: 0
    * Validity bitmap buffer: Not required

    * Value Buffer:

      | Bytes 0-3 | Bytes 4-63            |
      | 5         | unspecified (padding) |

Sparse Union# 稀疏联合 #

A sparse union has the same structure as a dense union, with the omission of the offsets array. In this case, the child arrays are each equal in length to the length of the union.

While a sparse union may use significantly more space compared with a dense union, it has some advantages that may be desirable in certain use cases:

  • A sparse union is more amenable to vectorized expression evaluation in some use cases.

  • Equal-length arrays can be interpreted as a union by only defining the types array.

Example layout: ``SparseUnion<i: Int32, f: Float32, s: VarBinary>``
示例布局:``SparseUnion<i: Int32, f: Float32, s: VarBinary>``

For the union array:

[{i=5}, {f=1.2}, {s='joe'}, {f=3.4}, {i=4}, {s='mark'}]

will have the following layout:

* Length: 6, Null count: 0
* Types buffer:

 | Byte 0     | Byte 1      | Byte 2      | Byte 3      | Byte 4      | Byte 5       | Bytes  6-63           |
 | 0          | 1           | 2           | 1           | 0           | 2            | unspecified (padding) |

* Children arrays:

  * i (Int32):
    * Length: 6, Null count: 4
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00010001                 | 0 (padding)           |

    * Value buffer:

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23  | Bytes 24-63           |
      | 5           | unspecified | unspecified | unspecified | 4           |  unspecified | unspecified (padding) |

  * f (Float32):
    * Length: 6, Null count: 4
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00001010                 | 0 (padding)           |

    * Value buffer:

      | Bytes 0-3    | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63           |
      | unspecified  | 1.2         | unspecified | 3.4         | unspecified | unspecified | unspecified (padding) |

  * s (`VarBinary`)
    * Length: 6, Null count: 4
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00100100                 | 0 (padding)           |

    * Offsets buffer (Int32)

      | Bytes 0-3  | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63            |
      | 0          | 0           | 0           | 3           | 3           | 3           | 7           | unspecified (padding)  |

    * Values buffer:

      | Bytes 0-6  | Bytes 7-63            |
      | joemark    | unspecified (padding) |

Only the slot in the array corresponding to the type index is considered. All “unselected” values are ignored and could be any semantically correct array value.

Null Layout# 空布局 #

We provide a simplified memory-efficient layout for the Null data type where all values are null. In this case no memory buffers are allocated.
我们为 Null 数据类型提供了一种简化的内存高效布局,其中所有值都为空。在这种情况下,不会分配任何内存缓冲区。

Dictionary-encoded Layout#

Dictionary encoding is a data representation technique to represent values by integers referencing a dictionary usually consisting of unique values. It can be effective when you have data with many repeated values.

Any array can be dictionary-encoded. The dictionary is stored as an optional property of an array. When a field is dictionary encoded, the values are represented by an array of non-negative integers representing the index of the value in the dictionary. The memory layout for a dictionary-encoded array is the same as that of a primitive integer layout. The dictionary is handled as a separate columnar array with its own respective layout.

As an example, you could have the following data:

type: VarBinary

['foo', 'bar', 'foo', 'bar', null, 'baz']

In dictionary-encoded form, this could appear as:

data VarBinary (dictionary-encoded)
   index_type: Int32
   values: [0, 1, 0, 1, null, 2]

   type: VarBinary
   values: ['foo', 'bar', 'baz']

Note that a dictionary is permitted to contain duplicate values or nulls:

data VarBinary (dictionary-encoded)
   index_type: Int32
   values: [0, 1, 3, 1, 4, 2]

   type: VarBinary
   values: ['foo', 'bar', 'baz', 'foo', null]

The null count of such arrays is dictated only by the validity bitmap of its indices, irrespective of any null values in the dictionary.

Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.
由于在某些情况下(例如在 JVM 中)无符号整数可能更难处理,我们建议优先使用有符号整数而不是无符号整数来表示字典索引。此外,我们建议除非应用程序需要,否则避免使用 64 位无符号整数索引。

We discuss dictionary encoding as it relates to serialization further below.

Run-End Encoded Layout# 运行结束编码布局#

New in version Arrow: Columnar Format 1.3
在 Arrow 版本中的新功能:列式格式 1.3

Run-end encoding (REE) is a variation of run-length encoding (RLE). These encodings are well-suited for representing data containing sequences of the same value, called runs. In run-end encoding, each run is represented as a value and an integer giving the index in the array where the run ends.

Any array can be run-end encoded. A run-end encoded array has no buffers by itself, but has two child arrays. The first child array, called the run ends array, holds either 16, 32, or 64-bit signed integers. The actual values of each run are held in the second child array. For the purposes of determining field names and schemas, these child arrays are prescribed the standard names of run_ends and values respectively.
任何数组都可以进行行末编码。行末编码的数组本身没有缓冲区,但有两个子数组。第一个子数组被称为行末数组,包含 16、32 或 64 位的有符号整数。每个运行的实际值都存储在第二个子数组中。为了确定字段名称和模式,这些子数组被规定为 run_ends 和 values 的标准名称。

The values in the first child array represent the accumulated length of all runs from the first to the current one, i.e. the logical index where the current run ends. This allows relatively efficient random access from a logical index using binary search. The length of an individual run can be determined by subtracting two adjacent values. (Contrast this with run-length encoding, in which the lengths of the runs are represented directly, and in which random access is less efficient.)

Note 注意

Because the run_ends child array cannot have nulls, it’s reasonable to consider why the run_ends are a child array instead of just a buffer, like the offsets for a Variable-size List Layout. This layout was considered, but it was decided to use the child arrays.
因为 run_ends 子数组不能有空值,所以有理由考虑为什么 run_ends 是子数组而不仅仅是一个缓冲区,就像可变大小列表布局的偏移量一样。考虑过这种布局,但最终决定使用子数组。

Child arrays allow us to keep the “logical length” (the decoded length) associated with the parent array and the “physical length” (the number of run ends) associated with the child arrays. If run_ends was a buffer in the parent array then the size of the buffer would be unrelated to the length of the array and this would be confusing.
子数组使我们能够保持与父数组相关的“逻辑长度”(解码长度)和与子数组相关的“物理长度”(运行结束的数量)。如果 run_ends 是父数组中的缓冲区,那么缓冲区的大小将与数组的长度无关,这将令人困惑。

A run must have a length of at least 1. This means the values in the run ends array all are positive and in strictly ascending order. A run end cannot be null.
跑步长度必须至少为 1。这意味着跑步结束数组中的所有值都是正数,并且严格按升序排列。跑步结束不能为 null。

The REE parent has no validity bitmap, and it’s null count field should always be 0. Null values are encoded as runs with the value null.
REE 父项没有有效性位图,其空值计数字段应始终为 0。空值被编码为值为空的运行。

As an example, you could have the following data:

type: Float32
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]

In Run-end-encoded form, this could appear as:

* Length: 7, Null count: 0
* Child Arrays:

  * run_ends (Int32):
    * Length: 3, Null count: 0 (Run Ends cannot be null)
    * Validity bitmap buffer: Not required (if it exists, it should be all 1s)
    * Values buffer

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
      | 4           | 6           | 7           | unspecified (padding) |

  * values (Float32):
    * Length: 3, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      | 00000101                 | 0 (padding)           |

    * Values buffer

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
      | 1.0         | unspecified | 2.0         | unspecified (padding) |

Buffer Listing for Each Layout#
每个布局的缓冲区列表 #

For the avoidance of ambiguity, we provide listing the order and type of memory buffers for each layout.

Buffer Layouts# 缓冲区布局 #

Layout Type 布局类型

Buffer 0 缓冲区 0

Buffer 1 缓冲区 1

Buffer 2 缓冲区 2

Variadic Buffers 可变参数缓冲区




Variable Binary 二进制变量




Variable Binary View 变量二进制视图







Fixed-size List 固定大小的列表




Sparse Union 稀疏联合

type ids 类型标识符

Dense Union 密集联盟

type ids 类型标识符





data (indices) 数据(索引)

Run-end encoded 运行结束编码

Logical Types# 逻辑类型 #

The Schema.fbs defines built-in logical types supported by the Arrow columnar format. Each logical type uses one of the above physical layouts. Nested logical types may have different physical layouts depending on the particular realization of the type.
Schema.fbs 定义了 Arrow 列式格式支持的内置逻辑类型。每种逻辑类型都使用上述物理布局之一。嵌套的逻辑类型可能会根据类型的特定实现有不同的物理布局。

We do not go into detail about the logical types definitions in this document as we consider Schema.fbs to be authoritative.
我们在此文档中并未详细讨论逻辑类型定义,因为我们认为 Schema.fbs 是权威的。

Serialization and Interprocess Communication (IPC)#

The primitive unit of serialized data in the columnar format is the “record batch”. Semantically, a record batch is an ordered collection of arrays, known as its fields, each having the same length as one another but potentially different data types. A record batch’s field names and types collectively form the batch’s schema.

In this section we define a protocol for serializing record batches into a stream of binary payloads and reconstructing record batches from these payloads without need for memory copying.

The columnar IPC protocol utilizes a one-way stream of binary messages of these types:
柱状 IPC 协议利用这些类型的单向二进制消息流:

  • Schema 模式

  • RecordBatch 记录批次

  • DictionaryBatch 字典批次

We specify a so-called encapsulated IPC message format which includes a serialized Flatbuffer type along with an optional message body. We define this message format before describing how to serialize each constituent IPC message type.
我们指定了一种所谓的封装 IPC 消息格式,其中包括一个序列化的 Flatbuffer 类型以及一个可选的消息主体。我们在描述如何序列化每个组成 IPC 消息类型之前定义了这种消息格式。

Encapsulated message format#

For simple streaming and file-based serialization, we define a “encapsulated” message format for interprocess communication. Such messages can be “deserialized” into in-memory Arrow array objects by examining only the message metadata without any need to copy or move any of the actual data.
对于简单的流式和基于文件的序列化,我们定义了一种“封装”的消息格式用于进程间通信。通过仅检查消息元数据,无需复制或移动任何实际数据,就可以将这些消息“反序列化”为内存中的 Arrow 数组对象。

The encapsulated binary message format is as follows:

  • A 32-bit continuation indicator. The value 0xFFFFFFFF indicates a valid message. This component was introduced in version 0.15.0 in part to address the 8-byte alignment requirement of Flatbuffers
    32 位继续指示器。值 0xFFFFFFFF 表示有效的消息。这个组件在 0.15.0 版本中引入,部分是为了解决 Flatbuffers 的 8 字节对齐要求。

  • A 32-bit little-endian length prefix indicating the metadata size
    表示元数据大小的 32 位小端长度前缀

  • The message metadata as using the Message type defined in Message.fbs
    消息元数据使用 Message.fbs 中定义的 Message 类型

  • Padding bytes to an 8-byte boundary
    将字节填充到 8 字节边界

  • The message body, whose length must be a multiple of 8 bytes
    消息体的长度必须是 8 字节的倍数

Schematically, we have: 示意性地,我们有:

<continuation: 0xFFFFFFFF>
<metadata_size: int32>
<metadata_flatbuffer: bytes>
<message body>

The complete serialized message must be a multiple of 8 bytes so that messages can be relocated between streams. Otherwise the amount of padding between the metadata and the message body could be non-deterministic.
完整的序列化消息必须是 8 字节的倍数,以便可以在流之间重新定位消息。否则,元数据和消息主体之间的填充量可能是不确定的。

The metadata_size includes the size of the Message plus padding. The metadata_flatbuffer contains a serialized Message Flatbuffer value, which internally includes:
metadata_size 包括 Message 的大小加上填充。 metadata_flatbuffer 包含一个序列化的 Message Flatbuffer 值,其内部包括:

  • A version number 版本号

  • A particular message value (one of Schema, RecordBatch, or DictionaryBatch)
    特定的消息值( SchemaRecordBatchDictionaryBatch 中的一个)

  • The size of the message body

  • A custom_metadata field for any application-supplied metadata
    一个 custom_metadata 字段,用于任何应用程序提供的元数据

When read from an input stream, generally the Message metadata is initially parsed and validated to obtain the body size. Then the body can be read.
从输入流读取时,通常首先解析并验证 Message 元数据以获取主体大小。然后可以读取主体。

Schema message# 模式消息#

The Flatbuffers files Schema.fbs contains the definitions for all built-in logical data types and the Schema metadata type which represents the schema of a given record batch. A schema consists of an ordered sequence of fields, each having a name and type. A serialized Schema does not contain any data buffers, only type metadata.
Flatbuffers 文件 Schema.fbs 包含所有内置逻辑数据类型和 Schema 元数据类型的定义,后者代表给定记录批次的模式。模式由一系列有序字段组成,每个字段都有一个名称和类型。序列化的 Schema 不包含任何数据缓冲区,只包含类型元数据。

The Field Flatbuffers type contains the metadata for a single array. This includes:
Field Flatbuffers 类型包含单个数组的元数据。这包括:

  • The field’s name 字段的名称

  • The field’s logical type 字段的逻辑类型

  • Whether the field is semantically nullable. While this has no bearing on the array’s physical layout, many systems distinguish nullable and non-nullable fields and we want to allow them to preserve this metadata to enable faithful schema round trips.

  • A collection of child Field values, for nested types
    一组嵌套类型的子 Field 值的集合

  • A dictionary property indicating whether the field is dictionary-encoded or not. If it is, a dictionary “id” is assigned to allow matching a subsequent dictionary IPC message with the appropriate field.
    一个 dictionary 属性,用于指示字段是否进行了字典编码。如果是,将分配一个字典“id”以允许将后续的字典 IPC 消息与适当的字段进行匹配。

We additionally provide both schema-level and field-level custom_metadata attributes allowing for systems to insert their own application defined metadata to customize behavior.
我们还提供了模式级别和字段级别的 custom_metadata 属性,允许系统插入自己的应用程序定义的元数据以定制行为。

RecordBatch message# 记录批次消息#

A RecordBatch message contains the actual data buffers corresponding to the physical memory layout determined by a schema. The metadata for this message provides the location and size of each buffer, permitting Array data structures to be reconstructed using pointer arithmetic and thus no memory copying.
RecordBatch 消息包含与模式确定的物理内存布局相对应的实际数据缓冲区。此消息的元数据提供了每个缓冲区的位置和大小,允许使用指针算术重构数组数据结构,因此不需要复制内存。

The serialized form of the record batch is the following:

  • The data header, defined as the RecordBatch type in Message.fbs.
    data header ,在 Message.fbs 中定义为 RecordBatch 类型。

  • The body, a flat sequence of memory buffers written end-to-end with appropriate padding to ensure a minimum of 8-byte alignment
    body ,一个平坦的内存缓冲区序列,端对端编写并适当填充以确保至少 8 字节的对齐

The data header contains the following:

  • The length and null count for each flattened field in the record batch

  • The memory offset and length of each constituent Buffer in the record batch’s body
    记录批次主体中每个构成部分 Buffer 的内存偏移量和长度

Fields and buffers are flattened by a pre-order depth-first traversal of the fields in the record batch. For example, let’s consider the schema

col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
col2: Utf8

The flattened version of this is:

FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'

For the buffers produced, we would have the following (refer to the table above):

buffer 0: field 0 validity
buffer 1: field 1 validity
buffer 2: field 1 values
buffer 3: field 2 validity
buffer 4: field 2 offsets
buffer 5: field 3 validity
buffer 6: field 3 values
buffer 7: field 4 validity
buffer 8: field 4 values
buffer 9: field 5 validity
buffer 10: field 5 offsets
buffer 11: field 5 data

The Buffer Flatbuffers value describes the location and size of a piece of memory. Generally these are interpreted relative to the encapsulated message format defined below.
Buffer Flatbuffers 值描述了一块内存的位置和大小。通常,这些相对于下面定义的封装消息格式进行解释。

The size field of Buffer is not required to account for padding bytes. Since this metadata can be used to communicate in-memory pointer addresses between libraries, it is recommended to set size to the actual memory size rather than the padded size.
size 字段的 Buffer 不需要考虑填充字节。由于此元数据可用于在库之间传递内存指针地址,因此建议将 size 设置为实际内存大小,而不是填充大小。

Variadic buffers# 可变参数缓冲区 #

New in version Arrow: Columnar Format 1.4
在 Arrow 版本中的新特性:列式格式 1.4

Some types such as Utf8View are represented using a variable number of buffers. For each such Field in the pre-ordered flattened logical schema, there will be an entry in variadicBufferCounts to indicate the number of variadic buffers which belong to that Field in the current RecordBatch.
某些类型,如 Utf8View,是通过可变数量的缓冲区来表示的。对于预排序的扁平化逻辑模式中的每个此类字段,在 variadicBufferCounts 中都会有一个条目,用来指示当前 RecordBatch 中属于该字段的可变缓冲区的数量。

For example, consider the schema

col1: Struct<a: Int32, b: BinaryView, c: Float64>
col2: Utf8View

This has two fields with variadic buffers, so variadicBufferCounts will have two entries in each RecordBatch. For a RecordBatch of this schema with variadicBufferCounts = [3, 2], the flattened buffers would be:
这有两个字段带有可变缓冲区,所以 variadicBufferCounts 将在每个 RecordBatch 中有两个条目。对于具有 variadicBufferCounts = [3, 2] 的此模式的 RecordBatch,扁平化的缓冲区将是:

buffer 0:  col1    validity
buffer 1:  col1.a  validity
buffer 2:  col1.a  values
buffer 3:  col1.b  validity
buffer 4:  col1.b  views
buffer 5:  col1.b  data
buffer 6:  col1.b  data
buffer 7:  col1.b  data
buffer 8:  col1.c  validity
buffer 9:  col1.c  values
buffer 10: col2    validity
buffer 11: col2    views
buffer 12: col2    data
buffer 13: col2    data

Byte Order (Endianness)#

The Arrow format is little endian by default.
Arrow 格式默认为小端模式。

Serialized Schema metadata has an endianness field indicating endianness of RecordBatches. Typically this is the endianness of the system where the RecordBatch was generated. The main use case is exchanging RecordBatches between systems with the same Endianness. At first we will return an error when trying to read a Schema with an endianness that does not match the underlying system. The reference implementation is focused on Little Endian and provides tests for it. Eventually we may provide automatic conversion via byte swapping.
序列化的 Schema 元数据有一个表示 RecordBatches 字节顺序的字段。这通常是生成 RecordBatch 的系统的字节顺序。主要用例是在具有相同字节顺序的系统之间交换 RecordBatches。起初,当试图读取一个与底层系统字节顺序不匹配的 Schema 时,我们会返回一个错误。参考实现主要关注小端字节顺序,并为其提供测试。最终,我们可能会通过字节交换提供自动转换。

IPC Streaming Format# IPC 流媒体格式#

We provide a streaming protocol or “format” for record batches. It is presented as a sequence of encapsulated messages, each of which follows the format above. The schema comes first in the stream, and it is the same for all of the record batches that follow. If any fields in the schema are dictionary-encoded, one or more DictionaryBatch messages will be included. DictionaryBatch and RecordBatch messages may be interleaved, but before any dictionary key is used in a RecordBatch it should be defined in a DictionaryBatch.
我们提供一种用于记录批次的流协议或“格式”。它被呈现为一系列封装的消息,每个消息都遵循上述格式。模式首先出现在流中,并且对于所有后续的记录批次都是相同的。如果模式中的任何字段都是字典编码的,则会包含一个或多个 DictionaryBatch 消息。 DictionaryBatchRecordBatch 消息可能会交错,但在 RecordBatch 中使用任何字典键之前,应在 DictionaryBatch 中定义。

<EOS [optional]: 0xFFFFFFFF 0x00000000>

Note 注意

An edge-case for interleaved dictionary and record batches occurs when the record batches contain dictionary encoded arrays that are completely null. In this case, the dictionary for the encoded column might appear after the first record batch.

When a stream reader implementation is reading a stream, after each message, it may read the next 8 bytes to determine both if the stream continues and the size of the message metadata that follows. Once the message flatbuffer is read, you can then read the message body.
当流读取器实现正在读取一个流时,每读取一条消息后,它可能会读取接下来的 8 个字节,以确定流是否继续以及接下来的消息元数据的大小。一旦读取了消息的 flatbuffer,您就可以读取消息体了。

The stream writer can signal end-of-stream (EOS) either by writing 8 bytes containing the 4-byte continuation indicator (0xFFFFFFFF) followed by 0 metadata length (0x00000000) or closing the stream interface. We recommend the “.arrows” file extension for the streaming format although in many cases these streams will not ever be stored as files.
流写入器可以通过写入包含 4 字节续行指示符( 0xFFFFFFFF )的 8 字节,后跟 0 元数据长度( 0x00000000 ),或者关闭流接口来发出流结束(EOS)信号。我们建议使用“.arrows”文件扩展名作为流格式,尽管在许多情况下,这些流不会被存储为文件。

IPC File Format# IPC 文件格式#

We define a “file format” supporting random access that is an extension of the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access to any record batch in the file. See File.fbs for the precise details of the file footer.
我们定义了一种支持随机访问的“文件格式”,它是流格式的扩展。文件以魔术字符串 ARROW1 (加上填充)开始和结束。文件中的后续内容与流格式相同。在文件的末尾,我们写入一个包含模式的冗余副本(这是流格式的一部分)以及文件中每个数据块的内存偏移量和大小的页脚。这使得可以随机访问文件中的任何记录批次。请参阅 File.fbs 以获取文件页脚的精确细节。

Schematically we have: 我们在图示中有:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<FOOTER SIZE: int32>
<magic number "ARROW1">

In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somewhere in the file. Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). Delta dictionaries are applied in the order they appear in the file footer. We recommend the “.arrow” extension for files created with this format. Note that files created with this format are sometimes called “Feather V2” or with the “.feather” extension, the name and the extension derived from “Feather (V1)”, which was a proof of concept early in the Arrow project for language-agnostic fast data frame storage for Python (pandas) and R.
在文件格式中,没有要求字典键必须在 DictionaryBatch 中定义,然后在 RecordBatch 中使用,只要键在文件的某个地方定义即可。此外,每个字典 ID 不能有多于一个的非增量字典批次(即不支持字典替换)。增量字典按照它们在文件页脚中出现的顺序应用。我们建议使用“.arrow”扩展名创建的文件。请注意,使用此格式创建的文件有时被称为“Feather V2”或带有“.feather”扩展名,这个名称和扩展名源自“Feather (V1)”,这是 Arrow 项目早期的一个概念验证,用于 Python(pandas)和 R 的语言无关的快速数据帧存储。

Dictionary Messages# 词典信息 #

Dictionaries are written in the stream and file formats as a sequence of record batches, each having a single field. The complete semantic schema for a sequence of record batches, therefore, consists of the schema along with all of the dictionaries. The dictionary types are found in the schema, so it is necessary to read the schema to first determine the dictionary types so that the dictionaries can be properly interpreted:

table DictionaryBatch {
  id: long;
  data: RecordBatch;
  isDelta: boolean = false;

The dictionary id in the message metadata can be referenced one or more times in the schema, so that dictionaries can even be used for multiple fields. See the Dictionary-encoded Layout section for more about the semantics of dictionary-encoded data.
消息元数据中的字典 id 可以在模式中被引用一次或多次,以便字典甚至可以用于多个字段。有关字典编码数据语义的更多信息,请参阅字典编码布局部分。

The dictionary isDelta flag allows existing dictionaries to be expanded for future record batch materializations. A dictionary batch with isDelta set indicates that its vector should be concatenated with those of any previous batches with the same id. In a stream which encodes one column, the list of strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a delta dictionary batch could take the form:
字典 isDelta 标志允许现有的字典扩展,以便于未来的记录批次实现。带有 isDelta 设置的字典批次表示,其向量应与具有相同 id 的任何先前批次的向量进行连接。在编码一个列的流中,字符串列表 ["A", "B", "C", "B", "D", "C", "E", "A"] ,带有一个增量字典批次可能会采取以下形式:

(0) "A"
(1) "B"
(2) "C"


(3) "D"
(4) "E"


Alternatively, if isDelta is set to false, then the dictionary replaces the existing dictionary for the same ID. Using the same example as above, an alternate encoding could be:
或者,如果 isDelta 设置为假,则该字典将替换具有相同 ID 的现有字典。使用上述相同的例子,一个替代的编码可能是:

(0) "A"
(1) "B"
(2) "C"


(0) "A"
(1) "C"
(2) "D"
(3) "E"


Custom Application Metadata#
自定义应用程序元数据 #

We provide a custom_metadata field at three levels to provide a mechanism for developers to pass application-specific metadata in Arrow protocol messages. This includes Field, Schema, and Message.
我们在三个级别提供了一个 custom_metadata 字段,以便为开发人员在 Arrow 协议消息中传递特定于应用程序的元数据提供一种机制。这包括 FieldSchemaMessage

The colon symbol : is to be used as a namespace separator. It can be used multiple times in a key.
冒号符号 : 将被用作命名空间分隔符。它可以在一个键中多次使用。

The ARROW pattern is a reserved namespace for internal Arrow use in the custom_metadata fields. For example, ARROW:extension:name.
ARROW 模式是 Arrow 内部使用的保留命名空间,在 custom_metadata 字段中使用。例如, ARROW:extension:name

Extension Types# 扩展类型 #

User-defined “extension” types can be defined setting certain KeyValue pairs in custom_metadata in the Field metadata structure. These extension keys are:
用户可以通过在 Field 元数据结构中的 custom_metadata 中设置某些 KeyValue 对来定义自定义的“扩展”类型。这些扩展键是:

  • 'ARROW:extension:name' for the string name identifying the custom data type. We recommend that you use a “namespace”-style prefix for extension type names to minimize the possibility of conflicts with multiple Arrow readers and writers in the same application. For example, use myorg.name_of_type instead of simply name_of_type
    用于标识自定义数据类型的字符串名称。我们建议您使用“命名空间”风格的前缀来命名扩展类型,以最小化在同一应用程序中的多个 Arrow 读取器和写入器之间的冲突可能性。例如,使用 myorg.name_of_type 而不仅仅是 name_of_type

  • 'ARROW:extension:metadata' for a serialized representation of the ExtensionType necessary to reconstruct the custom type
    'ARROW:extension:metadata' 用于序列化表示 ExtensionType ,以重构自定义类型

Note 注意

Extension names beginning with arrow. are reserved for canonical extension types, they should not be used for third-party extension types.
arrow. 开头的扩展名是为规范扩展类型保留的,不应用于第三方扩展类型。

This extension metadata can annotate any of the built-in Arrow logical types. The intent is that an implementation that does not support an extension type can still handle the underlying data. For example a 16-byte UUID value could be embedded in FixedSizeBinary(16), and implementations that do not have this extension type can still work with the underlying binary values and pass along the custom_metadata in subsequent Arrow protocol messages.
此扩展元数据可以注释 Arrow 内置的任何逻辑类型。其意图是,即使某个实现不支持扩展类型,仍然可以处理底层数据。例如,16 字节的 UUID 值可以嵌入 FixedSizeBinary(16) 中,而不具有此扩展类型的实现仍然可以处理底层的二进制值,并在后续的 Arrow 协议消息中传递 custom_metadata

Extension types may or may not use the 'ARROW:extension:metadata' field. Let’s consider some example extension types:
扩展类型可能会也可能不会使用 'ARROW:extension:metadata' 字段。让我们考虑一些示例扩展类型:

  • uuid represented as FixedSizeBinary(16) with empty metadata
    uuid 表示为 FixedSizeBinary(16) ,元数据为空

  • latitude-longitude represented as struct<latitude: double, longitude: double>, and empty metadata
    latitude-longitude 表示为 struct<latitude: double, longitude: double> ,并且元数据为空

  • tensor (multidimensional array) stored as Binary values and having serialized metadata indicating the data type and shape of each value. This could be JSON like {'type': 'int8', 'shape': [4, 5]} for a 4x5 cell tensor.
    tensor (多维数组)以 Binary 值的形式存储,并具有序列化的元数据,指示每个值的数据类型和形状。这可以像 {'type': 'int8', 'shape': [4, 5]} 一样的 JSON,用于 4x5 单元的张量。

  • trading-time represented as Timestamp with serialized metadata indicating the market trading calendar the data corresponds to
    trading-time 表示为 Timestamp ,序列化元数据指示数据对应的市场交易日历

Implementation guidelines#

An execution engine (or framework, or UDF executor, or storage engine, etc) can implement only a subset of the Arrow spec and/or extend it given the following constraints:
执行引擎(或框架,或 UDF 执行器,或存储引擎等)只能实现 Arrow 规范的一个子集,并/或在满足以下约束的情况下对其进行扩展:

Implementing a subset of the spec#
实现规范的一个子集 #

  • If only producing (and not consuming) arrow vectors: Any subset of the vector spec and the corresponding metadata can be implemented.

  • If consuming and producing vectors: There is a minimal subset of vectors to be supported. Production of a subset of vectors and their corresponding metadata is always fine. Consumption of vectors should at least convert the unsupported input vectors to the supported subset (for example Timestamp.millis to timestamp.micros or int32 to int64).
    如果消费和生产向量:需要支持的向量有一个最小的子集。生产一个向量的子集及其相应的元数据总是可以的。消费向量至少应将不支持的输入向量转换为支持的子集(例如,将 Timestamp.millis 转换为 timestamp.micros 或 int32 转换为 int64)。

Extensibility# 可扩展性 #

An execution engine implementor can also extend their memory representation with their own vectors internally as long as they are never exposed. Before sending data to another system expecting Arrow data, these custom vectors should be converted to a type that exist in the Arrow spec.
执行引擎实现者也可以使用他们自己的向量扩展内部的内存表示,只要他们永远不被暴露。在将数据发送到期望接收 Arrow 数据的另一个系统之前,这些自定义向量应该被转换为 Arrow 规范中存在的类型。