Arrow Columnar Format# Arrow 列格式#
Version: 1.4 版本:1.4
The Arrow columnar format includes a language-agnostic in-memory
data structure specification, metadata serialization, and a protocol
for serialization and generic data transport.
Arrow 列式格式包括语言无关的内存数据结构规范、元数据序列化以及序列化和通用数据传输的协议。
This document is intended to provide adequate detail to create a new
implementation of the columnar format without the aid of an existing
implementation. We utilize Google’s Flatbuffers project for
metadata serialization, so it will be necessary to refer to the
project’s Flatbuffers protocol definition files
while reading this document.
本文档旨在提供足够的细节,以便在没有现有实现的帮助下创建新的列式格式实现。我们使用 Google 的 Flatbuffers 项目进行元数据序列化,因此在阅读本文档时,将需要参考该项目的 Flatbuffers 协议定义文件。
The columnar format has some key features:
列式格式具有一些关键特性:
Data adjacency for sequential access (scans)
顺序访问(扫描)的数据邻接性O(1) (constant-time) random access
O(1)(常数时间)随机访问SIMD and vectorization-friendly
SIMD 和向量化友好Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory
可在无需“指针扭转”的情况下进行重新定位,允许在共享内存中进行真正的零复制访问
The Arrow columnar format provides analytical performance and data
locality guarantees in exchange for comparatively more expensive
mutation operations. This document is concerned only with in-memory
data representation and serialization details; issues such as
coordinating mutation of data structures are left to be handled by
implementations.
Arrow 列式格式提供了分析性能和数据局部性保证,以换取相对更昂贵的变更操作。本文档仅关注内存数据表示和序列化细节;诸如协调数据结构变更的问题则留给实现来处理。
Terminology# 术语 #
Since different projects have used different words to describe various
concepts, here is a small glossary to help disambiguate.
由于不同的项目使用了不同的词语来描述各种概念,这里有一个小词汇表来帮助消除歧义。
Array or Vector: a sequence of values with known length all having the same type. These terms are used interchangeably in different Arrow implementations, but we use “array” in this document.
数组或向量:具有已知长度且所有值类型相同的值序列。这些术语在不同的 Arrow 实现中可以互换使用,但在本文档中我们使用“数组”。Slot: a single logical value in an array of some particular data type
插槽:某一特定数据类型数组中的单个逻辑值Buffer or Contiguous memory region: a sequential virtual address space with a given length. Any byte can be reached via a single pointer offset less than the region’s length.
缓冲区或连续内存区域:具有给定长度的顺序虚拟地址空间。任何字节都可以通过小于区域长度的单个指针偏移来访问。Physical Layout: The underlying memory layout for an array without taking into account any value semantics. For example, a 32-bit signed integer array and 32-bit floating point array have the same layout.
物理布局:不考虑任何值语义的情况下,数组的基础内存布局。例如,32 位有符号整数数组和 32 位浮点数数组具有相同的布局。Parent and child arrays: names to express relationships between physical value arrays in a nested type structure. For example, a
List<T>
-type parent array has a T-type array as its child (see more on lists below).
父子数组:用于表示嵌套类型结构中物理值数组之间关系的名称。例如,一个List<T>
类型的父数组拥有一个 T 类型的子数组(有关列表的更多信息请参见下文)。Primitive type: a data type having no child types. This includes such types as fixed bit-width, variable-size binary, and null types.
原始类型:没有子类型的数据类型。这包括固定位宽、可变大小二进制和空类型等类型。Nested type: a data type whose full structure depends on one or more other child types. Two fully-specified nested types are equal if and only if their child types are equal. For example,
List<U>
is distinct fromList<V>
iff U and V are different types.
嵌套类型:其完整结构依赖于一个或多个其他子类型的数据类型。如果且仅当它们的子类型相等时,两个完全指定的嵌套类型才相等。例如,当且仅当 U 和 V 是不同的类型时,List<U>
才与List<V>
不同。Logical type: An application-facing semantic value type that is implemented using some physical layout. For example, Decimal values are stored as 16 bytes in a fixed-size binary layout. Similarly, strings can be stored as
List<1-byte>
. A timestamp may be stored as 64-bit fixed-size layout.
逻辑类型:一种使用某种物理布局实现的面向应用的语义值类型。例如,十进制值以固定大小的 16 字节二进制布局存储。同样,字符串可以存储为List<1-byte>
。时间戳可能以 64 位固定大小布局存储。
Physical Memory Layout# 物理内存布局 #
Arrays are defined by a few pieces of metadata and data:
数组由一些元数据和数据定义:
A logical data type. 逻辑数据类型。
A sequence of buffers. 一系列的缓冲区。
A length as a 64-bit signed integer. Implementations are permitted to be limited to 32-bit lengths, see more on this below.
长度作为 64 位有符号整数。实现可以限制为 32 位长度,有关此的更多信息请参见下文。A null count as a 64-bit signed integer.
一个空计数作为一个 64 位有符号整数。An optional dictionary, for dictionary-encoded arrays.
用于字典编码数组的可选字典。
Nested arrays additionally have a sequence of one or more sets of
these items, called the child arrays.
嵌套数组还有一系列一个或多个这些项目的集合,称为子数组。
Each logical data type has a well-defined physical layout. Here are
the different physical layouts defined by Arrow:
每种逻辑数据类型都有明确定义的物理布局。以下是 Arrow 定义的不同物理布局:
Primitive (fixed-size): a sequence of values each having the same byte or bit width
原始(固定大小):一系列每个具有相同字节或位宽度的值Variable-size Binary: a sequence of values each having a variable byte length. Two variants of this layout are supported using 32-bit and 64-bit length encoding.
可变大小二进制:一系列每个具有可变字节长度的值。支持使用 32 位和 64 位长度编码的这种布局的两种变体。View of Variable-size Binary: a sequence of values each having a variable byte length. In contrast to Variable-size Binary, the values of this layout are distributed across potentially multiple buffers instead of densely and sequentially packed in a single buffer.
可变大小二进制视图:每个值都有可变的字节长度的值序列。与可变大小二进制相比,此布局的值可能分布在多个缓冲区中,而不是密集且顺序地打包在单个缓冲区中。Fixed-size List: a nested layout where each value has the same number of elements taken from a child data type.
固定大小列表:一个嵌套布局,其中每个值都从子数据类型中取得相同数量的元素。Variable-size List: a nested layout where each value is a variable-length sequence of values taken from a child data type. Two variants of this layout are supported using 32-bit and 64-bit length encoding.
可变大小列表:一种嵌套布局,其中每个值都是从子数据类型中取得的可变长度值序列。支持使用 32 位和 64 位长度编码的这种布局的两种变体。View of Variable-size List: a nested layout where each value is a variable-length sequence of values taken from a child data type. This layout differs from Variable-size List by having an additional buffer containing the sizes of each list value. This removes a constraint on the offsets buffer — it does not need to be in order.
可变大小列表视图:一个嵌套布局,其中每个值都是从子数据类型中取得的可变长度值序列。这种布局与可变大小列表的不同之处在于,它有一个额外的缓冲区,包含每个列表值的大小。这消除了对偏移缓冲区的约束——它不需要按顺序排列。Struct: a nested layout consisting of a collection of named child fields each having the same length but possibly different types.
结构体:一个嵌套布局,由一组具有相同长度但可能类型不同的命名子字段组成。Sparse and Dense Union: a nested layout representing a sequence of values, each of which can have type chosen from a collection of child array types.
稀疏和密集联合:表示一系列值的嵌套布局,每个值可以从子数组类型的集合中选择类型。Dictionary-Encoded: a layout consisting of a sequence of integers (any bit-width) which represent indexes into a dictionary which could be of any type.
字典编码:由一系列整数(任何位宽)组成的布局,这些整数代表了对任何类型的字典的索引。Run-End Encoded (REE): a nested layout consisting of two child arrays, one representing values, and one representing the logical index where the run of a corresponding value ends.
行尾编码(REE):一种嵌套布局,由两个子数组组成,一个代表值,一个代表对应值的运行结束的逻辑索引。Null: a sequence of all null values, having null logical type
空值:所有空值的序列,具有空的逻辑类型
The Arrow columnar memory layout only applies to data and not
metadata. Implementations are free to represent metadata in-memory
in whichever form is convenient for them. We handle metadata
serialization in an implementation-independent way using
Flatbuffers, detailed below.
Arrow 列式内存布局仅适用于数据,而不适用于元数据。实现可以自由地以对它们方便的形式表示内存中的元数据。我们使用 Flatbuffers 以与实现无关的方式处理元数据序列化,详细信息如下。
Buffer Alignment and Padding#
缓冲对齐和填充 #
Implementations are recommended to allocate memory on aligned
addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a
length that is a multiple of 8 or 64 bytes. When serializing Arrow
data for interprocess communication, these alignment and padding
requirements are enforced. If possible, we suggest that you prefer
using 64-byte alignment and padding. Unless otherwise noted, padded
bytes do not need to have a specific value.
建议实现在对齐的地址(8 或 64 字节的倍数)上分配内存,并填充(超额分配)到 8 或 64 字节的倍数长度。在序列化 Arrow 数据进行进程间通信时,将强制执行这些对齐和填充要求。如果可能,我们建议您更倾向于使用 64 字节的对齐和填充。除非另有说明,填充字节不需要具有特定值。
The alignment requirement follows best practices for optimized memory
access:
对齐要求遵循优化内存访问的最佳实践:
Elements in numeric arrays will be guaranteed to be retrieved via aligned access.
数字数组中的元素将保证通过对齐访问来检索。On some architectures alignment can help limit partially used cache lines.
在某些架构中,对齐可以帮助限制部分使用的缓存行。
The recommendation for 64 byte alignment comes from the Intel
performance guide that recommends alignment of memory to match SIMD
register width. The specific padding length was chosen because it
matches the largest SIMD instruction registers available on widely
deployed x86 architecture (Intel AVX-512).
对 64 字节对齐的推荐来自英特尔性能指南,该指南建议将内存对齐以匹配 SIMD 寄存器宽度。选择特定的填充长度是因为它与广泛部署的 x86 架构上可用的最大 SIMD 指令寄存器(英特尔 AVX-512)相匹配。
The recommended padding of 64 bytes allows for using SIMD
instructions consistently in loops without additional conditional
checks. This should allow for simpler, efficient and CPU
cache-friendly code. In other words, we can load the entire 64-byte
buffer into a 512-bit wide SIMD register and get data-level
parallelism on all the columnar values packed into the 64-byte
buffer. Guaranteed padding can also allow certain compilers to
generate more optimized code directly (e.g. One can safely use Intel’s
-qopt-assume-safe-padding
).
推荐的 64 字节填充允许在循环中一致使用 SIMD 指令,无需额外的条件检查。这应该可以实现更简单、高效且 CPU 缓存友好的代码。换句话说,我们可以将整个 64 字节缓冲区加载到 512 位宽的 SIMD 寄存器中,并在所有打包到 64 字节缓冲区的列值上获取数据级并行性。保证的填充也可以让某些编译器直接生成更优化的代码(例如,可以安全地使用 Intel 的 -qopt-assume-safe-padding
)。
Array lengths# 数组长度 #
Array lengths are represented in the Arrow metadata as a 64-bit signed
integer. An implementation of Arrow is considered valid even if it only
supports lengths up to the maximum 32-bit signed integer, though. If using
Arrow in a multi-language environment, we recommend limiting lengths to
2 31 - 1 elements or less. Larger data sets can be represented using
multiple array chunks.
数组长度在 Arrow 元数据中以 64 位有符号整数表示。即使 Arrow 的实现只支持最大的 32 位有符号整数长度,也被认为是有效的。如果在多语言环境中使用 Arrow,我们建议将长度限制为 2 31 - 1 个元素或更少。可以使用多个数组块来表示更大的数据集。
Null count# 空值计数 #
The number of null value slots is a property of the physical array and
considered part of the data structure. The null count is represented
in the Arrow metadata as a 64-bit signed integer, as it may be as
large as the array length.
空值插槽的数量是物理数组的属性,被视为数据结构的一部分。空值计数在 Arrow 元数据中以 64 位有符号整数表示,因为它可能与数组长度一样大。
Validity bitmaps# 有效位图 #
Any value in an array may be semantically null, whether primitive or nested
type.
数组中的任何值,无论是基本类型还是嵌套类型,都可能在语义上为空。
All array types, with the exception of union types (more on these later),
utilize a dedicated memory buffer, known as the validity (or “null”) bitmap, to
encode the nullness or non-nullness of each value slot. The validity bitmap
must be large enough to have at least 1 bit for each array slot.
所有数组类型,除联合类型(稍后将详细介绍)外,都使用一个专用的内存缓冲区,称为有效性(或“空”)位图,来编码每个值槽的空或非空状态。有效性位图必须足够大,至少为每个数组槽提供 1 位。
Whether any array slot is valid (non-null) is encoded in the respective bits of
this bitmap. A 1 (set bit) for index j
indicates that the value is not null,
while a 0 (bit not set) indicates that it is null. Bitmaps are to be
initialized to be all unset at allocation time (this includes padding):
此位图的各个位编码了数组插槽是否有效(非空)。索引 j
的 1(设定位)表示该值不为空,而 0(未设定位)表示该值为空。在分配时间,位图应初始化为全部未设定(包括填充):
is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))
We use least-significant bit (LSB) numbering (also known as
bit-endianness). This means that within a group of 8 bits, we read
right-to-left:
我们使用最低有效位(LSB)编号(也称为位端序)。这意味着在一组 8 位中,我们从右向左阅读:
values = [0, 1, null, 2, null, 3]
bitmap
j mod 8 7 6 5 4 3 2 1 0
0 0 1 0 1 0 1 1
Arrays having a 0 null count may choose to not allocate the validity
bitmap; how this is represented depends on the implementation (for
example, a C++ implementation may represent such an “absent” validity
bitmap using a NULL pointer). Implementations may choose to always allocate
a validity bitmap anyway as a matter of convenience. Consumers of Arrow
arrays should be ready to handle those two possibilities.
具有 0 空值计数的数组可以选择不分配有效性位图;这如何表示取决于实现(例如,C++实现可能使用 NULL 指针表示这样一个“缺失”的有效性位图)。实现可能会选择始终分配一个有效性位图,以方便为主。Arrow 数组的消费者应准备好处理这两种可能性。
Nested type arrays (except for union types as noted above) have their own
top-level validity bitmap and null count, regardless of the null count and
valid bits of their child arrays.
嵌套类型数组(除了上述的联合类型)有自己的顶级有效位图和空值计数,无论其子数组的空值计数和有效位是什么。
Array slots which are null are not required to have a particular value;
any “masked” memory can have any value and need not be zeroed, though
implementations frequently choose to zero memory for null values.
数组的空槽位并不需要有特定的值;任何“屏蔽”的内存可以有任何值,不必为零,尽管实现时常常选择为零值的内存进行清零。
Fixed-size Primitive Layout#
固定大小的原始布局 #
A primitive value array represents an array of values each having the
same physical slot width typically measured in bytes, though the spec
also provides for bit-packed types (e.g. boolean values encoded in
bits).
原始值数组代表一个值数组,每个值具有相同的物理槽宽度,通常以字节为单位,尽管规范也提供了位打包类型(例如,以位编码的布尔值)。
Internally, the array contains a contiguous memory buffer whose total
size is at least as large as the slot width multiplied by the array
length. For bit-packed types, the size is rounded up to the nearest
byte.
在内部,数组包含一个连续的内存缓冲区,其总大小至少与插槽宽度乘以数组长度一样大。对于位打包类型,大小向最近的字节四舍五入。
The associated validity bitmap is contiguously allocated (as described
above) but does not need to be adjacent in memory to the values
buffer.
关联的有效性位图是连续分配的(如上所述),但不需要与值缓冲区在内存中相邻。
Example Layout: Int32 Array
示例布局:Int32 数组
For example a primitive array of int32s:
例如一个原始的 int32s 数组:
[1, null, 2, 4, 8]
Would look like: 看起来像:
* Length: 5, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |
* Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 1 | unspecified | 2 | 4 | 8 | unspecified (padding) |
Example Layout: Non-null int32 Array
示例布局:非空 int32 数组
[1, 2, 3, 4, 8]
has two possible layouts:
[1, 2, 3, 4, 8]
有两种可能的布局:
* Length: 5, Null count: 0
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011111 | 0 (padding) |
* Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 1 | 2 | 3 | 4 | 8 | unspecified (padding) |
or with the bitmap elided:
或省略位图:
* Length 5, Null count: 0
* Validity bitmap buffer: Not required
* Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | Bytes 20-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 1 | 2 | 3 | 4 | 8 | unspecified (padding) |
Variable-size Binary Layout#
可变大小的二进制布局#
Each value in this layout consists of 0 or more bytes. While primitive
arrays have a single values buffer, variable-size binary have an
offsets buffer and data buffer.
此布局中的每个值由 0 个或更多字节组成。虽然原始数组只有一个值缓冲区,但可变大小的二进制有一个偏移缓冲区和数据缓冲区。
The offsets buffer contains length + 1 signed integers (either
32-bit or 64-bit, depending on the logical type), which encode the
start position of each slot in the data buffer. The length of the
value in each slot is computed using the difference between the offset
at that slot’s index and the subsequent offset. For example, the
position and length of slot j is computed as:
偏移量缓冲区包含长度+1 的有符号整数(根据逻辑类型,可以是 32 位或 64 位),这些整数编码了数据缓冲区中每个插槽的起始位置。每个插槽中的值的长度是通过计算该插槽索引处的偏移量和后续偏移量之间的差值来计算的。例如,插槽 j 的位置和长度是这样计算的:
slot_position = offsets[j]
slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length)
It should be noted that a null value may have a positive slot length.
That is, a null value may occupy a non-empty memory space in the data
buffer. When this is true, the content of the corresponding memory space
is undefined.
应当注意,空值可能具有正的插槽长度。也就是说,空值可能在数据缓冲区中占据非空的内存空间。当这种情况成立时,相应内存空间的内容是未定义的。
Offsets must be monotonically increasing, that is offsets[j+1] >= offsets[j]
for 0 <= j < length
, even for null slots. This property ensures the
location for all values is valid and well defined.
偏移量必须单调增加,即使对于空槽, 0 <= j < length
的 offsets[j+1] >= offsets[j]
也是如此。这个属性确保所有值的位置都是有效的和明确的。
Generally the first slot in the offsets array is 0, and the last slot
is the length of the values array. When serializing this layout, we
recommend normalizing the offsets to start at 0.
通常,偏移数组的第一个插槽为 0,最后一个插槽是值数组的长度。在序列化此布局时,我们建议将偏移量标准化为从 0 开始。
Example Layout: ``VarBinary``
``VarBinary``
['joe', null, null, 'mark']
will be represented as follows:
将以如下方式表示:
* Length: 4, Null count: 2
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001001 | 0 (padding) |
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|----------------|-----------------------|
| 0, 3, 3, 3, 7 | unspecified (padding) |
* Value buffer:
| Bytes 0-6 | Bytes 7-63 |
|----------------|-----------------------|
| joemark | unspecified (padding) |
Variable-size Binary View Layout#
可变大小的二进制视图布局#
New in version Arrow: Columnar Format 1.4
在 Arrow 版本中的新功能:列式格式 1.4
Each value in this layout consists of 0 or more bytes. These bytes’
locations are indicated using a views buffer, which may point to one
of potentially several data buffers or may contain the characters
inline.
此布局中的每个值由 0 个或更多字节组成。这些字节的位置由视图缓冲区指示,该缓冲区可能指向潜在的多个数据缓冲区之一,或者可能包含内联字符。
The views buffer contains length view structures with the following layout:
视图缓冲区包含具有以下布局的长度视图结构:
* Short strings, length <= 12
| Bytes 0-3 | Bytes 4-15 |
|------------|---------------------------------------|
| length | data (padded with 0) |
* Long strings, length > 12
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
|------------|------------|------------|-------------|
| length | prefix | buf. index | offset |
In both the long and short string cases, the first four bytes encode the
length of the string and can be used to determine how the rest of the view
should be interpreted.
在长字符串和短字符串的情况下,前四个字节编码了字符串的长度,可以用来确定应如何解释视图的其余部分。
In the short string case the string’s bytes are inlined — stored inside the
view itself, in the twelve bytes which follow the length. Any remaining bytes
after the string itself are padded with 0.
在短字符串的情况下,字符串的字节是内联的——存储在视图本身内部,在长度后面的十二个字节中。字符串本身后面的任何剩余字节都用 0 填充。
In the long string case, a buffer index indicates which data buffer
stores the data bytes and an offset indicates where in that buffer the
data bytes begin. Buffer index 0 refers to the first data buffer, IE
the first buffer after the validity buffer and the views buffer.
The half-open range [offset, offset + length)
must be entirely contained
within the indicated buffer. A copy of the first four bytes of the string is
stored inline in the prefix, after the length. This prefix enables a
profitable fast path for string comparisons, which are frequently determined
within the first four bytes.
在长字符串情况下,缓冲区索引指示哪个数据缓冲区存储数据字节,偏移量指示在该缓冲区中数据字节开始的位置。缓冲区索引 0 指的是第一个数据缓冲区,即有效性缓冲区和视图缓冲区之后的第一个缓冲区。半开放范围 [offset, offset + length)
必须完全包含在指示的缓冲区内。字符串的前四个字节的副本存储在前缀中,紧跟在长度之后。这个前缀使得字符串比较的快速路径变得有利可图,这些比较通常在前四个字节内确定。
All integers (length, buffer index, and offset) are signed.
所有整数(长度,缓冲区索引和偏移量)都是有符号的。
This layout is adapted from TU Munich’s UmbraDB.
此布局改编自 TU 慕尼黑的 UmbraDB。
Note that this layout uses one additional buffer to store the variadic buffer
lengths in the Arrow C data interface.
请注意,此布局使用了一个额外的缓冲区来存储 Arrow C 数据接口中的可变参数缓冲区长度。
Variable-size List Layout#
可变大小列表布局#
List is a nested type which is semantically similar to variable-size
binary. There are two list layout variations — “list” and “list-view” —
and each variation can be delimited by either 32-bit or 64-bit offsets
integers.
列表是一种嵌套类型,语义上类似于可变大小的二进制。有两种列表布局变体 - “列表”和“列表视图” - 每种变体都可以由 32 位或 64 位偏移整数分隔。
List Layout# 列表布局#
The List layout is defined by two buffers, a validity bitmap and an offsets
buffer, and a child array. The offsets are the same as in the
variable-size binary case, and both 32-bit and 64-bit signed integer
offsets are supported options for the offsets. Rather than referencing
an additional data buffer, instead these offsets reference the child
array.
列表布局由两个缓冲区、一个有效性位图和一个偏移缓冲区以及一个子数组定义。偏移量与变量大小二进制情况相同,32 位和 64 位有符号整数偏移量都是偏移量的支持选项。这些偏移量不是引用额外的数据缓冲区,而是引用子数组。
Similar to the layout of variable-size binary, a null value may
correspond to a non-empty segment in the child array. When this is
true, the content of the corresponding segment can be arbitrary.
类似于可变大小二进制的布局,空值可能对应于子数组中的非空段。当这是真的时候,相应段的内容可以是任意的。
A list type is specified like List<T>
, where T
is any type
(primitive or nested). In these examples we use 32-bit offsets where
the 64-bit offset version would be denoted by LargeList<T>
.
列表类型被指定为 List<T>
,其中 T
可以是任何类型(基本类型或嵌套类型)。在这些示例中,我们使用 32 位偏移量,如果使用 64 位偏移量版本,则会被表示为 LargeList<T>
。
Example Layout: ``List<Int8>`` Array
示例布局:``List<Int8>`` 数组
We illustrate an example of List<Int8>
with length 4 having values:
我们举例说明一个长度为 4,具有值的 List<Int8>
:
[[12, -7, 25], null, [0, -127, 127, 50], []]
will have the following representation:
将有以下表示:
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-----------------------|
| 0 | 3 | 3 | 7 | 7 | unspecified (padding) |
* Values array (Int8Array):
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Values buffer (int8)
| Bytes 0-6 | Bytes 7-63 |
|------------------------------|-----------------------|
| 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |
Example Layout: ``List<List<Int8>>``
示例布局:``List<List<Int8>>``
[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]]
will be represented as follows:
将以如下方式表示:
* Length 3
* Nulls count: 0
* Validity bitmap buffer: Not required
* Offsets buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|------------|------------|-------------|-----------------------|
| 0 | 2 | 5 | 6 | unspecified (padding) |
* Values array (`List<Int8>`)
* Length: 6, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-------------|
| 00110111 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-27 | Bytes 28-63 |
|----------------------|-----------------------|
| 0, 2, 4, 7, 7, 8, 10 | unspecified (padding) |
* Values array (Int8):
* Length: 10, Null count: 0
* Validity bitmap buffer: Not required
| Bytes 0-9 | Bytes 10-63 |
|-------------------------------|-----------------------|
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
ListView Layout# ListView 布局#
New in version Arrow: Columnar Format 1.4
Arrow 新版本:列式格式 1.4
The ListView layout is defined by three buffers: a validity bitmap, an offsets
buffer, and an additional sizes buffer. Sizes and offsets have the identical bit
width and both 32-bit and 64-bit signed integer options are supported.
ListView 布局由三个缓冲区定义:有效性位图、偏移缓冲区和额外的大小缓冲区。大小和偏移具有相同的位宽,支持 32 位和 64 位有符号整数选项。
As in the List layout, the offsets encode the start position of each slot in the
child array. In contrast to the List layout, list lengths are stored explicitly
in the sizes buffer instead of inferred. This allows offsets to be out of order.
Elements of the child array do not have to be stored in the same order they
logically appear in the list elements of the parent array.
与列表布局一样,偏移量编码了子数组中每个插槽的起始位置。与列表布局不同,列表长度在大小缓冲区中被明确存储,而不是推断出来。这允许偏移量无序。子数组的元素不必按照它们在父数组的列表元素中逻辑上出现的顺序存储。
Every list-view value, including null values, has to guarantee the following
invariants:
每个列表视图值,包括空值,都必须保证以下不变量:
0 <= offsets[i] <= length of the child array
0 <= offsets[i] + size[i] <= length of the child array
A list-view type is specified like ListView<T>
, where T
is any type
(primitive or nested). In these examples we use 32-bit offsets and sizes where
the 64-bit version would be denoted by LargeListView<T>
.
列表视图类型被指定为 ListView<T>
,其中 T
可以是任何类型(基本类型或嵌套类型)。在这些示例中,我们使用 32 位偏移和大小,其中 64 位版本将由 LargeListView<T>
表示。
Example Layout: ``ListView<Int8>`` Array
示例布局:``ListView<Int8>`` 数组
We illustrate an example of ListView<Int8>
with length 4 having values:
我们举例说明一个长度为 4,具有值的 ListView<Int8>
:
[[12, -7, 25], null, [0, -127, 127, 50], []]
It may have the following representation:
它可能有以下表示:
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|-------------|-------------|-------------|-----------------------|
| 0 | 7 | 3 | 0 | unspecified (padding) |
* Sizes buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|------------|-------------|-------------|-------------|-----------------------|
| 3 | 0 | 4 | 0 | unspecified (padding) |
* Values array (Int8Array):
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Values buffer (int8)
| Bytes 0-6 | Bytes 7-63 |
|------------------------------|-----------------------|
| 12, -7, 25, 0, -127, 127, 50 | unspecified (padding) |
Example Layout: ``ListView<Int8>`` Array
示例布局:``ListView<Int8>`` 数组
We continue with the ListView<Int8>
type, but this instance illustrates out
of order offsets and sharing of child array values. It is an array with length 5
having logical values:
我们继续使用 ListView<Int8>
类型,但这个实例说明了顺序错乱的偏移量和子数组值的共享。它是一个长度为 5 的数组,具有逻辑值:
[[12, -7, 25], null, [0, -127, 127, 50], [], [50, 12]]
It may have the following representation:
它可能有以下表示:
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |
* Offsets buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-----------------------|
| 4 | 7 | 0 | 0 | 3 | unspecified (padding) |
* Sizes buffer (int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|------------|-------------|-------------|-------------|-------------|-----------------------|
| 3 | 0 | 4 | 0 | 2 | unspecified (padding) |
* Values array (Int8Array):
* Length: 7, Null count: 0
* Validity bitmap buffer: Not required
* Values buffer (int8)
| Bytes 0-6 | Bytes 7-63 |
|------------------------------|-----------------------|
| 0, -127, 127, 50, 12, -7, 25 | unspecified (padding) |
Fixed-Size List Layout# 固定大小列表布局#
Fixed-Size List is a nested type in which each array slot contains a
fixed-size sequence of values all having the same type.
固定大小列表是一种嵌套类型,其中每个数组插槽都包含一系列固定大小的值,这些值都具有相同的类型。
A fixed size list type is specified like FixedSizeList<T>[N]
,
where T
is any type (primitive or nested) and N
is a 32-bit
signed integer representing the length of the lists.
固定大小的列表类型被指定为 FixedSizeList<T>[N]
,其中 T
可以是任何类型(基本类型或嵌套类型), N
是一个 32 位有符号整数,表示列表的长度。
A fixed size list array is represented by a values array, which is a
child array of type T. T may also be a nested type. The value in slot
j
of a fixed size list array is stored in an N
-long slice of
the values array, starting at an offset of j * N
.
固定大小的列表数组由值数组表示,该值数组是类型为 T 的子数组。 T 也可能是嵌套类型。固定大小列表数组中 j
插槽的值存储在值数组的 N
长切片中,从偏移量 j * N
开始。
Example Layout: ``FixedSizeList<byte>[4]`` Array
示例布局:``FixedSizeList<byte>[4]`` 数组
Here we illustrate FixedSizeList<byte>[4]
.
在这里,我们展示 FixedSizeList<byte>[4]
。
For an array of length 4 with respective values:
对于一个长度为 4 的数组,其各自的值为:
[[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]]
will have the following representation:
将有以下表示:
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |
* Values array (byte array):
* Length: 16, Null count: 0
* validity bitmap buffer: Not required
| Bytes 0-3 | Bytes 4-7 | Bytes 8-15 |
|-----------------|-------------|---------------------------------|
| 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 |
Struct Layout# 结构布局#
A struct is a nested type parameterized by an ordered sequence of
types (which can all be distinct), called its fields. Each field must
have a UTF8-encoded name, and these field names are part of the type
metadata.
结构体是一种嵌套类型,由一系列有序的类型(可以全部不同)参数化,称为其字段。每个字段必须有一个 UTF8 编码的名称,这些字段名称是类型元数据的一部分。
Physically, a struct array has one child array for each field. The
child arrays are independent and need not be adjacent to each other in
memory. A struct array also has a validity bitmap to encode top-level
validity information.
在物理上,结构数组对每个字段都有一个子数组。子数组是独立的,不需要在内存中相邻。结构数组还有一个有效性位图,用于编码顶级有效性信息。
For example, the struct (field names shown here as strings for illustration
purposes):
例如,结构体(这里以字符串形式显示字段名称以便说明):
Struct <
name: VarBinary
age: Int32
>
has two child arrays, one VarBinary
array (using variable-size binary
layout) and one 4-byte primitive value array having Int32
logical
type.
具有两个子数组,一个 VarBinary
数组(使用可变大小的二进制布局)和一个具有 Int32
逻辑类型的 4 字节原始值数组。
Example Layout: ``Struct<VarBinary, Int32>``
示例布局:``Struct<VarBinary, Int32>``
The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}]
, having
child arrays ['joe', null, 'alice', 'mark']
and [1, 2, null, 4]
would be:
对于 [{'joe', 1}, {null, 2}, null, {'mark', 4}]
的布局,拥有子数组 ['joe', null, 'alice', 'mark']
和 [1, 2, null, 4]
将会是:
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001011 | 0 (padding) |
* Children arrays:
* field-0 array (`VarBinary`):
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001101 | 0 (padding) |
* Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|----------------|-----------------------|
| 0, 3, 3, 8, 12 | unspecified (padding) |
* Value buffer:
| Bytes 0-11 | Bytes 12-63 |
|----------------|-----------------------|
| joealicemark | unspecified (padding) |
* field-1 array (int32 array):
* Length: 4, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001011 | 0 (padding) |
* Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|-------------|-------------|-------------|-------------|-----------------------|
| 1 | 2 | unspecified | 4 | unspecified (padding) |
Struct Validity# 结构有效性 #
A struct array has its own validity bitmap that is independent of its
child arrays’ validity bitmaps. The validity bitmap for the struct
array might indicate a null when one or more of its child arrays has
a non-null value in its corresponding slot; or conversely, a child
array might indicate a null in its validity bitmap while the struct array’s
validity bitmap shows a non-null value.
结构数组有自己的有效性位图,这与其子数组的有效性位图无关。结构数组的有效性位图可能在其一个或多个子数组在相应的插槽中有非空值时指示一个空值;或者相反,子数组可能在其有效性位图中指示一个空值,而结构数组的有效性位图显示非空值。
Therefore, to know whether a particular child entry is valid, one must
take the logical AND of the corresponding bits in the two validity bitmaps
(the struct array’s and the child array’s).
因此,要知道特定子条目是否有效,必须在两个有效性位图(结构数组和子数组的)中对应的位进行逻辑与运算。
This is illustrated in the example above, one of the child arrays has a
valid entry 'alice'
for the null struct but it is “hidden” by the
struct array’s validity bitmap. However, when treated independently,
corresponding entries of the children array will be non-null.
如上例所示,子数组中的一个有一个对空结构有效的条目 'alice'
,但它被结构数组的有效性位图“隐藏”了。然而,当独立处理时,子数组的相应条目将是非空的。
Union Layout# 联合布局 #
A union is defined by an ordered sequence of types; each slot in the
union can have a value chosen from these types. The types are named
like a struct’s fields, and the names are part of the type metadata.
联合体由一系列有序的类型定义;联合体中的每个插槽可以从这些类型中选择一个值。这些类型的命名方式类似于结构体的字段,而这些名称是类型元数据的一部分。
Unlike other data types, unions do not have their own validity bitmap. Instead,
the nullness of each slot is determined exclusively by the child arrays which
are composed to create the union.
与其他数据类型不同,联合并没有自己的有效性位图。相反,每个插槽的空值完全由组成联合的子数组确定。
We define two distinct union types, “dense” and “sparse”, that are
optimized for different use cases.
我们定义了两种不同的联合类型,“密集”和“稀疏”,它们针对不同的使用情况进行了优化。
Dense Union# 密集联盟 #
Dense union represents a mixed-type array with 5 bytes of overhead for
each value. Its physical layout is as follows:
密集联合表示每个值需要 5 字节开销的混合类型数组。其物理布局如下:
One child array for each type
每种类型的一个子数组Types buffer: A buffer of 8-bit signed integers. Each type in the union has a corresponding type id whose values are found in this buffer. A union with more than 127 possible types can be modeled as a union of unions.
类型缓冲区:一个 8 位有符号整数的缓冲区。联合中的每种类型都有一个相应的类型 id,其值在此缓冲区中找到。一个可能的类型超过 127 的联合可以被建模为联合的联合。Offsets buffer: A buffer of signed Int32 values indicating the relative offset into the respective child array for the type in a given slot. The respective offsets for each child value array must be in order / increasing.
偏移缓冲区:一个包含有符号 Int32 值的缓冲区,表示给定插槽中类型在各自子数组中的相对偏移量。每个子值数组的各自偏移量必须按顺序/递增排列。
Example Layout: ``DenseUnion<f: Float32, i: Int32>``
示例布局:``DenseUnion<f: Float32, i: Int32>``
For the union array:
对于联合数组:
[{f=1.2}, null, {f=3.4}, {i=5}]
will have the following layout:
将具有以下布局:
* Length: 4, Null count: 0
* Types buffer:
| Byte 0 | Byte 1 | Byte 2 | Byte 3 | Bytes 4-63 |
|----------|-------------|----------|----------|-----------------------|
| 0 | 0 | 0 | 1 | unspecified (padding) |
* Offset buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|-----------|-------------|------------|-------------|-----------------------|
| 0 | 1 | 2 | 0 | unspecified (padding) |
* Children arrays:
* Field-0 array (f: Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer: 00000101
* Value Buffer:
| Bytes 0-11 | Bytes 12-63 |
|----------------|-----------------------|
| 1.2, null, 3.4 | unspecified (padding) |
* Field-1 array (i: Int32):
* Length: 1, Null count: 0
* Validity bitmap buffer: Not required
* Value Buffer:
| Bytes 0-3 | Bytes 4-63 |
|-----------|-----------------------|
| 5 | unspecified (padding) |
Sparse Union# 稀疏联合 #
A sparse union has the same structure as a dense union, with the omission of
the offsets array. In this case, the child arrays are each equal in length to
the length of the union.
稀疏联合具有与密集联合相同的结构,但省略了偏移量数组。在这种情况下,子数组的长度都等于联合的长度。
While a sparse union may use significantly more space compared with a
dense union, it has some advantages that may be desirable in certain
use cases:
虽然稀疏并集可能比密集并集使用更多的空间,但它在某些使用场景中可能具有一些令人期待的优点:
A sparse union is more amenable to vectorized expression evaluation in some use cases.
稀疏并集在某些使用场景下更适合向量化表达式评估。Equal-length arrays can be interpreted as a union by only defining the types array.
等长数组可以通过仅定义类型数组来解释为联合。
Example layout: ``SparseUnion<i: Int32, f: Float32, s: VarBinary>``
示例布局:``SparseUnion<i: Int32, f: Float32, s: VarBinary>``
For the union array:
对于联合数组:
[{i=5}, {f=1.2}, {s='joe'}, {f=3.4}, {i=4}, {s='mark'}]
will have the following layout:
将具有以下布局:
* Length: 6, Null count: 0
* Types buffer:
| Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Bytes 6-63 |
|------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
| 0 | 1 | 2 | 1 | 0 | 2 | unspecified (padding) |
* Children arrays:
* i (Int32):
* Length: 6, Null count: 4
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00010001 | 0 (padding) |
* Value buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 |
|-------------|-------------|-------------|-------------|-------------|--------------|-----------------------|
| 5 | unspecified | unspecified | unspecified | 4 | unspecified | unspecified (padding) |
* f (Float32):
* Length: 6, Null count: 4
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001010 | 0 (padding) |
* Value buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 |
|--------------|-------------|-------------|-------------|-------------|-------------|-----------------------|
| unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | unspecified (padding) |
* s (`VarBinary`)
* Length: 6, Null count: 4
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00100100 | 0 (padding) |
* Offsets buffer (Int32)
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 |
|------------|-------------|-------------|-------------|-------------|-------------|-------------|------------------------|
| 0 | 0 | 0 | 3 | 3 | 3 | 7 | unspecified (padding) |
* Values buffer:
| Bytes 0-6 | Bytes 7-63 |
|------------|-----------------------|
| joemark | unspecified (padding) |
Only the slot in the array corresponding to the type index is considered. All
“unselected” values are ignored and could be any semantically correct array
value.
仅考虑与类型索引对应的数组插槽。所有“未选中”的值都将被忽略,可以是任何语义上正确的数组值。
Null Layout# 空布局 #
We provide a simplified memory-efficient layout for the Null data type
where all values are null. In this case no memory buffers are
allocated.
我们为 Null 数据类型提供了一种简化的内存高效布局,其中所有值都为空。在这种情况下,不会分配任何内存缓冲区。
Dictionary-encoded Layout#
字典编码布局#
Dictionary encoding is a data representation technique to represent
values by integers referencing a dictionary usually consisting of
unique values. It can be effective when you have data with many
repeated values.
字典编码是一种数据表示技术,通过引用通常由唯一值组成的字典来表示值。当你拥有许多重复值的数据时,它可能会非常有效。
Any array can be dictionary-encoded. The dictionary is stored as an optional
property of an array. When a field is dictionary encoded, the values are
represented by an array of non-negative integers representing the index of the
value in the dictionary. The memory layout for a dictionary-encoded array is
the same as that of a primitive integer layout. The dictionary is handled as a
separate columnar array with its own respective layout.
任何数组都可以进行字典编码。字典作为数组的一个可选属性进行存储。当一个字段进行字典编码时,值由表示字典中值的索引的非负整数数组表示。字典编码数组的内存布局与原始整数布局相同。字典被视为具有自己相应布局的单独的列式数组进行处理。
As an example, you could have the following data:
例如,你可以拥有以下数据:
type: VarBinary
['foo', 'bar', 'foo', 'bar', null, 'baz']
In dictionary-encoded form, this could appear as:
在字典编码形式中,这可能会显示为:
data VarBinary (dictionary-encoded)
index_type: Int32
values: [0, 1, 0, 1, null, 2]
dictionary
type: VarBinary
values: ['foo', 'bar', 'baz']
Note that a dictionary is permitted to contain duplicate values or
nulls:
请注意,字典允许包含重复值或空值:
data VarBinary (dictionary-encoded)
index_type: Int32
values: [0, 1, 3, 1, 4, 2]
dictionary
type: VarBinary
values: ['foo', 'bar', 'baz', 'foo', null]
The null count of such arrays is dictated only by the validity bitmap
of its indices, irrespective of any null values in the dictionary.
这种数组的空值计数仅由其索引的有效性位图决定,而与字典中的任何空值无关。
Since unsigned integers can be more difficult to work with in some cases
(e.g. in the JVM), we recommend preferring signed integers over unsigned
integers for representing dictionary indices. Additionally, we recommend
avoiding using 64-bit unsigned integer indices unless they are required by an
application.
由于在某些情况下(例如在 JVM 中)无符号整数可能更难处理,我们建议优先使用有符号整数而不是无符号整数来表示字典索引。此外,我们建议除非应用程序需要,否则避免使用 64 位无符号整数索引。
We discuss dictionary encoding as it relates to serialization further
below.
我们在下文中进一步讨论字典编码与序列化的关系。
Run-End Encoded Layout# 运行结束编码布局#
New in version Arrow: Columnar Format 1.3
在 Arrow 版本中的新功能:列式格式 1.3
Run-end encoding (REE) is a variation of run-length encoding (RLE). These
encodings are well-suited for representing data containing sequences of the
same value, called runs. In run-end encoding, each run is represented as a
value and an integer giving the index in the array where the run ends.
行尾编码(REE)是行长编码(RLE)的一种变体。这些编码非常适合表示包含相同值序列的数据,这些序列被称为运行。在行尾编码中,每个运行都表示为一个值和一个整数,该整数给出了数组中运行结束的索引。
Any array can be run-end encoded. A run-end encoded array has no buffers
by itself, but has two child arrays. The first child array, called the run ends array,
holds either 16, 32, or 64-bit signed integers. The actual values of each run
are held in the second child array.
For the purposes of determining field names and schemas, these child arrays
are prescribed the standard names of run_ends and values respectively.
任何数组都可以进行行末编码。行末编码的数组本身没有缓冲区,但有两个子数组。第一个子数组被称为行末数组,包含 16、32 或 64 位的有符号整数。每个运行的实际值都存储在第二个子数组中。为了确定字段名称和模式,这些子数组被规定为 run_ends 和 values 的标准名称。
The values in the first child array represent the accumulated length of all runs
from the first to the current one, i.e. the logical index where the
current run ends. This allows relatively efficient random access from a logical
index using binary search. The length of an individual run can be determined by
subtracting two adjacent values. (Contrast this with run-length encoding, in
which the lengths of the runs are represented directly, and in which random
access is less efficient.)
第一个子数组中的值代表了从第一个到当前运行的所有运行的累积长度,即当前运行结束的逻辑索引。这允许使用二进制搜索从逻辑索引进行相对高效的随机访问。可以通过减去两个相邻的值来确定单个运行的长度。(与此形成对比的是,运行长度编码直接表示了运行的长度,而且随机访问的效率较低。)
Note 注意
Because the run_ends
child array cannot have nulls, it’s reasonable
to consider why the run_ends
are a child array instead of just a
buffer, like the offsets for a Variable-size List Layout. This
layout was considered, but it was decided to use the child arrays.
因为 run_ends
子数组不能有空值,所以有理由考虑为什么 run_ends
是子数组而不仅仅是一个缓冲区,就像可变大小列表布局的偏移量一样。考虑过这种布局,但最终决定使用子数组。
Child arrays allow us to keep the “logical length” (the decoded length)
associated with the parent array and the “physical length” (the number
of run ends) associated with the child arrays. If run_ends
was a
buffer in the parent array then the size of the buffer would be unrelated
to the length of the array and this would be confusing.
子数组使我们能够保持与父数组相关的“逻辑长度”(解码长度)和与子数组相关的“物理长度”(运行结束的数量)。如果 run_ends
是父数组中的缓冲区,那么缓冲区的大小将与数组的长度无关,这将令人困惑。
A run must have a length of at least 1. This means the values in the
run ends array all are positive and in strictly ascending order. A run end cannot be
null.
跑步长度必须至少为 1。这意味着跑步结束数组中的所有值都是正数,并且严格按升序排列。跑步结束不能为 null。
The REE parent has no validity bitmap, and it’s null count field should always be 0.
Null values are encoded as runs with the value null.
REE 父项没有有效性位图,其空值计数字段应始终为 0。空值被编码为值为空的运行。
As an example, you could have the following data:
例如,你可以拥有以下数据:
type: Float32
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]
In Run-end-encoded form, this could appear as:
在行尾编码形式中,这可能会显示为:
* Length: 7, Null count: 0
* Child Arrays:
* run_ends (Int32):
* Length: 3, Null count: 0 (Run Ends cannot be null)
* Validity bitmap buffer: Not required (if it exists, it should be all 1s)
* Values buffer
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |
* values (Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00000101 | 0 (padding) |
* Values buffer
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-63 |
|-------------|-------------|-------------|-----------------------|
| 1.0 | unspecified | 2.0 | unspecified (padding) |
Buffer Listing for Each Layout#
每个布局的缓冲区列表 #
For the avoidance of ambiguity, we provide listing the order and type
of memory buffers for each layout.
为避免歧义,我们提供了每种布局的内存缓冲区的顺序和类型的列表。
Layout Type 布局类型 |
Buffer 0 缓冲区 0 |
Buffer 1 缓冲区 1 |
Buffer 2 缓冲区 2 |
Variadic Buffers 可变参数缓冲区 |
---|---|---|---|---|
Primitive |
validity |
data |
||
Variable Binary 二进制变量 |
validity |
offsets |
data |
|
Variable Binary View 变量二进制视图 |
validity |
views |
data |
|
List |
validity |
offsets |
||
Fixed-size List 固定大小的列表 |
validity |
|||
Struct |
validity |
|||
Sparse Union 稀疏联合 |
type ids 类型标识符 |
|||
Dense Union 密集联盟 |
type ids 类型标识符 |
offsets |
||
Null |
||||
Dictionary-encoded |
validity |
data (indices) 数据(索引) |
||
Run-end encoded 运行结束编码 |
Logical Types# 逻辑类型 #
The Schema.fbs defines built-in logical types supported by the
Arrow columnar format. Each logical type uses one of the above
physical layouts. Nested logical types may have different physical
layouts depending on the particular realization of the type.
Schema.fbs 定义了 Arrow 列式格式支持的内置逻辑类型。每种逻辑类型都使用上述物理布局之一。嵌套的逻辑类型可能会根据类型的特定实现有不同的物理布局。
We do not go into detail about the logical types definitions in this
document as we consider Schema.fbs to be authoritative.
我们在此文档中并未详细讨论逻辑类型定义,因为我们认为 Schema.fbs 是权威的。
Serialization and Interprocess Communication (IPC)#
序列化和进程间通信(IPC)#
The primitive unit of serialized data in the columnar format is the
“record batch”. Semantically, a record batch is an ordered collection
of arrays, known as its fields, each having the same length as one
another but potentially different data types. A record batch’s field
names and types collectively form the batch’s schema.
在列式格式中,序列化数据的原始单位是“记录批次”。从语义上讲,记录批次是一个有序的数组集合,被称为其字段,每个字段的长度相同,但可能具有不同的数据类型。记录批次的字段名称和类型共同构成了批次的模式。
In this section we define a protocol for serializing record batches
into a stream of binary payloads and reconstructing record batches
from these payloads without need for memory copying.
在本节中,我们定义了一种协议,用于将记录批次序列化为二进制有效载荷流,并从这些有效载荷中重构记录批次,无需进行内存复制。
The columnar IPC protocol utilizes a one-way stream of binary messages
of these types:
柱状 IPC 协议利用这些类型的单向二进制消息流:
Schema 模式
RecordBatch 记录批次
DictionaryBatch 字典批次
We specify a so-called encapsulated IPC message format which
includes a serialized Flatbuffer type along with an optional message
body. We define this message format before describing how to serialize
each constituent IPC message type.
我们指定了一种所谓的封装 IPC 消息格式,其中包括一个序列化的 Flatbuffer 类型以及一个可选的消息主体。我们在描述如何序列化每个组成 IPC 消息类型之前定义了这种消息格式。
Encapsulated message format#
封装的消息格式#
For simple streaming and file-based serialization, we define a
“encapsulated” message format for interprocess communication. Such
messages can be “deserialized” into in-memory Arrow array objects by
examining only the message metadata without any need to copy or move
any of the actual data.
对于简单的流式和基于文件的序列化,我们定义了一种“封装”的消息格式用于进程间通信。通过仅检查消息元数据,无需复制或移动任何实际数据,就可以将这些消息“反序列化”为内存中的 Arrow 数组对象。
The encapsulated binary message format is as follows:
封装的二进制消息格式如下:
A 32-bit continuation indicator. The value
0xFFFFFFFF
indicates a valid message. This component was introduced in version 0.15.0 in part to address the 8-byte alignment requirement of Flatbuffers
32 位继续指示器。值0xFFFFFFFF
表示有效的消息。这个组件在 0.15.0 版本中引入,部分是为了解决 Flatbuffers 的 8 字节对齐要求。A 32-bit little-endian length prefix indicating the metadata size
表示元数据大小的 32 位小端长度前缀The message metadata as using the
Message
type defined in Message.fbs
消息元数据使用 Message.fbs 中定义的Message
类型Padding bytes to an 8-byte boundary
将字节填充到 8 字节边界The message body, whose length must be a multiple of 8 bytes
消息体的长度必须是 8 字节的倍数
Schematically, we have: 示意性地,我们有:
<continuation: 0xFFFFFFFF>
<metadata_size: int32>
<metadata_flatbuffer: bytes>
<padding>
<message body>
The complete serialized message must be a multiple of 8 bytes so that messages
can be relocated between streams. Otherwise the amount of padding between the
metadata and the message body could be non-deterministic.
完整的序列化消息必须是 8 字节的倍数,以便可以在流之间重新定位消息。否则,元数据和消息主体之间的填充量可能是不确定的。
The metadata_size
includes the size of the Message
plus
padding. The metadata_flatbuffer
contains a serialized Message
Flatbuffer value, which internally includes:
metadata_size
包括 Message
的大小加上填充。 metadata_flatbuffer
包含一个序列化的 Message
Flatbuffer 值,其内部包括:
A version number 版本号
A particular message value (one of
Schema
,RecordBatch
, orDictionaryBatch
)
特定的消息值(Schema
,RecordBatch
或DictionaryBatch
中的一个)The size of the message body
消息正文的大小A
custom_metadata
field for any application-supplied metadata
一个custom_metadata
字段,用于任何应用程序提供的元数据
When read from an input stream, generally the Message
metadata is
initially parsed and validated to obtain the body size. Then the body
can be read.
从输入流读取时,通常首先解析并验证 Message
元数据以获取主体大小。然后可以读取主体。
Schema message# 模式消息#
The Flatbuffers files Schema.fbs contains the definitions for all
built-in logical data types and the Schema
metadata type which
represents the schema of a given record batch. A schema consists of
an ordered sequence of fields, each having a name and type. A
serialized Schema
does not contain any data buffers, only type
metadata.
Flatbuffers 文件 Schema.fbs 包含所有内置逻辑数据类型和 Schema
元数据类型的定义,后者代表给定记录批次的模式。模式由一系列有序字段组成,每个字段都有一个名称和类型。序列化的 Schema
不包含任何数据缓冲区,只包含类型元数据。
The Field
Flatbuffers type contains the metadata for a single
array. This includes:
Field
Flatbuffers 类型包含单个数组的元数据。这包括:
The field’s name 字段的名称
The field’s logical type 字段的逻辑类型
Whether the field is semantically nullable. While this has no bearing on the array’s physical layout, many systems distinguish nullable and non-nullable fields and we want to allow them to preserve this metadata to enable faithful schema round trips.
该字段是否在语义上可为空。虽然这对数组的物理布局没有影响,但许多系统区分可空字段和非可空字段,我们希望允许它们保留这些元数据,以实现对模式的忠实往返。A collection of child
Field
values, for nested types
一组嵌套类型的子Field
值的集合A
dictionary
property indicating whether the field is dictionary-encoded or not. If it is, a dictionary “id” is assigned to allow matching a subsequent dictionary IPC message with the appropriate field.
一个dictionary
属性,用于指示字段是否进行了字典编码。如果是,将分配一个字典“id”以允许将后续的字典 IPC 消息与适当的字段进行匹配。
We additionally provide both schema-level and field-level
custom_metadata
attributes allowing for systems to insert their
own application defined metadata to customize behavior.
我们还提供了模式级别和字段级别的 custom_metadata
属性,允许系统插入自己的应用程序定义的元数据以定制行为。
RecordBatch message# 记录批次消息#
A RecordBatch message contains the actual data buffers corresponding
to the physical memory layout determined by a schema. The metadata for
this message provides the location and size of each buffer, permitting
Array data structures to be reconstructed using pointer arithmetic and
thus no memory copying.
RecordBatch 消息包含与模式确定的物理内存布局相对应的实际数据缓冲区。此消息的元数据提供了每个缓冲区的位置和大小,允许使用指针算术重构数组数据结构,因此不需要复制内存。
The serialized form of the record batch is the following:
记录批次的序列化形式如下:
The
data header
, defined as theRecordBatch
type in Message.fbs.
data header
,在 Message.fbs 中定义为RecordBatch
类型。The
body
, a flat sequence of memory buffers written end-to-end with appropriate padding to ensure a minimum of 8-byte alignment
body
,一个平坦的内存缓冲区序列,端对端编写并适当填充以确保至少 8 字节的对齐
The data header contains the following:
数据头包含以下内容:
The length and null count for each flattened field in the record batch
记录批次中每个扁平化字段的长度和空值计数The memory offset and length of each constituent
Buffer
in the record batch’s body
记录批次主体中每个构成部分Buffer
的内存偏移量和长度
Fields and buffers are flattened by a pre-order depth-first traversal
of the fields in the record batch. For example, let’s consider the
schema
字段和缓冲区通过对记录批次中的字段进行预排序深度优先遍历进行扁平化。例如,让我们考虑这个模式
col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
col2: Utf8
The flattened version of this is:
这个的扁平化版本是:
FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
For the buffers produced, we would have the following (refer to the
table above):
对于生成的缓冲区,我们将有以下内容(参见上表):
buffer 0: field 0 validity
buffer 1: field 1 validity
buffer 2: field 1 values
buffer 3: field 2 validity
buffer 4: field 2 offsets
buffer 5: field 3 validity
buffer 6: field 3 values
buffer 7: field 4 validity
buffer 8: field 4 values
buffer 9: field 5 validity
buffer 10: field 5 offsets
buffer 11: field 5 data
The Buffer
Flatbuffers value describes the location and size of a
piece of memory. Generally these are interpreted relative to the
encapsulated message format defined below.
Buffer
Flatbuffers 值描述了一块内存的位置和大小。通常,这些相对于下面定义的封装消息格式进行解释。
The size
field of Buffer
is not required to account for padding
bytes. Since this metadata can be used to communicate in-memory pointer
addresses between libraries, it is recommended to set size
to the actual
memory size rather than the padded size.
size
字段的 Buffer
不需要考虑填充字节。由于此元数据可用于在库之间传递内存指针地址,因此建议将 size
设置为实际内存大小,而不是填充大小。
Variadic buffers# 可变参数缓冲区 #
New in version Arrow: Columnar Format 1.4
在 Arrow 版本中的新特性:列式格式 1.4
Some types such as Utf8View are represented using a variable number of buffers.
For each such Field in the pre-ordered flattened logical schema, there will be
an entry in variadicBufferCounts
to indicate the number of variadic buffers
which belong to that Field in the current RecordBatch.
某些类型,如 Utf8View,是通过可变数量的缓冲区来表示的。对于预排序的扁平化逻辑模式中的每个此类字段,在 variadicBufferCounts
中都会有一个条目,用来指示当前 RecordBatch 中属于该字段的可变缓冲区的数量。
For example, consider the schema
例如,考虑以下模式
col1: Struct<a: Int32, b: BinaryView, c: Float64>
col2: Utf8View
This has two fields with variadic buffers, so variadicBufferCounts
will
have two entries in each RecordBatch. For a RecordBatch of this schema with
variadicBufferCounts = [3, 2]
, the flattened buffers would be:
这有两个字段带有可变缓冲区,所以 variadicBufferCounts
将在每个 RecordBatch 中有两个条目。对于具有 variadicBufferCounts = [3, 2]
的此模式的 RecordBatch,扁平化的缓冲区将是:
buffer 0: col1 validity
buffer 1: col1.a validity
buffer 2: col1.a values
buffer 3: col1.b validity
buffer 4: col1.b views
buffer 5: col1.b data
buffer 6: col1.b data
buffer 7: col1.b data
buffer 8: col1.c validity
buffer 9: col1.c values
buffer 10: col2 validity
buffer 11: col2 views
buffer 12: col2 data
buffer 13: col2 data
Byte Order (Endianness)#
字节顺序(字节序)#
The Arrow format is little endian by default.
Arrow 格式默认为小端模式。
Serialized Schema metadata has an endianness field indicating
endianness of RecordBatches. Typically this is the endianness of the
system where the RecordBatch was generated. The main use case is
exchanging RecordBatches between systems with the same Endianness. At
first we will return an error when trying to read a Schema with an
endianness that does not match the underlying system. The reference
implementation is focused on Little Endian and provides tests for
it. Eventually we may provide automatic conversion via byte swapping.
序列化的 Schema 元数据有一个表示 RecordBatches 字节顺序的字段。这通常是生成 RecordBatch 的系统的字节顺序。主要用例是在具有相同字节顺序的系统之间交换 RecordBatches。起初,当试图读取一个与底层系统字节顺序不匹配的 Schema 时,我们会返回一个错误。参考实现主要关注小端字节顺序,并为其提供测试。最终,我们可能会通过字节交换提供自动转换。
IPC Streaming Format# IPC 流媒体格式#
We provide a streaming protocol or “format” for record batches. It is
presented as a sequence of encapsulated messages, each of which
follows the format above. The schema comes first in the stream, and it
is the same for all of the record batches that follow. If any fields
in the schema are dictionary-encoded, one or more DictionaryBatch
messages will be included. DictionaryBatch
and RecordBatch
messages may be interleaved, but before any dictionary key is used in
a RecordBatch
it should be defined in a DictionaryBatch
.
我们提供一种用于记录批次的流协议或“格式”。它被呈现为一系列封装的消息,每个消息都遵循上述格式。模式首先出现在流中,并且对于所有后续的记录批次都是相同的。如果模式中的任何字段都是字典编码的,则会包含一个或多个 DictionaryBatch
消息。 DictionaryBatch
和 RecordBatch
消息可能会交错,但在 RecordBatch
中使用任何字典键之前,应在 DictionaryBatch
中定义。
<SCHEMA>
<DICTIONARY 0>
...
<DICTIONARY k - 1>
<RECORD BATCH 0>
...
<DICTIONARY x DELTA>
...
<DICTIONARY y DELTA>
...
<RECORD BATCH n - 1>
<EOS [optional]: 0xFFFFFFFF 0x00000000>
Note 注意
An edge-case for interleaved dictionary and record batches occurs
when the record batches contain dictionary encoded arrays that are
completely null. In this case, the dictionary for the encoded column might
appear after the first record batch.
对于交错的字典和记录批次,边缘情况发生在记录批次包含完全为空的字典编码数组时。在这种情况下,编码列的字典可能会在第一批记录之后出现。
When a stream reader implementation is reading a stream, after each
message, it may read the next 8 bytes to determine both if the stream
continues and the size of the message metadata that follows. Once the
message flatbuffer is read, you can then read the message body.
当流读取器实现正在读取一个流时,每读取一条消息后,它可能会读取接下来的 8 个字节,以确定流是否继续以及接下来的消息元数据的大小。一旦读取了消息的 flatbuffer,您就可以读取消息体了。
The stream writer can signal end-of-stream (EOS) either by writing 8 bytes
containing the 4-byte continuation indicator (0xFFFFFFFF
) followed by 0
metadata length (0x00000000
) or closing the stream interface. We
recommend the “.arrows” file extension for the streaming format although
in many cases these streams will not ever be stored as files.
流写入器可以通过写入包含 4 字节续行指示符( 0xFFFFFFFF
)的 8 字节,后跟 0 元数据长度( 0x00000000
),或者关闭流接口来发出流结束(EOS)信号。我们建议使用“.arrows”文件扩展名作为流格式,尽管在许多情况下,这些流不会被存储为文件。
IPC File Format# IPC 文件格式#
We define a “file format” supporting random access that is an extension of
the stream format. The file starts and ends with a magic string ARROW1
(plus padding). What follows in the file is identical to the stream format.
At the end of the file, we write a footer containing a redundant copy of
the schema (which is a part of the streaming format) plus memory offsets and
sizes for each of the data blocks in the file. This enables random access to
any record batch in the file. See File.fbs for the precise details of the
file footer.
我们定义了一种支持随机访问的“文件格式”,它是流格式的扩展。文件以魔术字符串 ARROW1
(加上填充)开始和结束。文件中的后续内容与流格式相同。在文件的末尾,我们写入一个包含模式的冗余副本(这是流格式的一部分)以及文件中每个数据块的内存偏移量和大小的页脚。这使得可以随机访问文件中的任何记录批次。请参阅 File.fbs 以获取文件页脚的精确细节。
Schematically we have: 我们在图示中有:
<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT with EOS>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">
In the file format, there is no requirement that dictionary keys
should be defined in a DictionaryBatch
before they are used in a
RecordBatch
, as long as the keys are defined somewhere in the
file. Further more, it is invalid to have more than one non-delta
dictionary batch per dictionary ID (i.e. dictionary replacement is not
supported). Delta dictionaries are applied in the order they appear in
the file footer. We recommend the “.arrow” extension for files created with
this format. Note that files created with this format are sometimes called
“Feather V2” or with the “.feather” extension, the name and the extension
derived from “Feather (V1)”, which was a proof of concept early in
the Arrow project for language-agnostic fast data frame storage for
Python (pandas) and R.
在文件格式中,没有要求字典键必须在 DictionaryBatch
中定义,然后在 RecordBatch
中使用,只要键在文件的某个地方定义即可。此外,每个字典 ID 不能有多于一个的非增量字典批次(即不支持字典替换)。增量字典按照它们在文件页脚中出现的顺序应用。我们建议使用“.arrow”扩展名创建的文件。请注意,使用此格式创建的文件有时被称为“Feather V2”或带有“.feather”扩展名,这个名称和扩展名源自“Feather (V1)”,这是 Arrow 项目早期的一个概念验证,用于 Python(pandas)和 R 的语言无关的快速数据帧存储。
Dictionary Messages# 词典信息 #
Dictionaries are written in the stream and file formats as a sequence of record
batches, each having a single field. The complete semantic schema for a
sequence of record batches, therefore, consists of the schema along with all of
the dictionaries. The dictionary types are found in the schema, so it is
necessary to read the schema to first determine the dictionary types so that
the dictionaries can be properly interpreted:
字典以记录批次的形式写入流和文件格式,每个批次只有一个字段。因此,一系列记录批次的完整语义模式由模式和所有字典组成。字典类型在模式中可以找到,所以必须先读取模式以确定字典类型,这样才能正确解释字典:
table DictionaryBatch {
id: long;
data: RecordBatch;
isDelta: boolean = false;
}
The dictionary id
in the message metadata can be referenced one or more times
in the schema, so that dictionaries can even be used for multiple fields. See
the Dictionary-encoded Layout section for more about the semantics of
dictionary-encoded data.
消息元数据中的字典 id
可以在模式中被引用一次或多次,以便字典甚至可以用于多个字段。有关字典编码数据语义的更多信息,请参阅字典编码布局部分。
The dictionary isDelta
flag allows existing dictionaries to be
expanded for future record batch materializations. A dictionary batch
with isDelta
set indicates that its vector should be concatenated
with those of any previous batches with the same id
. In a stream
which encodes one column, the list of strings ["A", "B", "C", "B",
"D", "C", "E", "A"]
, with a delta dictionary batch could take the
form:
字典 isDelta
标志允许现有的字典扩展,以便于未来的记录批次实现。带有 isDelta
设置的字典批次表示,其向量应与具有相同 id
的任何先前批次的向量进行连接。在编码一个列的流中,字符串列表 ["A", "B", "C", "B",
"D", "C", "E", "A"]
,带有一个增量字典批次可能会采取以下形式:
<SCHEMA>
<DICTIONARY 0>
(0) "A"
(1) "B"
(2) "C"
<RECORD BATCH 0>
0
1
2
1
<DICTIONARY 0 DELTA>
(3) "D"
(4) "E"
<RECORD BATCH 1>
3
2
4
0
EOS
Alternatively, if isDelta
is set to false, then the dictionary
replaces the existing dictionary for the same ID. Using the same
example as above, an alternate encoding could be:
或者,如果 isDelta
设置为假,则该字典将替换具有相同 ID 的现有字典。使用上述相同的例子,一个替代的编码可能是:
<SCHEMA>
<DICTIONARY 0>
(0) "A"
(1) "B"
(2) "C"
<RECORD BATCH 0>
0
1
2
1
<DICTIONARY 0>
(0) "A"
(1) "C"
(2) "D"
(3) "E"
<RECORD BATCH 1>
2
1
3
0
EOS
Custom Application Metadata#
自定义应用程序元数据 #
We provide a custom_metadata
field at three levels to provide a
mechanism for developers to pass application-specific metadata in
Arrow protocol messages. This includes Field
, Schema
, and
Message
.
我们在三个级别提供了一个 custom_metadata
字段,以便为开发人员在 Arrow 协议消息中传递特定于应用程序的元数据提供一种机制。这包括 Field
, Schema
和 Message
。
The colon symbol :
is to be used as a namespace separator. It can
be used multiple times in a key.
冒号符号 :
将被用作命名空间分隔符。它可以在一个键中多次使用。
The ARROW
pattern is a reserved namespace for internal Arrow use
in the custom_metadata
fields. For example,
ARROW:extension:name
.
ARROW
模式是 Arrow 内部使用的保留命名空间,在 custom_metadata
字段中使用。例如, ARROW:extension:name
。
Extension Types# 扩展类型 #
User-defined “extension” types can be defined setting certain
KeyValue
pairs in custom_metadata
in the Field
metadata
structure. These extension keys are:
用户可以通过在 Field
元数据结构中的 custom_metadata
中设置某些 KeyValue
对来定义自定义的“扩展”类型。这些扩展键是:
'ARROW:extension:name'
for the string name identifying the custom data type. We recommend that you use a “namespace”-style prefix for extension type names to minimize the possibility of conflicts with multiple Arrow readers and writers in the same application. For example, usemyorg.name_of_type
instead of simplyname_of_type
用于标识自定义数据类型的字符串名称。我们建议您使用“命名空间”风格的前缀来命名扩展类型,以最小化在同一应用程序中的多个 Arrow 读取器和写入器之间的冲突可能性。例如,使用myorg.name_of_type
而不仅仅是name_of_type
'ARROW:extension:metadata'
for a serialized representation of theExtensionType
necessary to reconstruct the custom type
'ARROW:extension:metadata'
用于序列化表示ExtensionType
,以重构自定义类型
Note 注意
Extension names beginning with arrow.
are reserved for
canonical extension types,
they should not be used for third-party extension types.
以 arrow.
开头的扩展名是为规范扩展类型保留的,不应用于第三方扩展类型。
This extension metadata can annotate any of the built-in Arrow logical
types. The intent is that an implementation that does not support an
extension type can still handle the underlying data. For example a
16-byte UUID value could be embedded in FixedSizeBinary(16)
, and
implementations that do not have this extension type can still work
with the underlying binary values and pass along the
custom_metadata
in subsequent Arrow protocol messages.
此扩展元数据可以注释 Arrow 内置的任何逻辑类型。其意图是,即使某个实现不支持扩展类型,仍然可以处理底层数据。例如,16 字节的 UUID 值可以嵌入 FixedSizeBinary(16)
中,而不具有此扩展类型的实现仍然可以处理底层的二进制值,并在后续的 Arrow 协议消息中传递 custom_metadata
。
Extension types may or may not use the
'ARROW:extension:metadata'
field. Let’s consider some example
extension types:
扩展类型可能会也可能不会使用 'ARROW:extension:metadata'
字段。让我们考虑一些示例扩展类型:
uuid
represented asFixedSizeBinary(16)
with empty metadata
uuid
表示为FixedSizeBinary(16)
,元数据为空latitude-longitude
represented asstruct<latitude: double, longitude: double>
, and empty metadata
latitude-longitude
表示为struct<latitude: double, longitude: double>
,并且元数据为空tensor
(multidimensional array) stored asBinary
values and having serialized metadata indicating the data type and shape of each value. This could be JSON like{'type': 'int8', 'shape': [4, 5]}
for a 4x5 cell tensor.
tensor
(多维数组)以Binary
值的形式存储,并具有序列化的元数据,指示每个值的数据类型和形状。这可以像{'type': 'int8', 'shape': [4, 5]}
一样的 JSON,用于 4x5 单元的张量。trading-time
represented asTimestamp
with serialized metadata indicating the market trading calendar the data corresponds to
trading-time
表示为Timestamp
,序列化元数据指示数据对应的市场交易日历
See also 参见
Implementation guidelines#
实施指南#
An execution engine (or framework, or UDF executor, or storage engine,
etc) can implement only a subset of the Arrow spec and/or extend it
given the following constraints:
执行引擎(或框架,或 UDF 执行器,或存储引擎等)只能实现 Arrow 规范的一个子集,并/或在满足以下约束的情况下对其进行扩展:
Implementing a subset of the spec#
实现规范的一个子集 #
If only producing (and not consuming) arrow vectors: Any subset of the vector spec and the corresponding metadata can be implemented.
如果仅生产(而不消费)箭头向量:可以实现向量规格的任何子集及其对应的元数据。If consuming and producing vectors: There is a minimal subset of vectors to be supported. Production of a subset of vectors and their corresponding metadata is always fine. Consumption of vectors should at least convert the unsupported input vectors to the supported subset (for example Timestamp.millis to timestamp.micros or int32 to int64).
如果消费和生产向量:需要支持的向量有一个最小的子集。生产一个向量的子集及其相应的元数据总是可以的。消费向量至少应将不支持的输入向量转换为支持的子集(例如,将 Timestamp.millis 转换为 timestamp.micros 或 int32 转换为 int64)。
Extensibility# 可扩展性 #
An execution engine implementor can also extend their memory
representation with their own vectors internally as long as they are
never exposed. Before sending data to another system expecting Arrow
data, these custom vectors should be converted to a type that exist in
the Arrow spec.
执行引擎实现者也可以使用他们自己的向量扩展内部的内存表示,只要他们永远不被暴露。在将数据发送到期望接收 Arrow 数据的另一个系统之前,这些自定义向量应该被转换为 Arrow 规范中存在的类型。