这是用户在 2025-1-9 20:53 为 https://app.immersivetranslate.com/pdf-pro/9acd99c9-a3e7-4488-a489-c9801a72403f 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Multiary Wavelet Trees in Practice
多元小波树的实践

Alex Bowe  亚历克斯·博威alex.bowe@rmit.edu.au

Honours Thesis  荣誉论文
Supervisor: Simon Puglisi
主管:西蒙·普格利西

School of Computer Science and Information Technology
计算机科学与信息技术学院
RMIT University  RMIT 大学Melbourne, AUSTRALIA  墨尔本,澳大利亚

November 2010  2010 年 11 月

Abstract  摘要

Self-index data structures for pattern matching based on Suffix Arrays and the Burrows-Wheeler Transform have recently grown popular. The fundamental operation in these self-indexes is the rank query: rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) requests the number of occurrences of symbol c c cc before position i i ii in a string. The wavelet tree is currently the data structure of choice for implementing rank queries. Current wavelet tree implementations encode the Burrows-Wheeler Transform as a hierarchy of binary strings, which they then store as RRR sequences; the RRR structure offers O ( 1 ) O ( 1 ) O(1)O(1) rank queries and zeroth-order entropy compression for binary strings. A generalisation of the RRR tecnique extends wavelet trees to have higher order encoding, that is, increased branching, in theory making traversal faster. To support the implementation of such a multiary wavelet tree, this thesis investigates the generalisation of RRR to sequences over small alphabets. We also analyse the use of concatenated bitmaps to represent sequences over small alphabets, thus allowing continued use of the binary RRR structure. Our results show that multiary wavelet trees are faster than their binary counterparts, but require large amounts of memory in the case of the generalised RRR. We also show binary RRR on concatenated bitmaps to be an effective, practical alternative.
基于后缀数组和 Burrows-Wheeler 变换的自索引数据结构在模式匹配中最近变得流行。这些自索引中的基本操作是排名查询: rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) 请求在字符串中位置 i i ii 之前符号 c c cc 的出现次数。目前,波形树是实现排名查询的首选数据结构。目前的波形树实现将 Burrows-Wheeler 变换编码为二进制字符串的层次结构,然后将其存储为 RRR 序列;RRR 结构为二进制字符串提供 O ( 1 ) O ( 1 ) O(1)O(1) 排名查询和零阶熵压缩。RRR 技术的一个推广扩展了波形树,使其具有更高阶的编码,即增加分支,理论上使遍历更快。为了支持这种多元波形树的实现,本论文研究了 RRR 对小字母表序列的推广。我们还分析了使用连接位图表示小字母表序列,从而允许继续使用二进制 RRR 结构。 我们的结果表明,多元小波树比二元小波树更快,但在广义 RRR 的情况下需要大量内存。我们还展示了在连接位图上使用二元 RRR 是一种有效的、实用的替代方案。

Keywords: data structures, algorithms, string search, compression, indexes
关键词:数据结构,算法,字符串搜索,压缩,索引

Contents  内容

1 Introduction … 3  1 引言 … 3
1.1 Our Contribution … 4
1.1 我们的贡献 … 4

1.2 Roadmap … 4
1.2 路线图 … 4

2 Preliminaries … 6
2 前言 … 6

2.1 Background … 6
2.1 背景 … 6

2.2 Suffix Arrays … 6
2.2 后缀数组 … 6

2.3 Burrows-Wheeler Transform … 7
2.3 Burrows-Wheeler 变换 … 7

2.4 Rank Query … 7
2.4 排名查询 … 7

2.5 Backward Search … 8
2.5 向后搜索 … 8

3 Binary Wavelet Trees and RRR … 11
3 二进制小波树和 RRR … 11

3.1 Binary Wavelet Trees … 11
3.1 二进制小波树 … 11

3.2 RRR … 13
4 Multiary Wavelet Trees and Generalised RRR … 16
4 多元小波树和广义 RRR … 16

4.1 Multiary Wavelet Trees … 16
4.1 多元小波树 … 16

4.2 Variation 1 : Uncompressed … 16
4.2 变体 1 : 未压缩 … 16

4.3 Variation 2: Multi-Binary RRR … 17
4.3 变体 2:多二进制 RRR … 17

4.4 Generalised RRR … 17
4.4 一般化 RRR … 17

5 Experiments … 19
5 实验 … 19

5.1 Hypotheses … 19
5.1 假设 … 19

5.2 Method … 19
5.2 方法 … 19

5.3 Test Data … 20
5.3 测试数据 … 20

6 Results … 21
6 结果 … 21

7 Conclusion … 31
7 结论 … 31

8 Future Work … 31
8 未来工作 … 31

9 Acknowledgements … 33
9 致谢 … 33

1 Introduction  1 引言

As collections of text grow larger, our need to find information in them and infer patterns and rankings increases, too. Suffix arrays, first described by Manber and Myers [18], allow a variety of complex pattern matching and pattern discovery problems to be performed in optimal time. There are many areas where suffix arrays are likely the most appropriate data structure for the task, including:
随着文本集合的增大,我们在其中查找信息和推断模式及排名的需求也在增加。后缀数组,首次由 Manber 和 Myers 描述[18],允许以最佳时间执行各种复杂的模式匹配和模式发现问题。在许多领域,后缀数组可能是执行该任务的最合适数据结构,包括:
  • Genome analysis [1, 9];
    基因组分析 [1, 9];
  • Searching for patterns in non-word data such as images, multimedia signals and DNA [5];
    在非文本数据中寻找模式,例如图像、多媒体信号和 DNA [5];
  • Searching for patterns in oriental languages; as some oriental languages do not have spaces between certain particles, an inverted file would be insufficient;
    在东方语言中寻找模式;由于某些东方语言在某些成分之间没有空格,反向文件将不够用;
  • Pattern discovery and visualisation using arc diagrams, as proposed by Wattenberg [27].
    使用弧形图进行模式发现和可视化,如 Wattenberg [27]所提议的。

Figure 1: Overview of the key data structures and their relationship with each other. When a Suffix Array and a Burrows-Wheeler Transform (BWT) [3] are combined, they form an FM-Index [7]. FM-Indexes use Backward Searches on the BWT to provide fast pattern matching, counting, and substring extraction operations. Backward Search requires rank operations, which are best implemented using a Wavelet Tree (WT) [12]. A WT encodes a string as a hierarchy of bit vectors, which it uses to answer general rank queries using log σ log σ log sigma\log \sigma binary rank queries. Binary rank queries can be answered in O ( 1 ) O ( 1 ) O(1)O(1) time when the bit vector is stored as a RRR sequence [25], which utilise a global table of pre-calculated ranks. The RRR data structure also offers compression of the Wavelet Tree. (we defer a detailed discussion of RRR to Section 3.2)
图 1:关键数据结构及其相互关系的概述。当后缀数组和 Burrows-Wheeler 变换(BWT)[3]结合时,它们形成 FM-索引[7]。FM-索引在 BWT 上使用反向搜索,以提供快速的模式匹配、计数和子字符串提取操作。反向搜索需要秩操作,最佳实现方式是使用波形树(WT)[12]。WT 将字符串编码为位向量的层次结构,利用这些位向量回答一般的秩查询,使用 log σ log σ log sigma\log \sigma 二进制秩查询。当位向量存储为 RRR 序列[25]时,二进制秩查询可以在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内回答,这利用了预计算秩的全局表。RRR 数据结构还提供了波形树的压缩。(我们将在 3.2 节中详细讨论 RRR)
Due to its performance in these important applications, the improvement of suffix arrays has been the focus of intensive research over the past 20 years. Self-indexing is one of these improvements. Self indexes support fast pattern counting (reporting the number of occurrences of a pattern in a text), fast pattern matching (reporting the positions of each pattern occurrence) and extraction of arbitrary substrings of the original text, including the original text
由于在这些重要应用中的表现,后缀数组的改进在过去 20 年中一直是密集研究的重点。自我索引是这些改进之一。自我索引支持快速模式计数(报告模式在文本中出现的次数)、快速模式匹配(报告每个模式出现的位置)以及提取原始文本的任意子字符串,包括原始文本。

itself. These operations in some sense allow us to replace the original text with a self index.
本身。这些操作在某种意义上允许我们用自我索引替换原始文本。
One such self index is known as the ‘FM-Index’, proposed by Ferragina and Manzini [7]. An FM-Index utilises a sparse Suffix Array and the BurrowsWheeler Transform (BWT) [3] (the two leftmost diagrams in Figure 1) of the original text, and supports fast pattern matching operations by using the socalled ‘backward search’ method.
一种这样的自索引被称为“FM-Index”,由 Ferragina 和 Manzini 提出。FM-Index 利用稀疏后缀数组和原始文本的 Burrows-Wheeler 变换(BWT),并通过使用所谓的“向后搜索”方法支持快速模式匹配操作。
Backward searching with a BWT requires a fundamental operation called a ‘rank query’, which counts the number of occurrences of a given symbol up to a given position in the string.
使用 BWT 进行反向搜索需要一个基本操作,称为“排名查询”,它计算在字符串中给定位置之前给定符号的出现次数。
While a naive implementation of the rank query operation might inspect every symbol of the BWT, and do so in O ( N ) O ( N ) O(N)O(N) time where N N NN is the length of the BWT, it is possible to improve this to O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) time where σ σ sigma\sigma is the size of the alphabet. To do this, we construct a Wavelet Tree [12] over the BWT (the third diagram in Figure 1). A Wavelet Tree is constructed by encoding the string as a hierarchy of bit vectors. These bit vectors are then used to answer rank queries by traversing and performing binary rank queries 1 1 ^(1){ }^{1}. A more detailed description of Wavelet Trees appears in Section 3.
虽然对秩查询操作的简单实现可能会检查 BWT 的每个符号,并且在 O ( N ) O ( N ) O(N)O(N) 时间内完成,其中 N N NN 是 BWT 的长度,但可以将其改进为 O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) 时间,其中 σ σ sigma\sigma 是字母表的大小。为此,我们在 BWT 上构建一个 Wavelet Tree [12](图 1 中的第三个图)。Wavelet Tree 是通过将字符串编码为位向量的层次结构来构建的。这些位向量随后用于通过遍历和执行二进制秩查询 1 1 ^(1){ }^{1} 来回答秩查询。Wavelet Trees 的更详细描述出现在第 3 节。
Binary rank queries can be performed in O ( 1 ) O ( 1 ) O(1)O(1) time when the bit vectors are stored in a RRR sequence, as proposed by Raman, Raman and Rao [25]. RRR also offers compression of the Wavelet Tree.
当位向量存储在 RRR 序列中时,二进制秩查询可以在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内执行,正如 Raman、Raman 和 Rao [25]所提出的。RRR 还提供了小波树的压缩。
The key motivation behind this project arises from the increasing number of papers, such as Ferragina et al. 2007 [8], Yu et al. 2009 [29], and Barbay et al. 2010 [2], which utilise Multiary Wavelet Trees as a theoretical tool, that is, Wavelet Trees with a branching factor greater than two. However no known implementations of Multiary Wavelet Trees exist. This thesis aims to address this need, and bring theory closer to practice. It was expected that increasing arity would improve the time performance of self-indexes which use a BWT.
该项目的主要动机源于越来越多的论文,例如 Ferragina et al. 2007 [8]、Yu et al. 2009 [29] 和 Barbay et al. 2010 [2],它们将多元小波树作为一种理论工具,即分支因子大于二的小波树。然而,目前尚不存在多元小波树的已知实现。本论文旨在满足这一需求,使理论更接近实践。预计增加元数将改善使用 BWT 的自索引的时间性能。

1.1 Our Contribution  1.1 我们的贡献

The contribution of this paper is the experimental analysis of two types of Multiary Wavelet Tree. The first of these, to our knowledge, is the first implementation of the Generalised RRR structure, which is used to support rank queries on small alphabets, as first suggested by Ferragina et al. [8].
本文的贡献是对两种多元小波树的实验分析。其中第一个,据我们所知,是对广义 RRR 结构的首次实现,该结构用于支持对小字母表的秩查询,正如 Ferragina 等人[8]首次提出的。
We also propose a new data structure, a Multiary Wavelet Tree that utilises Binary RRR using concatenated bitmaps to represent sequences on small alphabets as bit vectors.
我们还提出了一种新的数据结构,多元小波树,它利用二进制 RRR 通过连接位图来表示小字母表上的序列为位向量。
We discover that although Multiary Wavelet Trees are faster, Generalised RRR requires significant memory for the supporting global table in its current form, which, depending on the size of the text, may make these Multiary Wavelet Trees impractical. We show that our new data structure, which continues to use Binary RRR, is a practical way to make rank queries faster. Additionally, we observe that the BWT gives rise to a sparse class list as the arity increases.
我们发现,尽管多元小波树更快,但通用 RRR 在其当前形式下需要大量内存来支持全局表,这可能会使这些多元小波树在文本大小的不同情况下变得不切实际。我们展示了我们新的数据结构,它继续使用二进制 RRR,是一种使秩查询更快的实用方法。此外,我们观察到,随着元数的增加,BWT 产生了一个稀疏的类列表。

1.2 Roadmap  1.2 路线图

The rest of the thesis is organised as follows. Section 2 begins with some notation, followed by background on the problem domain. We then discuss how
论文的其余部分组织如下。第二节首先介绍一些符号,然后是问题领域的背景。接下来我们讨论如何
each data structure is built and used together.
每个数据结构都是一起构建和使用的。

Section 3 provides a description of Binary Wavelet Trees and their use of the RRR structure, followed by a discussion in Section 4 of how our Multiary Wavelet Trees are designed, including some implementation details of interest.
第 3 节提供了二进制小波树及其使用 RRR 结构的描述,接着在第 4 节讨论了我们的多元小波树是如何设计的,包括一些有趣的实现细节。
In Section 5 we describe how we measured the time and space performance for each Multiary Wavelet Tree variation, and provide a rationale of our test dataset. Our results are presented in Section 6, accompanied by a discussion of the apparent trends.
在第 5 节中,我们描述了如何测量每种多元小波树变体的时间和空间性能,并提供了我们的测试数据集的理由。我们的结果在第 6 节中呈现,并伴随对明显趋势的讨论。
The conclusion follows in Section 7. We discuss the practicality of each Multiary Wavelet Tree variation, and in Section 8 we consider how their performance might be improved.
结论在第 7 节中给出。我们讨论了每种多元小波树变体的实用性,在第 8 节中我们考虑如何提高它们的性能。

2 Preliminaries  2 前言

Figure 2: Array representation of ‘mississippi’ string.
图 2:‘mississippi’字符串的数组表示。

Throughout this paper we represent the string we are searching in as S S SS of length | S | = N | S | = N |S|=N|S|=N, and the pattern we are searching for as P . S [ i ] P . S [ i ] P.S[i]P . S[i] represents the symbol located at position i i ii in S S SS, and S [ i . . j ] S [ i . . j ] S[i..j]S[i . . j] represents the substring of S S SS beginning at position i i ii and ending at j j jj inclusive. Strings are one-based, so in Figure 2 S [ 1 ] = 2 S [ 1 ] = 2S[1]=2 S[1]= ’ m ', S [ 3 ] = S [ 3 ] = S[3]=S[3]= ’ s ', and S [ 1 . .3 ] = S [ 1 . .3 ] = S[1..3]=S[1 . .3]= ‘mis’.
在本文中,我们将要搜索的字符串表示为 S S SS ,长度为 | S | = N | S | = N |S|=N|S|=N ,我们要搜索的模式表示为 P . S [ i ] P . S [ i ] P.S[i]P . S[i] ,表示位于 S S SS 中位置 i i ii 的符号,而 S [ i . . j ] S [ i . . j ] S[i..j]S[i . . j] 表示从位置 i i ii 开始并在 j j jj 结束(包括)的 S S SS 的子字符串。字符串是基于 1 的,因此在图 2 S [ 1 ] = 2 S [ 1 ] = 2S[1]=2 S[1]= 中,'m'、's' 和 'mis'。
The i th i th  i^("th ")i^{\text {th }} suffix is thus defined as S [ i . . N ] S [ i . . N ] S[i..N]S[i . . N], so the 1 st 1 st  1^("st ")1^{\text {st }} suffix in Figure 2 is S [ 1 . .12 ] = S [ 1 . .12 ] = S[1..12]=S[1 . .12]= 'mississipi $ $ $\$ ', and the 5 th 5 th  5^("th ")5^{\text {th }} suffix is S [ 5 . .12 ] = S [ 5 . .12 ] = S[5..12]=S[5 . .12]= ‘issippi$’. The i th i th  i^("th ")i^{\text {th }} prefix is defined as S [ 1 . . i ] S [ 1 . . i ] S[1..i]S[1 . . i], so the 5 th 5 th  5^("th ")5^{\text {th }} prefix in Figure 2 is S [ 1 . .5 ] = S [ 1 . .5 ] = S[1..5]=S[1 . .5]= ‘missi’.
因此, i th i th  i^("th ")i^{\text {th }} 后缀被定义为 S [ i . . N ] S [ i . . N ] S[i..N]S[i . . N] ,所以图 2 中的 1 st 1 st  1^("st ")1^{\text {st }} 后缀是 S [ 1 . .12 ] = S [ 1 . .12 ] = S[1..12]=S[1 . .12]= 'mississipi $ $ $\$ ',而 5 th 5 th  5^("th ")5^{\text {th }} 后缀是 S [ 5 . .12 ] = S [ 5 . .12 ] = S[5..12]=S[5 . .12]= 'issippi$'。 i th i th  i^("th ")i^{\text {th }} 前缀被定义为 S [ 1 . . i ] S [ 1 . . i ] S[1..i]S[1 . . i] ,所以图 2 中的 5 th 5 th  5^("th ")5^{\text {th }} 前缀是 S [ 1 . .5 ] = S [ 1 . .5 ] = S[1..5]=S[1 . .5]= 'missi'。
The log log log\log operation is base 2 unless otherwise stated.
除非另有说明, log log log\log 操作是以 2 为基数。

2.1 Background  2.1 背景

In 1970 Knuth, Morris and Pratt (KMP) discovered an algorithm to match patterns in time proportional to the length of the text [14, 20]. If the text is large, then KMP becomes ineffective for ranking and pattern discovery. Moreover, KMP is only useful for exact matches.
在 1970 年,Knuth、Morris 和 Pratt(KMP)发现了一种算法,可以在与文本长度成比例的时间内匹配模式[14, 20]。如果文本很大,那么 KMP 在排名和模式发现方面变得无效。此外,KMP 仅对精确匹配有用。
One alternative to KMP for document ranking is the use of an inverted index, but they must work with keywords and are thus inappropriate for many applications, such as searches on certain oriental languages, and other strings that do not have a clear definition of keywords (MIDI, for example). Suffix arrays are also more efficient than inverted files for searching phrases or partial patterns [ 13 , 19 ] [ 13 , 19 ] [13,19][13,19]. This was originally possible with the suffix tree [20, 28] (a precursor to the suffix array), although suffix trees require three to five times as much space [18].
KMP 的一个替代方案是使用倒排索引,但它们必须与关键词一起工作,因此不适合许多应用,例如某些东方语言的搜索,以及其他没有明确关键词定义的字符串(例如 MIDI)。后缀数组在搜索短语或部分模式时也比倒排文件更高效 [ 13 , 19 ] [ 13 , 19 ] [13,19][13,19] 。这最初是通过后缀树[20, 28](后缀数组的前身)实现的,尽管后缀树需要三到五倍的空间[18]。

2.2 Suffix Arrays  2.2 后缀数组

In its simplest form, a suffix array can be constructed for a string S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N] like so:
在最简单的形式中,可以为字符串 S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N] 构建后缀数组,如下所示:
  1. Construct an array of pointers to all suffixes S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S[1..N],S[2..N],dots,S[N..N]S[1 . . N], S[2 . . N], \ldots, S[N . . N].
    构造一个指向所有后缀的指针数组 S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S[1..N],S[2..N],dots,S[N..N]S[1 . . N], S[2 . . N], \ldots, S[N . . N]
  2. Sort these pointers by the lexicographical (i.e. alphabetical) ordering of their associated suffixes.
    按其相关后缀的字典顺序(即字母顺序)对这些指针进行排序。
Figure 2 shows an example string, ‘mississippi’. The construction of the corresponding suffix array is shown in Figure 3. Construction of suffix arrays is now a well studied problem [24].
图 2 显示了一个示例字符串,‘mississippi’。相应后缀数组的构造如图 3 所示。后缀数组的构造现在是一个研究得很透彻的问题[24]。

Figure 3: Construction of Suffix Array for ‘mississippi’.
图 3:为“mississippi”构建后缀数组。
BWT SA
i 1 2 1 2 12\mathbf{1 2} $
p 1 1 1 1 11\mathbf{1 1} i$
s 8 ippi$
s 5 5 5\mathbf{5} issippi$  密西西比$
m 2 2 2\mathbf{2} ississippi$  密西西比$
$ 1 mississippi$  密西西比$
p 10 pi$
i 9 ppi$
s 7 sippi$
s 4 sissippi$
i 6 ssippi$
i 3 ssissippi$
BWT SA i 12 $ p 11 i$ s 8 ippi$ s 5 issippi$ m 2 ississippi$ $ 1 mississippi$ p 10 pi$ i 9 ppi$ s 7 sippi$ s 4 sissippi$ i 6 ssippi$ i 3 ssissippi$| BWT | SA | | | :---: | :---: | :--- | | i | $\mathbf{1 2}$ | $ | | p | $\mathbf{1 1}$ | i$ | | s | 8 | ippi$ | | s | $\mathbf{5}$ | issippi$ | | m | $\mathbf{2}$ | ississippi$ | | $ | 1 | mississippi$ | | p | 10 | pi$ | | i | 9 | ppi$ | | s | 7 | sippi$ | | s | 4 | sissippi$ | | i | 6 | ssippi$ | | i | 3 | ssissippi$ |
Figure 4: Suffix Array and Burrows-Wheeler Transform for ‘mississippi’ string.
图 4:‘mississippi’字符串的后缀数组和 Burrows-Wheeler 变换。

2.3 Burrows-Wheeler Transform
2.3 巴罗斯-惠勒变换

The Burrows-Wheeler Transform (BWT) is a string of length N N NN defined by the suffix array S A S A SAS A and the original text S S SS. In particular, B W T [ i ] = S [ S A [ i ] 1 ] B W T [ i ] = S [ S A [ i ] 1 ] BWT[i]=S[SA[i]-1]B W T[i]=S[S A[i]-1], and B W T [ 1 ] = B W T [ 1 ] = BWT[1]=B W T[1]= $ $ $\$ ', that is, the i t h i t h i^(th)i^{t h} symbol of the BWT is the symbol prior to the i t h i t h i^(th)i^{t h} suffix in the Suffix Array S A S A SAS A. See Figure 4.
Burrows-Wheeler 变换(BWT)是由后缀数组 S A S A SAS A 和原始文本 S S SS 定义的长度为 N N NN 的字符串。特别地, B W T [ i ] = S [ S A [ i ] 1 ] B W T [ i ] = S [ S A [ i ] 1 ] BWT[i]=S[SA[i]-1]B W T[i]=S[S A[i]-1] ,以及 B W T [ 1 ] = B W T [ 1 ] = BWT[1]=B W T[1]= $ $ $\$ ',也就是说,BWT 的 i t h i t h i^(th)i^{t h} 符号是后缀数组 S A S A SAS A i t h i t h i^(th)i^{t h} 后缀之前的符号。见图 4。
As proposed by Ferragina and Manzini, when a BWT is stored alongside a Suffix Array, it is known as a FM-Index [7], which supports backwards search.
根据 Ferragina 和 Manzini 的提议,当 BWT 与后缀数组一起存储时,它被称为 FM-Index [7],支持向后搜索。

2.4 Rank Query  2.4 排名查询

Munro [21] describes how to do rank queries on binary strings in O ( 1 ) O ( 1 ) O(1)O(1) time using o ( n ) o ( n ) o(n)o(n) bits of extra space. Much of the early work on rank queries focussed on binary strings.
Munro [21] 描述了如何在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内使用 o ( n ) o ( n ) o(n)o(n) 位额外空间对二进制字符串进行排名查询。早期关于排名查询的许多工作集中在二进制字符串上。
A rank query on the string S S SS is defined as rank S ( i , c ) = n rank S ( i , c ) = n rank_(S)(i,c)=n\operatorname{rank}_{S}(i, c)=n, with n n nn being the number of times symbol c c cc appears in the range S [ 1 , i ] S [ 1 , i ] S[1,i]S[1, i]. This paper omits the subscript when the string we are querying is clear from the context. For example in Figure 5 , rank ( 9 , s ) = 3 5 , rank ( 9 , s ) = 3 5,rank(9,s)=35, \operatorname{rank}(9, s)=3. If i 0 i 0 i <= 0i \leq 0 then rank ( i , c ) = 0 rank ( i , c ) = 0 rank(i,c)=0\operatorname{rank}(i, c)=0.
对字符串 S S SS 的排名查询定义为 rank S ( i , c ) = n rank S ( i , c ) = n rank_(S)(i,c)=n\operatorname{rank}_{S}(i, c)=n ,其中 n n nn 是符号 c c cc 在范围 S [ 1 , i ] S [ 1 , i ] S[1,i]S[1, i] 中出现的次数。本文在查询的字符串从上下文中清晰可见时省略下标。例如在图 5 , rank ( 9 , s ) = 3 5 , rank ( 9 , s ) = 3 5,rank(9,s)=35, \operatorname{rank}(9, s)=3 中。如果 i 0 i 0 i <= 0i \leq 0 rank ( i , c ) = 0 rank ( i , c ) = 0 rank(i,c)=0\operatorname{rank}(i, c)=0

Figure 5: Rank query rank ( 9 , s ) = 3 rank ( 9 , s ) = 3 rank(9,s)=3\operatorname{rank}(9, s)=3 on the Burrows-Wheeler Transform of ‘mississippi’.
图 5:在“mississippi”的 Burrows-Wheeler 变换上进行的排名查询 rank ( 9 , s ) = 3 rank ( 9 , s ) = 3 rank(9,s)=3\operatorname{rank}(9, s)=3
Since any pattern P P PP in S S SS is a prefix of a suffix, and because the suffixes are lexicographically ordered, all occurrences of a search pattern P P PP lie in a contiguous portion of the Suffix Array. In earlier implementations, the range that this pattern lies on would be located by using successive binary searches. Backward search utilises the BWT in a series of paired rank queries, improving the query performance considerably [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [4,6,8,10,16,17,19,22][4,6,8,10,16,17,19,22]
由于 S S SS 中的任何模式 P P PP 都是后缀的前缀,并且后缀是按字典顺序排列的,因此搜索模式 P P PP 的所有出现都位于后缀数组的一个连续部分。在早期的实现中,这个模式所在的范围是通过使用连续的二分搜索来定位的。反向搜索利用 BWT 进行一系列配对排名查询,显著提高了查询性能 [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [4,6,8,10,16,17,19,22][4,6,8,10,16,17,19,22]
Backward search issues | P | | P | |P||P| pairs of rank queries, where | P | | P | |P||P| denotes the length of the pattern. The paired rank queries are:
反向搜索问题 | P | | P | |P||P| 对排名查询,其中 | P | | P | |P||P| 表示模式的长度。配对的排名查询是:
s = C [ P [ i ] ] + rank ( s 1 , P [ i ] ) + 1 e = C [ P [ i ] ] + rank ( e , P [ i ] ) s = C [ P [ i ] ] + rank ( s 1 , P [ i ] ) + 1 e = C [ P [ i ] ] + rank ( e , P [ i ] ) {:[s^(')=C[P[i]]+rank(s-1","P[i])+1],[e^(')=C[P[i]]+rank(e","P[i])]:}\begin{gathered} s^{\prime}=C[P[i]]+\operatorname{rank}(s-1, P[i])+1 \\ e^{\prime}=C[P[i]]+\operatorname{rank}(e, P[i]) \end{gathered}
Where s s ss denotes the start of the range, initially at s = 1 s = 1 s=1s=1, and e e ee is the end of the range, initially e = N e = N e=Ne=N.
其中 s s ss 表示范围的起始,最初为 s = 1 s = 1 s=1s=1 ,而 e e ee 是范围的结束,最初为 e = N e = N e=Ne=N
In Figure 7 there is a column F F FF, which contains the first symbol for each suffix. Note that the F F FF column is not stored as we store it encoded as C C CC instead.
在图 7 中,有一列 F F FF ,其中包含每个后缀的第一个符号。请注意, F F FF 列没有存储,因为我们将其编码为 C C CC 来存储。

C C CC is an array 2 2 ^(2){ }^{2} containing the count of all symbols in Σ Σ Sigma\Sigma which sort lexicographically before P [ i ] P [ i ] P[i]P[i], where Σ Σ Sigma\Sigma is the alphabet from our original string S S SS, as in Figure 6.
C C CC 是一个数组 2 2 ^(2){ }^{2} ,包含在 Σ Σ Sigma\Sigma 中按字典顺序排在 P [ i ] P [ i ] P[i]P[i] 之前的所有符号的计数,其中 Σ Σ Sigma\Sigma 是我们原始字符串 S S SS 中的字母表,如图 6 所示。
In the first iteration we query the final character of the pattern, so i = | P | i = | P | i=|P|i=|P|. For each iteration, we decrement i i ii until i = 1 i = 1 i=1i=1. This maintains the invariant that S A [ s . . e ] S A [ s . . e ] SA[s..e]S A[s . . e] contains all the suffixes of which P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] is a prefix, and hence all locations of P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] in S S SS. This is illustrated in Figure 8 through to Figure 10. If at any stage e < s e < s e < se<s, then the pattern does not exist in our original string.
在第一次迭代中,我们查询模式的最后一个字符,因此 i = | P | i = | P | i=|P|i=|P| 。对于每次迭代,我们递减 i i ii ,直到 i = 1 i = 1 i=1i=1 。这保持了不变式,即 S A [ s . . e ] S A [ s . . e ] SA[s..e]S A[s . . e] 包含所有以 P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] 为前缀的后缀,因此包含 S S SS P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] 的所有位置。这在图 8 到图 10 中进行了说明。如果在任何阶段 e < s e < s e < se<s ,那么模式在我们的原始字符串中不存在。
An example is given in Figure 7 through to Figure 10, where the pattern ’ i s s i s s issi s s ’ is searched for in the string ‘mississippi’, starting with i = 3 , P [ 3 ] = i = 3 , P [ 3 ] = i=3,P[3]=i=3, P[3]= s s ss '. The working for each rank query is shown below each figure. We represent the current symbol as c c cc to avoid confusion between ’ s ’ and s s ss and s s s^(')s^{\prime}.
图 7 到图 10 中给出了一个示例,其中在字符串‘mississippi’中搜索模式’ i s s i s s issi s s ’,从 i = 3 , P [ 3 ] = i = 3 , P [ 3 ] = i=3,P[3]=i=3, P[3]= s s ss '开始。每个等级查询的工作在每个图形下方显示。我们将当前符号表示为 c c cc ,以避免在’s’和 s s ss s s s^(')s^{\prime} 之间的混淆。

Figure 6: Table C C CC of number of occurrences in F F FF of each symbol which sorts alphabetically before the displayed symbol.
图 6:表 C C CC 中每个符号在 F F FF 中出现的次数,按字母顺序排列在显示符号之前。
F BWT SA
1 - $ i 12 $ s = 1 s = 1 quad s=1\quad s=1
2 i p 11 i$ e = 12
3 i S 8 ippi$
4 i s 5 issippi$  密西西比$
5 i m 2 ississippi$  密西西比$
6 m $ 1 mississippi$  密西西比$
7 p p 10 pi$
8 p i 9 ppi$
9 s s 7 sippi$
10 S s 4 sissippi$
11 S i 6 ssippi$
12 - S i 3 ssissippi$
F BWT SA 1 - $ i 12 $ quad s=1 2 i p 11 i$ e = 12 3 i S 8 ippi$ 4 i s 5 issippi$ 5 i m 2 ississippi$ 6 m $ 1 mississippi$ 7 p p 10 pi$ 8 p i 9 ppi$ 9 s s 7 sippi$ 10 S s 4 sissippi$ 11 S i 6 ssippi$ 12 - S i 3 ssissippi$| | | F | BWT | SA | | | :---: | :---: | :---: | :---: | :---: | :---: | | 1 | - | $ | i | 12 | $ $\quad s=1$ | | 2 | | i | p | 11 | i$ e = 12 | | 3 | | i | S | 8 | ippi$ | | 4 | | i | s | 5 | issippi$ | | 5 | | i | m | 2 | ississippi$ | | 6 | | m | $ | 1 | mississippi$ | | 7 | | p | p | 10 | pi$ | | 8 | | p | i | 9 | ppi$ | | 9 | | s | s | 7 | sippi$ | | 10 | | S | s | 4 | sissippi$ | | 11 | | S | i | 6 | ssippi$ | | 12 | - | S | i | 3 | ssissippi$ |
Figure 7: First stage of backwards search for ‘iss’ on ‘mississippi’ string - before any rank queries have been made.
图 7:在“mississippi”字符串上对“iss”进行反向搜索的第一阶段 - 在进行任何排名查询之前。
  1. Starting from s = 1 s = 1 s=1s=1 and e = 12 e = 12 e=12e=12 as in Figure 7, and c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= ‘s’ where i = 3 i = 3 i=3i=3, we make our first two rank queries:
    s = 1 s = 1 s=1s=1 e = 12 e = 12 e=12e=12 开始,如图 7 所示,以及 c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= 的‘s’,在 i = 3 i = 3 i=3i=3 处,我们进行第一次两个排名查询:
s = C [ c ] + rank ( 0 , c ) + 1 = 8 + 0 + 1 = 9 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 s = C [ c ] + rank ( 0 , c ) + 1 = 8 + 0 + 1 = 9 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 {:[s^(')=C[c]+rank(0","c)+1=8+0+1=9],[e^(')=C[c]+rank(12","c)=8+4=12]:}\begin{gathered} s^{\prime}=C[c]+\operatorname{rank}(0, c)+1=8+0+1=9 \\ e^{\prime}=C[c]+\operatorname{rank}(12, c)=8+4=12 \end{gathered}
Figure 8: Second stage of backwards search for ‘iss’ on ‘mississippi’ string. All the occurrences of ’ s s ss ’ lie in S A [ 9 . .12 ] S A [ 9 . .12 ] SA[9..12]S A[9 . .12].
图 8:在“mississippi”字符串上对“iss”的反向搜索的第二阶段。所有的出现都位于 S A [ 9 . .12 ] S A [ 9 . .12 ] SA[9..12]S A[9 . .12] 中的’ s s ss ’。

2. From s = 9 s = 9 s=9s=9 and e = 11 e = 11 e=11e=11 as in Figure 8 , and c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= ‘s’ where i = 2 i = 2 i=2i=2, our next two rank queries are:
s = 9 s = 9 s=9s=9 e = 11 e = 11 e=11e=11 如图 8 所示,以及 c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= 的‘s’在 i = 2 i = 2 i=2i=2 处,我们的下两个排名查询是:
s = C [ c ] + rank ( 8 , c ) + 1 = 8 + 2 + 1 = 11 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 s = C [ c ] + rank ( 8 , c ) + 1 = 8 + 2 + 1 = 11 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 {:[s^('')=C[c]+rank(8","c)+1=8+2+1=11],[e^('')=C[c]+rank(12","c)=8+4=12]:}\begin{gathered} s^{\prime \prime}=C[c]+\operatorname{rank}(8, c)+1=8+2+1=11 \\ e^{\prime \prime}=C[c]+\operatorname{rank}(12, c)=8+4=12 \end{gathered}
F BWT SA
1 $ i 12 $ s = 11 s = 11 quads=11\quad \mathrm{s}=11
2 i p 11 i$ e = 12
3 i s 8 ippi$
4 i S 5 issippi$  密西西比$
5 i m 2 ississippi$  密西西比$
6 m $ 1 mississippi$  密西西比$
7 p p 10 pi$
8 p i 9 ppi$
9 s s 7 sippi$
10 S s 4 sissippi$
11 - s i 6 ssippi$
12 - s i 3 ssissippi$
F BWT SA 1 $ i 12 $ quads=11 2 i p 11 i$ e = 12 3 i s 8 ippi$ 4 i S 5 issippi$ 5 i m 2 ississippi$ 6 m $ 1 mississippi$ 7 p p 10 pi$ 8 p i 9 ppi$ 9 s s 7 sippi$ 10 S s 4 sissippi$ 11 - s i 6 ssippi$ 12 - s i 3 ssissippi$| | F | BWT | SA | | | :---: | :---: | :---: | :---: | :---: | | 1 | $ | i | 12 | $ $\quad \mathrm{s}=11$ | | 2 | i | p | 11 | i$ e = 12 | | 3 | i | s | 8 | ippi$ | | 4 | i | S | 5 | issippi$ | | 5 | i | m | 2 | ississippi$ | | 6 | m | $ | 1 | mississippi$ | | 7 | p | p | 10 | pi$ | | 8 | p | i | 9 | ppi$ | | 9 | s | s | 7 | sippi$ | | 10 | S | s | 4 | sissippi$ | | 11 | - s | i | 6 | ssippi$ | | 12 | - s | i | 3 | ssissippi$ |
Figure 9: Third stage of backwards search for ‘iss’ on ‘mississippi’ string. All the occurrences of ‘ss’ lie in S A [ 11 . .12 ] S A [ 11 . .12 ] SA[11..12]S A[11 . .12].
图 9:在“mississippi”字符串上对“iss”进行反向搜索的第三阶段。所有“ss”的出现都位于 S A [ 11 . .12 ] S A [ 11 . .12 ] SA[11..12]S A[11 . .12]

3. From s = 11 s = 11 s=11s=11 and e = 12 e = 12 e=12e=12 as in Figure 9 , and c = P [ i ] = i c = P [ i ] = i c=P[i]=‘ic=P[i]=‘ \mathrm{i} ’ where i = 1 i = 1 i=1i=1, our final two rank queries are:
3. 从 s = 11 s = 11 s=11s=11 e = 12 e = 12 e=12e=12 如图 9 所示,以及 c = P [ i ] = i c = P [ i ] = i c=P[i]=‘ic=P[i]=‘ \mathrm{i} ' 在 i = 1 i = 1 i=1i=1 处,我们的最终两个排名查询是:
s = C [ c ] + rank ( 10 , c ) + 1 = 1 + 2 + 1 = 4 e = C [ c ] + rank ( 12 , c ) = 1 + 4 = 5 s = C [ c ] + rank ( 10 , c ) + 1 = 1 + 2 + 1 = 4 e = C [ c ] + rank ( 12 , c ) = 1 + 4 = 5 {:[s^(''')=C[c]+rank(10","c)+1=1+2+1=4],[e^(''')=C[c]+rank(12","c)=1+4=5]:}\begin{gathered} s^{\prime \prime \prime}=C[c]+\operatorname{rank}(10, c)+1=1+2+1=4 \\ e^{\prime \prime \prime}=C[c]+\operatorname{rank}(12, c)=1+4=5 \end{gathered}
1 $ i 12 s = 4 s = 4 s=4s=4
2 i p 11 i$ e =
3 i s 8 ippi$
4 - i s 5 issippi$  密西西比$
5 - i m 2 ississippi$  密西西比$
6 m $ 1 mississippi$  密西西比$
7 p p 10 pi$
8 p i 9 ppi$
9 s s 7 sippi$
10 s s 4 sissippi$
1 s 1 6 ssippi$
12 s i 3 ssissippi$
1 $ i 12 s=4 2 i p 11 i$ e = 3 i s 8 ippi$ 4 - i s 5 issippi$ 5 - i m 2 ississippi$ 6 m $ 1 mississippi$ 7 p p 10 pi$ 8 p i 9 ppi$ 9 s s 7 sippi$ 10 s s 4 sissippi$ 1 s 1 6 ssippi$ 12 s i 3 ssissippi$| 1 | $ | i | 12 | $s=4$ | | :---: | :---: | :---: | :---: | :---: | | 2 | i | p | 11 | i$ e = | | 3 | i | s | 8 | ippi$ | | 4 | - i | s | 5 | issippi$ | | 5 | - i | m | 2 | ississippi$ | | 6 | m | $ | 1 | mississippi$ | | 7 | p | p | 10 | pi$ | | 8 | p | i | 9 | ppi$ | | 9 | s | s | 7 | sippi$ | | 10 | s | s | 4 | sissippi$ | | 1 | s | 1 | 6 | ssippi$ | | 12 | s | i | 3 | ssissippi$ |
Figure 10: Fourth and final stage of backwards search for ‘iss’ on ‘mississippi’ string. All the occurrences of ‘iss’ lie in S A [ 4 . .5 ] S A [ 4 . .5 ] SA[4..5]S A[4 . .5].
图 10:在“mississippi”字符串上对“iss”进行反向搜索的第四个也是最后一个阶段。所有“iss”的出现都位于 S A [ 4 . .5 ] S A [ 4 . .5 ] SA[4..5]S A[4 . .5]

3 Binary Wavelet Trees and RRR
3 二进制小波树和 RRR

This section describes Binary Wavelet Trees, which provide fast rank queries over strings with an alphabet size larger than 2 , and the RRR structure, which is used for fast binary rank queries and compression.
本节描述了二进制小波树,它提供了对字母表大小大于 2 的字符串的快速排名查询,以及 RRR 结构,它用于快速二进制排名查询和压缩。

3.1 Binary Wavelet Trees
3.1 二进制小波树

One of the most effective data structures for answering rank queries is the Wavelet Tree [ 4 , 6 , 8 , 12 , 16 ] [ 4 , 6 , 8 , 12 , 16 ] [4,6,8,12,16][4,6,8,12,16].
回答排名查询最有效的数据结构之一是 Wavelet Tree [ 4 , 6 , 8 , 12 , 16 ] [ 4 , 6 , 8 , 12 , 16 ] [4,6,8,12,16][4,6,8,12,16]
Binary Wavelet Trees encode the BWT (or any string over which we require fast rank queries) as a perfect binary tree of bit vectors, to enable O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) time rank queries, where σ σ sigma\sigma is the size of the alphabet. The tree is defined recursively as follows:
二进制小波树将 BWT(或任何我们需要快速排名查询的字符串)编码为位向量的完美二叉树,以实现 O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) 时间的排名查询,其中 σ σ sigma\sigma 是字母表的大小。该树递归定义如下:
  1. Encoding half the alphabet as 0 , and half as 1 , for example:
    将字母表的一半编码为 0,另一半编码为 1,例如:
Σ = { $ , i , m , p , s } e n c ( Σ ) = { 0 , 0 , 0 , 1 , 1 } Σ = { $ , i , m , p , s } e n c ( Σ ) = { 0 , 0 , 0 , 1 , 1 } {:[Sigma={$","i","m","p","s}],[enc(Sigma)={0","0","0","1","1}]:}\begin{gathered} \Sigma=\{\$, i, m, p, s\} \\ e n c(\Sigma)=\{0,0,0,1,1\} \end{gathered}
  1. Group each 0 -encoded symbol, { $ , i , m } { $ , i , m } {$,i,m}\{\$, i, m\}, as a sub-tree;
    将每个 0 编码符号 { $ , i , m } { $ , i , m } {$,i,m}\{\$, i, m\} 作为子树进行分组;
  2. Group each 1-encoded symbol, { p , s } { p , s } {p,s}\{p, s\}, as a sub-tree;
    将每个 1 编码的符号 { p , s } { p , s } {p,s}\{p, s\} 作为子树进行分组;
  3. Reapply to each sub-tree recursively until there is only one symbol left.
    递归地重新应用于每个子树,直到只剩下一个符号。
The encoded binary Wavelet Tree root node for the ‘mississippi’ BWT is shown in Figure 11. For a more detailed example, showing the whole tree, see Figure 12.
“mississippi” BWT 的编码二进制小波树根节点如图 11 所示。有关更详细的示例,显示整个树,请参见图 12。
1 p S S m $ p p pp S S i i
0 1 1 1 0 0 1 0 1 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12
1 p S S m $ p S S i i 0 1 1 1 0 0 1 0 1 1 0 0 1 2 3 4 5 6 7 8 9 10 11 12| 1 | p | S | S | m | $ | $p$ | | S | S | i | i | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
Figure 11: Root node of Binary Wavelet Tree encoding for ‘mississippi’ BWT. Each symbol in the string is assigned a bit (0 or 1 ) depending on which half of the alphabet it is from.
图 11:‘mississippi’ BWT 的二进制小波树编码的根节点。字符串中的每个符号根据其来自字母表的哪一半被分配一个比特(0 或 1)。
After the tree is constructed, a rank query on the Wavelet Tree can be done with log σ log σ log sigma\log \sigma binary rank queries on the bit vectors. For example, if we wanted to know rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) in Figure 12, we use the following procedure which is illustrated in Figure 14. We know that enc ( e ) = 0 ( e ) = 0 (e)=0(e)=0 at this level, so:
在树构建完成后,可以通过对位向量进行 log σ log σ log sigma\log \sigma 个二进制秩查询来对小波树进行秩查询。例如,如果我们想知道图 12 中的 rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) ,我们使用以下程序,如图 14 所示。我们知道在这个层级上 enc ( e ) = 0 ( e ) = 0 (e)=0(e)=0 ,所以:
  1. At the root node, count the number of 0 s in the range [1…6], which is given by rank ( 6 , 0 ) = 4 rank ( 6 , 0 ) = 4 rank(6,0)=4\operatorname{rank}(6,0)=4. This gives us the index to query in our 0 -child.
    在根节点,计算范围 [1…6] 中 0 的数量,结果为 rank ( 6 , 0 ) = 4 rank ( 6 , 0 ) = 4 rank(6,0)=4\operatorname{rank}(6,0)=4 。这给了我们在 0 -子节点中查询的索引。
  2. Calculate rank ( 4 , 1 ) = 2 rank ( 4 , 1 ) = 2 rank(4,1)=2\operatorname{rank}(4,1)=2, as e e ee is encoded as 1 at this child. We traverse the 1-branch this time, with the next index as 2 .
    计算 rank ( 4 , 1 ) = 2 rank ( 4 , 1 ) = 2 rank(4,1)=2\operatorname{rank}(4,1)=2 ,因为 e e ee 在这个孩子中编码为 1。我们这次遍历 1-分支,下一索引为 2。
  3. rank ( 2 , 1 ) = 2 rank ( 2 , 1 ) = 2 rank(2,1)=2\operatorname{rank}(2,1)=2, which we use as the index in the child on the 1 -branch, with our next index as 2 .
    rank ( 2 , 1 ) = 2 rank ( 2 , 1 ) = 2 rank(2,1)=2\operatorname{rank}(2,1)=2 ,我们在 1-分支的子项中将其用作索引,接下来的索引为 2。
  4. rank ( 2 , 0 ) = 2 rank ( 2 , 0 ) = 2 rank(2,0)=2\operatorname{rank}(2,0)=2, as e e ee is encoded as 0 here. Since the children at this point are leaf nodes, we return 2 as our result.
    rank ( 2 , 0 ) = 2 rank ( 2 , 0 ) = 2 rank(2,0)=2\operatorname{rank}(2,0)=2 ,因为 e e ee 在这里编码为 0。由于此时的子节点是叶节点,我们返回 2 作为结果。
Hence the result of rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) is 2 . If we store these nodes in RRR, binary rank queries can be answered in O ( 1 ) O ( 1 ) O(1)O(1) time.
因此, rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) 的结果是 2。如果我们将这些节点存储在 RRR 中,二进制秩查询可以在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内回答。

Figure 12: Binary Wavelet Tree for ‘Peter Piper…’ where spaces are displayed as underscores.
图 12:‘Peter Piper…’的二进制小波树,其中空格显示为下划线。

Figure 13: 4-ary Wavelet Tree for ‘Peter Piper…’ where spaces are displayed as underscores.
图 13:‘Peter Piper…’的 4 元小波树,其中空格显示为下划线。

Figure 14: Answering rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) over the Binary Wavelet Tree for ‘Peter Piper…’ where spaces are displayed as underscores.
图 14:在二进制小波树上回答 rank ( 6 , e ) rank ( 6 , e ) rank(6,e)\operatorname{rank}(6, e) ,关于“Peter Piper…”的内容,其中空格显示为下划线。

3.2 RRR

RRR was first proposed by Raman et al. [25]. The purpose of R R R R R R RRRR R R is to encode a bit sequence in such a way that supports O ( 1 ) O ( 1 ) O(1)O(1) time binary rank queries. It also provides implicit (i.e. automatic) compression, requiring NH 0 ( S ) + o ( N ) NH 0 ( S ) + o ( N ) NH_(0)(S)+o(N)\mathrm{NH}_{0}(S)+o(N) where H 0 ( S ) H 0 ( S ) H_(0)(S)H_{0}(S) is the zeroth-order empirical entropy of S S SS.
RRR 最早由 Raman 等人提出[25]。 R R R R R R RRRR R R 的目的是以一种支持 O ( 1 ) O ( 1 ) O(1)O(1) 时间二进制秩查询的方式编码位序列。它还提供了隐式(即自动)压缩,要求 NH 0 ( S ) + o ( N ) NH 0 ( S ) + o ( N ) NH_(0)(S)+o(N)\mathrm{NH}_{0}(S)+o(N) ,其中 H 0 ( S ) H 0 ( S ) H_(0)(S)H_{0}(S) S S SS 的零阶经验熵。
Zeroth-order empirical entropy is a lower bound for the average code word size when a symbol is mapped to the same code word irrespective of the context in which they appear. It can be calculated as H 0 ( S ) = i = 1 σ N i N log N N i H 0 ( S ) = i = 1 σ N i N log N N i H_(0)(S)=sum_(i=1)^(sigma)(N^(i))/(N)log((N)/(N^(i)))H_{0}(S)=\sum_{i=1}^{\sigma} \frac{N^{i}}{N} \log \frac{N}{N^{i}}, for an alphabet Σ Σ Sigma\Sigma size of σ σ sigma\sigma, and N i N i N^(i)N^{i} is the number of occurrences of the i t h i t h i^(th)i^{t h} element of Σ Σ Sigma\Sigma in our text S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N].
零阶经验熵是当一个符号无论出现在何种上下文中都映射到相同的代码字时,平均代码字大小的下界。它可以计算为 H 0 ( S ) = i = 1 σ N i N log N N i H 0 ( S ) = i = 1 σ N i N log N N i H_(0)(S)=sum_(i=1)^(sigma)(N^(i))/(N)log((N)/(N^(i)))H_{0}(S)=\sum_{i=1}^{\sigma} \frac{N^{i}}{N} \log \frac{N}{N^{i}} ,对于字母表 Σ Σ Sigma\Sigma 大小为 σ σ sigma\sigma ,而 N i N i N^(i)N^{i} 是我们文本 S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N] Σ Σ Sigma\Sigma i t h i t h i^(th)i^{t h} 元素出现的次数。
The classical O ( 1 ) O ( 1 ) O(1)O(1) time implementation of binary rank queries required the N N NN bits of the original bit sequence, plus O ( N log log N log N ) = o ( N ) O N log log N log N = o ( N ) O((N log log N)/(log N))=o(N)O\left(\frac{N \log \log N}{\log N}\right)=o(N) additional bits [11].
经典的 O ( 1 ) O ( 1 ) O(1)O(1) 时间实现的二进制秩查询需要原始位序列的 N N NN 位,加上 O ( N log log N log N ) = o ( N ) O N log log N log N = o ( N ) O((N log log N)/(log N))=o(N)O\left(\frac{N \log \log N}{\log N}\right)=o(N) 个额外的位 [11]。

Figure 15: Block division scheme for ‘Peter Piper…’ Wavelet Tree’s root node bit vector.
图 15:‘Peter Piper…’小波树根节点位向量的块划分方案。
To construct the RRR we divide the bit vector into several so-called superblocks, we then divide these superblocks further, into blocks of b b bb bits each, as
为了构建 RRR,我们将位向量分成几个所谓的超级块,然后进一步将这些超级块划分为每个 b b bb 位的块,作为

in Figure 15. We call the number of blocks in a superblock the superblock-factor, f f ff.
在图 15 中。我们称超块中的块数为超块因子, f f ff
For each of these blocks we store a class number c c cc, which in the binary case is the number of 1 s in the block. This is used as a lookup key in a table G G GG, which is a table of tables, and will be explained shortly. We also store offset o o oo, which is an index into the table at G [ c ] G [ c ] G[c]G[c]. Each offset value o o oo tells us precisely which of the possible blocks of class c c cc a block is. See Figure 17.
对于每个这些块,我们存储一个类编号 c c cc ,在二进制情况下,它是块中 1 的数量。这被用作表 G G GG 中的查找键,这是一张表的表,稍后将进行解释。我们还存储偏移量 o o oo ,它是表 G [ c ] G [ c ] G[c]G[c] 中的一个索引。每个偏移值 o o oo 精确地告诉我们一个块是类 c c cc 的可能块中的哪一个。请参见图 17。

G G GG is a table having subtables G [ c ] G [ c ] G[c]G[c] for each class c c cc. For every possible permutation of c c cc 1-bits, G [ c ] G [ c ] G[c]G[c] contains an array of cumulative sums for each position in the block of given offset and class - this is illustrated in Figure 16. It is important to note that the size of o o oo varies, since the number of possible permutations of c c cc bits, and hence entries in G [ c ] G [ c ] G[c]G[c], is ( b c ) ( b c ) ((b)/(c))\binom{b}{c}, and can be encoded in log ( b c ) log ( b c ) log((b)/(c))\log \binom{b}{c} bits.
G G GG 是一个包含每个类 c c cc 的子表 G [ c ] G [ c ] G[c]G[c] 的表。对于每个可能的 c c cc 1 位的排列, G [ c ] G [ c ] G[c]G[c] 包含给定偏移量和类的块中每个位置的累积和数组 - 如图 16 所示。重要的是要注意, o o oo 的大小是变化的,因为 c c cc 位的可能排列数量,因此 G [ c ] G [ c ] G[c]G[c] 中的条目是 ( b c ) ( b c ) ((b)/(c))\binom{b}{c} ,可以用 log ( b c ) log ( b c ) log((b)/(c))\log \binom{b}{c} 位编码。
The reason for grouping blocks into superblocks is to avoid iterating over each block to answer a rank query; a query requires the sum of the ranks of previous blocks as well, as depicted in See Figure 17. If we store the sum of all block ranks up to a superblock boundary, then a rank query rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) can be answered like so:
将块分组为超块的原因是为了避免对每个块进行迭代以回答排名查询;查询还需要前面块的排名总和,如图 17 所示。如果我们存储到超块边界的所有块排名的总和,那么排名查询 rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) 可以这样回答:
  1. Calculate which block our index is in as i b = i / b i b = i / b i_(b)=i//bi_{b}=i / b.
    计算我们的索引位于哪个块中,作为 i b = i / b i b = i / b i_(b)=i//bi_{b}=i / b
  2. Calculate which superblock our block resides in as i s = i b / f i s = i b / f i_(s)=i_(b)//fi_{s}=i_{b} / f.
    计算我们的区块位于哪个超级块中,作为 i s = i b / f i s = i b / f i_(s)=i_(b)//fi_{s}=i_{b} / f
  3. Set result to the sum of previous ranks at i s i s i_(s)i_{s} boundary (which is precalculated).
    将结果设置为在 i s i s i_(s)i_{s} 边界处的前一个等级的总和(该值是预先计算的)。
  4. Using each blocks class-offset pair ( c , o ) ( c , o ) (c,o)(c, o) after the boundary at i s i s i_(s)i_{s}, add the rank for that entire block to result.
    在边界后使用每个块的类偏移对 ( c , o ) ( c , o ) (c,o)(c, o) i s i s i_(s)i_{s} ,将该整个块的排名添加到结果中。
  5. Repeat previous step until we reach i b i b i_(b)i_{b}. We then add rank i b ( j , c ) rank i b ( j , c ) rank_(i_(b))(j,c)\operatorname{rank}_{i_{b}}(j, c) to our result, where j = i mod b j = i mod b j=i mod bj=i \bmod b, and is the position we are querying local to i b i b i_(b)i_{b}. Our final answer is result.
    重复前一步骤,直到我们达到 i b i b i_(b)i_{b} 。然后我们将 rank i b ( j , c ) rank i b ( j , c ) rank_(i_(b))(j,c)\operatorname{rank}_{i_{b}}(j, c) 加入我们的结果,其中 j = i mod b j = i mod b j=i mod bj=i \bmod b ,并且是我们查询的相对于 i b i b i_(b)i_{b} 的位置。我们的最终答案是结果。
With the superblock we also store an initial address for the variable-length offset values. After finding the first offset address in a superblock, we calculate the next offset address in bits according to the blocks class c c cc as log ( b c ) log ( b c ) |~log((b)/(c))~|\left\lceil\log \binom{b}{c}\right\rceil bits, which we add to the current address. See Figure 17, which shows what is calculated for each superblock 3 3 ^(3){ }^{3}.
通过超级块,我们还存储了可变长度偏移值的初始地址。在超级块中找到第一个偏移地址后,我们根据块的类别 c c cc 计算下一个偏移地址,以 log ( b c ) log ( b c ) |~log((b)/(c))~|\left\lceil\log \binom{b}{c}\right\rceil 位为单位,并将其加到当前地址上。请参见图 17,显示了为每个超级块 3 3 ^(3){ }^{3} 计算的内容。
It is possible to support Multiary Wavelet Trees using RRR with a more extensive class allocation, which we will discuss in Section 4.4.
可以使用更广泛的类分配通过 RRR 支持多元小波树,我们将在第 4.4 节中讨论。
Figure 16: Binary RRR Count Table, with example lookup for class c = 2 c = 2 c=2c=2 and offset o = 3 o = 3 o=3o=3 in a RRR sequence.
图 16:二进制 RRR 计数表,示例查找类 c = 2 c = 2 c=2c=2 和 RRR 序列中的偏移 o = 3 o = 3 o=3o=3

sum of ranks for all previous blocks
所有先前区块的排名总和

before superblock boundaries
在超级块边界之前
Figure 17: RRR Sequence divided into three superblocks. For each superblock boundary, a sum of previous ranks is stored, as well as the address of the first offset value. These allow us to reduce the amount of iteration required to answer a rank query.
图 17:RRR 序列分为三个超级块。对于每个超级块边界,存储前面排名的总和,以及第一个偏移值的地址。这些允许我们减少回答排名查询所需的迭代次数。

4 Multiary Wavelet Trees and Generalised RRR
4 多元小波树和广义 RRR

This section discusses the design of Multiary Wavelet Trees, and three approaches to support rank queries on the nodes.
本节讨论了多元小波树的设计,以及支持节点上排名查询的三种方法。

4.1 Multiary Wavelet Trees
4.1 多元小波树

Multiary Wavelet Trees are analogous to their binary counterparts, although now we encode each node recursively like so:
多元小波树类似于它们的二进制对应物,尽管现在我们以递归方式对每个节点进行编码,如下所示:
  1. Encoding one A t h A t h A^(th)A^{t h} of the alphabet as 0 , the next A th A th  A^("th ")A^{\text {th }} as 1 , the next N t h N t h N^(th)N^{t h} as 2 and so on until A 1 A 1 A-1A-1. For example, with the ‘Peter Piper…’ string:
    将字母表中的一个 A t h A t h A^(th)A^{t h} 编码为 0,下一个 A th A th  A^("th ")A^{\text {th }} 编码为 1,再下一个 N t h N t h N^(th)N^{t h} 编码为 2,依此类推,直到 A 1 A 1 A-1A-1 。例如,对于“Peter Piper…”字符串:
Σ = { $ , , P , a , c , d , e , f , i , k , l , o , p , r , s , t } enc ( Σ ) = { 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 } Σ = $ , , P , a , c , d , e , f , i , k , l , o , p , r , s , t enc ( Σ ) = { 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 } {:[Sigma={$,_(-),P,a,c,d,e,f,i,k,l,o,p,r,s,t}],[enc(Sigma)={0","0","0","0","1","1","1","1","2","2","2","2","3","3","3","3}]:}\begin{gathered} \Sigma=\left\{\$,_{-}, P, a, c, d, e, f, i, k, l, o, p, r, s, t\right\} \\ \operatorname{enc}(\Sigma)=\{0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3\} \end{gathered}
  1. Group each 0 -encoded symbol as a sub-tree
    将每个 0 编码符号分组为一个子树
  2. Group each 1-encoded symbol as a sub-tree
    将每个 1 编码的符号分组为子树
  3. Group each 2-encoded symbol as a sub-tree, and so on until A 1 A 1 A-1A-1
    将每个 2 编码符号分组为一个子树,依此类推,直到 A 1 A 1 A-1A-1
  4. Reapply to each sub-tree recursively until the amount of symbols is less than or equal to A 1 A 1 A-1A-1.
    递归地重新应用到每个子树,直到符号的数量小于或等于 A 1 A 1 A-1A-1
See Figure 13 for a Multiary Wavelet Tree constructed over the ‘Peter Piper…’ string.
请参见图 13,了解在“Peter Piper…”字符串上构建的多元小波树。
We can no longer use binary rank queries, and hence Binary RRR, the same way as we do with a Binary Wavelet Tree. We discuss three alternative approaches in the following sections.
我们无法再像使用二进制小波树那样使用二进制秩查询,因此也无法使用二进制 RRR。我们将在接下来的章节中讨论三种替代方法。

4.2 Variation 1 : Uncompressed
4.2 变体 1 : 未压缩

eeecedecfcedee

22202120302122
000010001010000
Symbol 1 2 00000100000100 1 2 00000100000100 (1)/(2)00000100000100\frac{1}{2} 00000100000100  符号 1 2 00000100000100 1 2 00000100000100 (1)/(2)00000100000100\frac{1}{2} 00000100000100
211101010001011
3 20000000100000
Figure 18: Concatenated bitmap binary encoding of multiary Wavelet Tree Node representing ‘eeecedecfcedee’ from the ‘Peter Piper…’ string.
图 18:表示“eeecedecfcedee”的多元小波树节点的连接位图二进制编码,来自“Peter Piper…”字符串。
It is possible to represent each encoded symbol c c cc, where c c cc is an element of 0 , 1 , , A 1 0 , 1 , , A 1 0,1,dots,A-10,1, \ldots, A-1 and A A AA is the arity, using A A AA bitmaps. First we construct the bitmaps for each symbol, as in Figure 18, then we concatenate these bitmaps and store them as one bit-vector. A rank query then involves a ranged binary rank query on N [ c L , c L + i ] N [ c L , c L + i ] N[cL,cL+i]N[c L, c L+i] at each node N N NN, where L L LL is the length of the node string before concatenation, and i i ii is the position.
可以使用 A A AA 位图表示每个编码符号 c c cc ,其中 c c cc 0 , 1 , , A 1 0 , 1 , , A 1 0,1,dots,A-10,1, \ldots, A-1 的一个元素, A A AA 是元数。首先,我们为每个符号构建位图,如图 18 所示,然后将这些位图连接起来并将其存储为一个位向量。然后,排名查询涉及在每个节点 N N NN 上对 N [ c L , c L + i ] N [ c L , c L + i ] N[cL,cL+i]N[c L, c L+i] 进行范围二进制排名查询,其中 L L LL 是连接前节点字符串的长度, i i ii 是位置。
We use A A AA bits per symbol, when they could be represented in log A log A log A\log A bits, but this allows us to utilise binary RRR, which we do in the 'Multi-Binary R R R R R R RRRR R R ’ Wavelet Tree discussed in Section 4.3.
我们使用 A A AA 位每个符号,而它们可以用 log A log A log A\log A 位表示,但这使我们能够利用二进制 RRR,我们在第 4.3 节讨论的“多重二进制 R R R R R R RRRR R R ”小波树中使用了它。

4.3 Variation 2 : Multi-Binary RRR
4.3 变体 2 : 多二进制 RRR

Like the uncompressed version, bitmaps are created for each symbol and concatenated, but the bit-vector is stored in a binary RRR sequence. A query then becomes two binary rank queries 4 4 ^(4){ }^{4};
与未压缩版本一样,位图为每个符号创建并连接,但位向量以二进制 RRR 序列存储。查询变成两个二进制秩查询 4 4 ^(4){ }^{4}
rank ( c L 1 , 1 ) rank ( c L + i , 1 ) rank ( c L 1 , 1 ) rank ( c L + i , 1 ) {:[rank(c**L-1","1)],[rank(c**L+i","1)]:}\begin{array}{r} \operatorname{rank}(c * L-1,1) \\ \operatorname{rank}(c * L+i, 1) \end{array}
Where c c cc is the symbol we are querying at position i , c > 0 i , c > 0 i,c > 0i, c>0, and L L LL is the original length before concatenation. If c = 0 c = 0 c=0c=0 then we say the result of the first binary rank query is 0 . The final result of rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) is calculated as the second binary rank query minus the first.
其中 c c cc 是我们在位置 i , c > 0 i , c > 0 i,c > 0i, c>0 查询的符号, L L LL 是连接前的原始长度。如果 c = 0 c = 0 c=0c=0 ,那么我们说第一次二进制排名查询的结果是 0。 rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) 的最终结果是第二次二进制排名查询减去第一次。
This variation means that we will not need to store a bigger G G GG table to accommodate the additional classes when increasing the arity, when compared with a generalised RRR structure, but does not offer the same sequence compression as the concatenated bitmaps take more bits than required.
这种变体意味着,与一般化的 RRR 结构相比,我们不需要存储更大的 G G GG 表来容纳增加的类,但由于连接的位图需要的位数超过所需,因此不提供相同的序列压缩。
Our third variation stores the symbols (without binary encoding) in a generalised RRR structure.
我们的第三种变体将符号(不进行二进制编码)存储在一种广义的 RRR 结构中。

4.4 Generalised RRR  4.4 一般化的 RRR

Figure 19: 4-ary RRR Count Table, with example lookup for class c = 2 c = 2 c=2c=2, which represents ( 2 , 2 , 0 , 1 ) ( 2 , 2 , 0 , 1 ) (2,2,0,1)(2,2,0,1), and offset o = 3 o = 3 o=3o=3 in a RRR sequence.
图 19:4 元 RRR 计数表,示例查找类 c = 2 c = 2 c=2c=2 ,表示 ( 2 , 2 , 0 , 1 ) ( 2 , 2 , 0 , 1 ) (2,2,0,1)(2,2,0,1) ,以及 RRR 序列中的偏移量 o = 3 o = 3 o=3o=3
The purpose of the Generalised RRR structure is to provide O ( 1 ) O ( 1 ) O(1)O(1) time rank queries and compression of a sequence of small integers (as opposed to binary integers) on the range [ 0 . . A 1 ] [ 0 . . A 1 ] [0..A-1][0 . . A-1] for arity A A AA. This requires these differences:
通用 RRR 结构的目的是提供 O ( 1 ) O ( 1 ) O(1)O(1) 时间排名查询和对范围 [ 0 . . A 1 ] [ 0 . . A 1 ] [0..A-1][0 . . A-1] 内小整数序列(与二进制整数相对)进行压缩,适用于元数 A A AA 。这需要这些差异:
  • Rather than simply being the number of 1-bits, classes are now considered to be a tuple of ( N 0 , N 1 , , N A 1 ) N 0 , N 1 , , N A 1 (N^(0),N^(1),dots,N^(A-1))\left(N^{0}, N^{1}, \ldots, N^{A-1}\right) for a given block, where N 0 N 0 N^(0)N^{0} is the number of 0 s , N 1 0 s , N 1 0s,N^(1)0 \mathrm{~s}, N^{1} is the number of 1 s , and so on. The RRR sequence still stores classes as a unique integer, though.
    与简单地计算 1 位的数量不同,类现在被视为给定块的一个元组 ( N 0 , N 1 , , N A 1 ) N 0 , N 1 , , N A 1 (N^(0),N^(1),dots,N^(A-1))\left(N^{0}, N^{1}, \ldots, N^{A-1}\right) ,其中 N 0 N 0 N^(0)N^{0} 0 s , N 1 0 s , N 1 0s,N^(1)0 \mathrm{~s}, N^{1} 的数量是 1 的数量,依此类推。尽管如此,RRR 序列仍然将类存储为一个唯一的整数。
  • Rather than having ( b N 1 ) ( b N 1 ) ((b)/(N^(1)))\binom{b}{N^{1}} permutations per class for blocksize b b bb, there are now ( b N 0 , N 1 , , N A 1 ) = i = 1 A ( b j = 1 i 1 N j N i ) ( b N 0 , N 1 , , N A 1 ) = i = 1 A ( b j = 1 i 1 N j N i ) ((b)/(N^(0),N^(1),dots,N^(A-1)))=prod_(i=1)^(A)((b-sum_(j=1)^(i-1)N^(j))/(N^(i)))\binom{b}{N^{0}, N^{1}, \ldots, N^{A-1}}=\prod_{i=1}^{A}\binom{b-\sum_{j=1}^{i-1} N^{j}}{N^{i}} different permutations. Each offset value o o oo therefore requires log ( N 0 , N 1 , , N A 1 ) log N 0 , N 1 , , N A 1 |~log(N^(0)","N^(1)","dots","N^(A-1))~|\left\lceil\log \left(\begin{array}{c}N^{0}, N^{1}, \ldots, N^{A-1}\end{array}\right)\right\rceil bits. The number of different permutations grows rapidly as arity increases, so our implementation only stored the classes and offsets that we encountered in an attempt to make use of sparsity we can expect in the BWT.
    与每个类别有 ( b N 1 ) ( b N 1 ) ((b)/(N^(1)))\binom{b}{N^{1}} 种排列不同,对于块大小 b b bb ,现在有 ( b N 0 , N 1 , , N A 1 ) = i = 1 A ( b j = 1 i 1 N j N i ) ( b N 0 , N 1 , , N A 1 ) = i = 1 A ( b j = 1 i 1 N j N i ) ((b)/(N^(0),N^(1),dots,N^(A-1)))=prod_(i=1)^(A)((b-sum_(j=1)^(i-1)N^(j))/(N^(i)))\binom{b}{N^{0}, N^{1}, \ldots, N^{A-1}}=\prod_{i=1}^{A}\binom{b-\sum_{j=1}^{i-1} N^{j}}{N^{i}} 种不同的排列。因此,每个偏移值 o o oo 需要 log ( N 0 , N 1 , , N A 1 ) log N 0 , N 1 , , N A 1 |~log(N^(0)","N^(1)","dots","N^(A-1))~|\left\lceil\log \left(\begin{array}{c}N^{0}, N^{1}, \ldots, N^{A-1}\end{array}\right)\right\rceil 位。随着元数增加,不同排列的数量迅速增长,因此我们的实现仅存储我们遇到的类别和偏移,以试图利用我们可以在 BWT 中预期的稀疏性。
  • The G G GG table must also store cumulative ranks for each symbol, per permutation. See Figure 19.
    G G GG 表还必须存储每个符号的累积排名,按排列方式。请参见图 19。
  • Each superblock now has A A AA rank sums of each previous block - a rank sum for each symbol.
    每个超级块现在都有 A A AA 个前一个块的排名总和 - 每个符号的一个排名总和。

5 Experiments  5 实验

We will detail below what our expected findings were, and our method, including the data selected for the experiments and why it was chosen.
我们将在下面详细说明我们预期的发现,以及我们的方法,包括为实验选择的数据及其选择原因。

5.1 Hypotheses  5.1 假设

The experiments were designed to test the following hypotheses:
实验旨在测试以下假设:
  • Since BWTs have many runs of the same symbol, RRR classes and offsets will not be equally distributed.
    由于 BWT 有许多相同符号的运行,RRR 类和偏移量将不会均匀分布。
  • Increasing the arity of a Wavelet Tree will make it shallower and hence reduce the amount of nodes visited per query, resulting in faster queries.
    增加小波树的元数将使其变得更浅,从而减少每个查询访问的节点数量,导致查询速度更快。
  • There is a practical limit to the order of the arity increase for Generalized RRR, since the RRR count table will increase in size. In particular there is tension between the size of the RRR count table and the size of the class / offset sequences.
    对于广义 RRR,增大阶数的实际限制是存在的,因为 RRR 计数表的大小会增加。特别是,RRR 计数表的大小与类/偏移序列的大小之间存在紧张关系。

5.2 Method  5.2 方法

For each Multiary Wavelet Tree variant 5 5 ^(5){ }^{5} (using Generalised RRR, Multi-binary RRR, and uncompressed concatenated bitmaps to provide sequence ranking at each node), we generated 1000 random rank queries rank ( c , i ) rank ( c , i ) rank(c,i)\operatorname{rank}(c, i). For three runs the mean query time was recorded, and the minimum result was taken as our result, as it is the time least influenced by external factors (e.g. Operating System swapping). The above experiment was repeated as we doubled the arity.
对于每个多元小波树变体 5 5 ^(5){ }^{5} (使用广义 RRR、多元二进制 RRR 和未压缩的连接位图在每个节点提供序列排名),我们生成了 1000 个随机排名查询 rank ( c , i ) rank ( c , i ) rank(c,i)\operatorname{rank}(c, i) 。在三次运行中记录了平均查询时间,并将最小结果作为我们的结果,因为这是受外部因素(例如操作系统交换)影响最小的时间。上述实验在我们将元数加倍时重复进行。
The size of the Wavelet Tree (including the RRR encoded sequences at each node) was recorded. The size of the RRR count tables was recorded for the cases which used a variant of RRR. For the Generalized RRR, we recorded how many unique class and offset values were encountered, and calculated the percentage of total possible classes and offsets these were 6 6 ^(6){ }^{6}.
记录了小波树的大小(包括每个节点的 RRR 编码序列)。对于使用 RRR 变体的情况,记录了 RRR 计数表的大小。对于广义 RRR,我们记录了遇到的唯一类和偏移值的数量,并计算了这些值占总可能类和偏移的百分比,这些值为 6 6 ^(6){ }^{6}
We used Francisco Claude’s SPIRE 2008 implementation of binary RRR 7 7 ^(7){ }^{7} implementation as a base line for comparison [4]. We used the same default block-size and super-block factor as Claude, which were 15 and 32 respectively.
我们使用了 Francisco Claude 的 SPIRE 2008 二进制 RRR 7 7 ^(7){ }^{7} 实现作为比较的基准[4]。我们使用了与 Claude 相同的默认块大小和超块因子,分别为 15 和 32。
The data set used for testing is described below 8 8 ^(8){ }^{8} in Section 5.3 with some statistics in Table 1.
用于测试的数据集在第 5.3 节中描述,如下所示 8 8 ^(8){ }^{8} ,并在表 1 中提供了一些统计数据。
The experiments were run on an otherwise idle Mac OS X Snow Leopard with a 2.4 GHz Intel Core 2 Duo processor, and 4 GB 1067 MHz DDR3 RAM.
实验在一台闲置的 Mac OS X Snow Leopard 上进行,配备 2.4 GHz Intel Core 2 Duo 处理器和 4 GB 1067 MHz DDR3 RAM。

5.3 Test Data  5.3 测试数据

Since a prime use of Wavelet Trees is to provide faster rank queries over BWTs (as part of an FM-Index), we constructed BWT strings over a selection of texts taken from the Pizza E C C h i l i Corpus website 9 E C C h i l i Corpus website  9 ECChili^("Corpus website ")^(9)\mathcal{E C C h i l i}^{\text {Corpus website }}{ }^{9}. This is the standard corpus used when developing compressed self indexes, and has been collected by Paolo Ferragina and Gonzalo Navarro, two prominent contributors of suffix array research.
由于小波树的一个主要用途是提供对 BWT 的更快排名查询(作为 FM-Index 的一部分),我们在从 Pizza E C C h i l i Corpus website 9 E C C h i l i Corpus website  9 ECChili^("Corpus website ")^(9)\mathcal{E C C h i l i}^{\text {Corpus website }}{ }^{9} 中选取的文本上构建了 BWT 字符串。这是开发压缩自索引时使用的标准语料库,由两位著名的后缀数组研究贡献者 Paolo Ferragina 和 Gonzalo Navarro 收集。
The corpus consists of source code for the Linux kernel and GNU C Compiler (GCC), protein sequences, DNA, English texts from Project Gutenberg 10 10 ^(10){ }^{10}, and XML-formatted bibliographies from several major Computer Science journals. These are considered to be representative of the sort of texts a suffix array may be used for. In the case of the English corpus, we also took a mapping of each unique word to an integer, allowing us to test word-based indexing.
语料库由 Linux 内核和 GNU C 编译器(GCC)的源代码、蛋白质序列、DNA、来自古腾堡计划的英文文本 10 10 ^(10){ }^{10} 以及来自几本主要计算机科学期刊的 XML 格式参考书目组成。这些被认为是后缀数组可能使用的文本类型的代表。在英文语料库的情况下,我们还对每个唯一单词进行了映射到一个整数,使我们能够测试基于单词的索引。
Three data sizes were used to test the scalability: 25 MB , 50 MB 25 MB , 50 MB 25MB,50MB25 \mathrm{MB}, 50 \mathrm{MB} and 75 MB . Importantly, these data sizes are much bigger than the available CPU cache, but will not take large amounts of time for experimentation. The length and alphabet size of these files are described in Table 1.
使用了三种数据大小来测试可扩展性: 25 MB , 50 MB 25 MB , 50 MB 25MB,50MB25 \mathrm{MB}, 50 \mathrm{MB} 和 75 MB。重要的是,这些数据大小远大于可用的 CPU 缓存,但不会占用大量实验时间。这些文件的长度和字母表大小在表 1 中描述。
Data  数据 25 MB 50 MB 75 MB
σ σ sigma\sigma length  长度 σ σ sigma\sigma length  长度 σ σ sigma\sigma
xml 96 26214400 98 52428804 98 78643208
dna 14 26214400 19 52428804 21 78643208
proteins  蛋白质 25 26214400 29 52428804 33 78643208
sources  来源 116 26214400 229 52428804 229 78643208
english  英语 154 26214400 179 52428804 188 78643208
words  单词 83 083 5969593 115754 11860943 164757
Data 25 MB 50 MB 75 MB sigma length sigma length sigma xml 96 26214400 98 52428804 98 78643208 dna 14 26214400 19 52428804 21 78643208 proteins 25 26214400 29 52428804 33 78643208 sources 116 26214400 229 52428804 229 78643208 english 154 26214400 179 52428804 188 78643208 words 83 083 5969593 115754 11860943 164757| Data | 25 MB | | 50 MB | | 75 MB | | | :---: | ---: | ---: | ---: | ---: | ---: | ---: | | | $\sigma$ | length | $\sigma$ | length | | $\sigma$ | | xml | 96 | 26214400 | 98 | 52428804 | 98 | 78643208 | | dna | 14 | 26214400 | 19 | 52428804 | 21 | 78643208 | | proteins | 25 | 26214400 | 29 | 52428804 | 33 | 78643208 | | sources | 116 | 26214400 | 229 | 52428804 | 229 | 78643208 | | english | 154 | 26214400 | 179 | 52428804 | 188 | 78643208 | | words | 83 | 083 | 5969593 | 115754 | 11860943 | 164757 |
Table 1: Text lengths and alphabet sizes for each test file.
表 1:每个测试文件的文本长度和字母表大小。

6 Results  6 结果

Note that measurements for the Generalized R R R R R R RRRR R R with arity 16 are missing, since the experiments took too long due to paging. This supports our third hypothesis, but there may be ways to overcome this, as detailed in Section 8.
请注意,具有 16 个元的广义 R R R R R R RRRR R R 的测量值缺失,因为实验由于分页而耗时过长。这支持了我们的第三个假设,但可能有方法可以克服这一点,具体细节见第 8 节。
Also Note that the Uncompressed Wavelet Tree timings have been plotted on a different graph due to the difference in scale.
另请注意,未压缩小波树的时间已在不同的图表上绘制,因为比例不同。
Arity Max Total Classes  最大总课程数 Max Total Permutations  最大总排列数
2 16 32768
4 816 1073741824
8 170544 35184372088832
Arity Max Total Classes Max Total Permutations 2 16 32768 4 816 1073741824 8 170544 35184372088832| Arity | Max Total Classes | Max Total Permutations | | :---: | ---: | ---: | | 2 | 16 | 32768 | | 4 | 816 | 1073741824 | | 8 | 170544 | 35184372088832 |
Table 2: Maximum Total Classes and Offsets possible with a blocksize of 15 for arity values 2,4 , and 8 .
表 2:对于基数值 2、4 和 8,块大小为 15 时可能的最大总类和偏移量。
From Figure 20 we can see how the number of unique classes and block permutations increases depending on file size and arity. Table 2 shows the maximum of each of these values, and Figure 21 plots the number of unique classes and block permutations as a percentage of the values in Table 2.
从图 20 中我们可以看到,唯一类别和块排列的数量如何根据文件大小和元数增加。表 2 显示了这些值的最大值,图 21 则将唯一类别和块排列的数量绘制为表 2 中值的百分比。
Figure 21 indicates that in all data sets, around 100 percent of the classes and block permutations were encountered for arity 2 . In most data sets, all classes were encountered for arity 4 as well, with the exception being DNA, which encountered around 30 percent.
图 21 表明,在所有数据集中,约 100%的类和块排列在二元性下被遇到。在大多数数据集中,四元性下也遇到了所有类,唯一的例外是 DNA,遇到了约 30%。
Figure 20 shows that the proteins data set had significantly more unique block permutations than any other data set, while the words data set is the only one to have encountered all classes for arity 8 . This is because protein data is essentially random, so its BWT will not contain long runs of the same symbol, reducing the skew of its classes.
图 20 显示,蛋白质数据集的唯一块排列显著多于任何其他数据集,而单词数据集是唯一一个遇到所有 8 元类的。原因在于蛋白质数据本质上是随机的,因此其 BWT 不会包含长时间的相同符号,从而减少了其类的偏斜。

English  英语

DNA
Sources  来源
Words  词语
Proteins  蛋白质
XML
Figure 20: Number of unique classes and block permutations for each data file. The vertical axis represents the total number of classes and block permutations that were witnessed for the given file size and arity when using the Multiary Wavelet Tree with Generalised RRR.
图 20:每个数据文件的唯一类和块排列的数量。纵轴表示在使用具有广义 RRR 的多元小波树时,对于给定文件大小和元数所观察到的类和块排列的总数。
English  英语
DNA
Sources  来源
Words  词语
Proteins  蛋白质
XML
Figure 21: Sparsity measurements for each data file. The vertical axis represents the percentage of total possible classes and block permutations that were witnessed for the given file size and arity when using the Multiary Wavelet Tree with Generalised RRR. Note that some bars are too small to see.
图 21:每个数据文件的稀疏性测量。纵轴表示在使用带有广义 RRR 的多元小波树时,对于给定文件大小和元数所见的总可能类别和块排列的百分比。请注意,有些条形图太小而无法看到。
Uncompressed Multiary Wavelet Tree query times for English files
未压缩的多元小波树查询时间(针对英文文件)

Figure 22: Query times for Uncompressed Multiary Wavelet Tree of increasing arity for each English file.
图 22:每个英文文件的无压缩多元小波树的查询时间,随着元数的增加。
From Figure 22 we can see how increasing the arity affects querying an uncompressed Wavelet Tree. It is slower for increasing file size because it is calculating rank queries without the assistance of RRR.
从图 22 中我们可以看到,增加元数如何影响对未压缩小波树的查询。由于在没有 RRR 的帮助下计算秩查询,文件大小增加时速度会变慢。
From Figures 23 and 24 we are able to see that increasing the file size does not significantly affect the time performance of the Wavelet Trees which utilise RRR.
从图 23 和图 24 中,我们可以看到,增加文件大小并不会显著影响使用 RRR 的 Wavelet Trees 的时间性能。
The Generalised RRR is slower than Claude’s. We suspect that this is due to our use of pointers, as required to create a sparse table, whereas Claude’s avoids dereferencing and may make better use of cache. The trend is similar to the other Multiary Wavelet Trees, though.
通用 RRR 比 Claude 的慢。我们怀疑这是由于我们使用了指针,这是创建稀疏表所必需的,而 Claude 的避免了解引用,可能更好地利用了缓存。不过,这一趋势与其他多元小波树相似。
The Multi-Binary RRR Wavelet Tree is faster than Claude’s when the arity is increased. It is slower at arity 2 possibly due to optimisations in Claude’s Wavelet Tree code, or because we need to do an extra calculation to work out the binary rank of all previous symbols (see Section 4.3).
多元二进制 RRR 小波树在增加元数时比 Claude 的更快。在元数为 2 时,它的速度较慢,这可能是由于 Claude 的小波树代码中的优化,或者因为我们需要进行额外的计算来确定所有先前符号的二进制秩(见第 4.3 节)。
RRR Wavelet Tree query times for 75MB English file
75MB 英文文件的 RRR 小波树查询时间

Figure 23: Query times for RRR Wavelet Trees of increasing arity for the 75MB English file.
图 23:75MB 英文文件的 RRR 小波树的查询时间,随着分支数的增加。

Figure 24: Query times for RRR Wavelet Trees of increasing arity for the 25MB and 50MB English files.
图 24:25MB 和 50MB 英文文件的 RRR 小波树的查询时间,随着元数的增加。
Memory consumption for 75MB English file
75MB 英文文件的内存消耗

Figure 25: Memory consumption for Wavelet Trees of increasing arity for the 75 MB English file. The memory required for each data structure is the size coefficient multiplied by the original file size. The bar stacked on top is the space for the supporting RRR count table. The bar underneath is the space for the shape of the Wavelet Tree (which had negligible overhead) and each of its nodes, as RRR sequences or not. The Uncompressed Wavelet Tree is the only one which does not have a RRR count table.
图 25:75 MB 英文文件的波形树随着分支数增加的内存消耗。每个数据结构所需的内存是大小系数乘以原始文件大小。顶部的条形是支持 RRR 计数表的空间。底部的条形是波形树的形状(几乎没有开销)及其每个节点的空间,无论是否为 RRR 序列。未压缩的波形树是唯一没有 RRR 计数表的。
Figure 25 shows the memory consumption of the structures for each English file. As the smaller sized files have similar size coefficients, we only show the graphs for the 75 MB files.
图 25 显示了每个英文文件结构的内存消耗。由于较小的文件具有相似的大小系数,我们仅显示 75 MB 文件的图表。
Note that even though the Generalised RRR Wavelet Tree size (which includes the RRR sequences it stores) is smaller than the original text, and smaller than the Multi-Binary RRR Wavelet Tree, the size to contain the supporting RRR count structure is large.
请注意,尽管广义 RRR 小波树的大小(包括它存储的 RRR 序列)小于原始文本,并且小于多重二进制 RRR 小波树,但包含支持 RRR 计数结构的大小是很大的。
The Proteins files have less classes (Figure 21) but more permutations (Figure 20) than the Words files. This may be the cause of the large table size for the Generalised RRR Wavelet Tree in Figure 27 compared with that of the Words file in Figure 26.
蛋白质文件的类别较少(图 21),但排列组合较多(图 20),与单词文件相比。这可能是导致图 27 中广义 RRR 小波树的表格大小较大,而图 26 中的单词文件表格较小的原因。
In all cases, the Generalised RRR Wavelet Tree size is less than the original text size. It is important to note that the RRR count table can be shared among Wavelet Trees, but this may cause it to grow to its full amount, depending on the distribution of classes for the type of text we are indexing, as more classes and permutations may be encountered.
在所有情况下,广义 RRR 小波树的大小都小于原始文本的大小。重要的是要注意,RRR 计数表可以在小波树之间共享,但这可能导致其增长到最大值,具体取决于我们正在索引的文本类型的类别分布,因为可能会遇到更多的类别和排列。
The tradeoff between memory and query time for the English files can be seen in Figure 28. All other test data follows a similar trend, so the graphs have been omitted 11 11 ^(11){ }^{11}, with the exception of Words (Figure 29), Proteins (Figure 30)
英语文件的内存与查询时间之间的权衡可以在图 28 中看到。所有其他测试数据遵循类似的趋势,因此图表已被省略 11 11 ^(11){ }^{11} ,除了单词(图 29)、蛋白质(图 30)。

Memory consumption for 75 MB Words file
75 MB Words 文件的内存消耗

Figure 26: Memory consumption for Wavelet Trees of increasing arity for the 75 MB Words file. The memory required for each data structure is the size coefficient multiplied by the original file size. The bar stacked on top is the space for the supporting RRR count table. The bar underneath is the space for the shape of the Wavelet Tree (which had negligible overhead) and each of its nodes, as RRR sequences or not. The Uncompressed Wavelet Tree is the only one which does not have a RRR count table.
图 26:75 MB Words 文件的波形树随着度数增加的内存消耗。每个数据结构所需的内存是大小系数乘以原始文件大小。顶部堆叠的条形是支持 RRR 计数表的空间。下面的条形是波形树的形状所需的空间(几乎没有开销)以及其每个节点的空间,无论是否为 RRR 序列。未压缩的波形树是唯一没有 RRR 计数表的。

and DNA.  和 DNA。
In Figure 29 The Generalised RRR Wavelet Trees queries slow down for arity 8. This may be due to fragmentation of the RRR count table memory; as the Wavelet Tree becomes less shallow as its arity increases, the benefit of cache may also diminish, since the RRR count table will become larger and possibly dispersed throughout memory. This would not affect the other RRR Wavelet Trees as their RRR count tables are contiguous and small, making the better use of cache.
在图 29 中,广义 RRR 小波树的查询在度数为 8 时变得缓慢。这可能是由于 RRR 计数表内存的碎片化;随着小波树的度数增加,其变得不那么浅,缓存的好处可能也会减小,因为 RRR 计数表将变得更大,并可能分散在内存中。这不会影响其他 RRR 小波树,因为它们的 RRR 计数表是连续且较小的,更好地利用了缓存。
Figure 30 shows the query time increasing slightly for arity 4 , then decreasing again for arity 8 . This may be due to the small alphabet size of the Proteins file, meaning the Wavelet Tree would not get much shallower, and as above the performance may be dominated by cache. This trend was seen in DNA as well, which also has a small alphabet.
图 30 显示了查询时间在 4 元时略有增加,然后在 8 元时再次减少。这可能是由于 Proteins 文件的字母表大小较小,意味着小波树不会变得太浅,并且如上所述,性能可能受到缓存的主导。这一趋势在 DNA 中也有体现,DNA 的字母表也很小。
Memory consumption for 75 MB Proteins file
75 MB 蛋白质文件的内存消耗

Figure 27: Memory consumption for Wavelet Trees of increasing arity for the 75 MB Proteins file. The memory required for each data structure is the size coefficient multiplied by the original file size. The bar stacked on top is the space for the supporting RRR count table. The bar underneath is the space for the shape of the Wavelet Tree (which had negligible overhead) and each of its nodes, as RRR sequences or not. The Uncompressed Wavelet Tree is the only one which does not have a RRR count table.
图 27:75 MB 蛋白质文件的波形树随着分支数增加的内存消耗。每个数据结构所需的内存是大小系数乘以原始文件大小。顶部的条形是支持 RRR 计数表的空间。底部的条形是波形树的形状(几乎没有开销)及其每个节点的空间,无论是否为 RRR 序列。未压缩的波形树是唯一没有 RRR 计数表的。

Figure 28: Memory-Time tradeoff for RRR Wavelet Trees of increasing arity for the 75 MB English file. The memory required for each data structure is the size coefficient multiplied by the original file size.
图 28:75 MB 英文文件的 RRR 小波树随着度数增加的内存-时间权衡。每个数据结构所需的内存是大小系数乘以原始文件大小。
Memory-Time tradeoff for 75MB Words file
75MB Words 文件的内存-时间权衡

Figure 29: Memory-Time tradeoff for RRR Wavelet Trees of increasing arity for the 75 MB Words file. The memory required for each data structure is the size coefficient multiplied by the original file size.
图 29:75 MB Words 文件的 RRR 小波树随着分支数增加的内存-时间权衡。每个数据结构所需的内存是大小系数乘以原始文件大小。

Memory-Time tradeoff for 75MB Proteins file
75MB 蛋白质文件的内存-时间权衡

Figure 30: Memory-Time tradeoff for RRR Wavelet Trees of increasing arity for the 75 MB Proteins file. The memory required for each data structure is the size coefficient multiplied by the original file size.
图 30:75 MB 蛋白质文件的 RRR 小波树的内存-时间权衡,随着树的度数增加。每个数据结构所需的内存是大小系数乘以原始文件大小。

7 Conclusion  7 结论

From our observations we have discovered that there is sparsity in the classes and block permutations encountered in our test data, and it can have a significant effect on the supporting table size, in the case of Generalised RRR. However, making use of this sparsity dynamically requires pointers, and may be the cause of some time overhead and cache misses while querying.
根据我们的观察,我们发现测试数据中存在类和块排列的稀疏性,这在广义 RRR 的情况下可能对支持表的大小产生显著影响。然而,动态利用这种稀疏性需要指针,这可能是查询时造成一些时间开销和缓存未命中的原因。
Since the RRR count table is shared among all nodes and even all Wavelet Trees of the same or smaller arity and blocksize, it may be the case that when documents are significantly large, or we are indexing a large collection of documents, the overhead of the RRR count table becomes negligible. For such documents, a distributed approach may be required; we may store the RRR table on multiple computers, or increase the arity so that the table fits in one central server for querying by the computers that host the Wavelet Trees. It is likely such a configuration may need the extra speed while querying, since it is typical of a search engine to host indexes over many servers and calculate many queries per second.
由于 RRR 计数表在所有节点以及相同或更小的分支数和块大小的所有小波树之间共享,因此当文档非常大或我们正在索引大量文档时,RRR 计数表的开销可能变得微不足道。对于这样的文档,可能需要一种分布式方法;我们可以将 RRR 表存储在多台计算机上,或者增加分支数,以便表可以适应一个中央服务器,以供托管小波树的计算机查询。这样的配置可能需要在查询时额外的速度,因为搜索引擎通常在许多服务器上托管索引,并每秒计算许多查询。
For single documents, or small collections of documents, the RRR count table expands too rapidly to make increasing the arity worthwhile. We discuss a possible method to curb this growth in Section 8.
对于单个文档或小型文档集合,RRR 计数表的扩展速度过快,使得增加元数变得不值得。我们在第 8 节讨论了一种可能的方法来抑制这种增长。
However, we have shown that there are simple ways to implement Multiary Wavelet Trees using rank structures for binary alphabets. In our ‘Multi-Binary RRR’ Wavelet Tree rank queries become faster, while the Wavelet Tree nodes did not grow too large. The Multi-Binary RRR Wavelet Tree was sometimes larger than the original text, however.
然而,我们已经证明了使用秩结构为二进制字母实现多元小波树的简单方法。在我们的“多元二进制 RRR”小波树中,秩查询变得更快,而小波树节点并没有变得过大。然而,多元二进制 RRR 小波树有时比原始文本要大。

8 Future Work  8 未来工作

There are several promising avenues for future work which this thesis has helped reveal:
本论文揭示了未来工作的几个有前景的方向:
  1. Investigate if there is any way to make the count table for Generalized RRR smaller. One promising idea is to only store base counts which can generate all cyclic permutations of a block. For example, if one block is b 1 = [ 0 , 1 , 2 , 2 , 3 ] b 1 = [ 0 , 1 , 2 , 2 , 3 ] b_(1)=[0,1,2,2,3]b_{1}=[0,1,2,2,3] and another block is b 2 = [ 1 , 2 , 2 , 3 , 0 ] b 2 = [ 1 , 2 , 2 , 3 , 0 ] b_(2)=[1,2,2,3,0]b_{2}=[1,2,2,3,0] (that is, a left cyclic shift of one position applied to b 1 b 1 b_(1)b_{1} ), then there may be a way to calculate the counts for b 2 b 2 b_(2)b_{2} from b 1 b 1 b_(1)b_{1}, and no longer store the counts for b 2 b 2 b_(2)b_{2}.
    调查是否有任何方法可以使广义 RRR 的计数表更小。一个有前景的想法是仅存储可以生成一个块的所有循环排列的基本计数。例如,如果一个块是 b 1 = [ 0 , 1 , 2 , 2 , 3 ] b 1 = [ 0 , 1 , 2 , 2 , 3 ] b_(1)=[0,1,2,2,3]b_{1}=[0,1,2,2,3] ,另一个块是 b 2 = [ 1 , 2 , 2 , 3 , 0 ] b 2 = [ 1 , 2 , 2 , 3 , 0 ] b_(2)=[1,2,2,3,0]b_{2}=[1,2,2,3,0] (即对 b 1 b 1 b_(1)b_{1} 应用一个位置的左循环移位),那么可能有办法从 b 1 b 1 b_(1)b_{1} 计算 b 2 b 2 b_(2)b_{2} 的计数,而不再存储 b 2 b 2 b_(2)b_{2} 的计数。
  2. Similar to above, another option is to share count table entries among different blocks which have similar positioning but for different symbols. For example the block b 1 = [ 0 , 1 , 2 , 2 , 3 ] b 1 = [ 0 , 1 , 2 , 2 , 3 ] b_(1)=[0,1,2,2,3]b_{1}=[0,1,2,2,3] has the same count table entry for c = 2 c = 2 c=2c=2 as the block b 2 = [ 0 , 2 , 1 , 1 , 0 ] b 2 = [ 0 , 2 , 1 , 1 , 0 ] b_(2)=[0,2,1,1,0]b_{2}=[0,2,1,1,0] does for c = 1 c = 1 c=1c=1.
    与上述类似,另一个选项是共享具有相似位置但针对不同符号的不同块之间的计数表条目。例如,块 b 1 = [ 0 , 1 , 2 , 2 , 3 ] b 1 = [ 0 , 1 , 2 , 2 , 3 ] b_(1)=[0,1,2,2,3]b_{1}=[0,1,2,2,3] 对于 c = 2 c = 2 c=2c=2 的计数表条目与块 b 2 = [ 0 , 2 , 1 , 1 , 0 ] b 2 = [ 0 , 2 , 1 , 1 , 0 ] b_(2)=[0,2,1,1,0]b_{2}=[0,2,1,1,0] 对于 c = 1 c = 1 c=1c=1 的计数表条目相同。
  3. Reduce the use of pointers when constructing a sparse RRR table. This may be done by completing a full scan of the text before constructing the table, and tracking how many unique blocks there are, then allocating them contiguously.
    在构建稀疏 RRR 表时减少指针的使用。这可以通过在构建表之前对文本进行完整扫描来完成,并跟踪有多少个唯一块,然后将它们连续分配。
  4. Search for a good tradeoff between arity and block size for the Generalised RRR; the Generalised RRR Table will grow smaller if a smaller block size is used, but the sequences then become bigger as they have more class
    寻找一般化 RRR 中元数和块大小之间的良好折衷;如果使用较小的块大小,则一般化 RRR 表将变小,但序列会变大,因为它们具有更多的类别

    and offset values (which require less bits, but perhaps not significantly so). This may not be preferable since the table can be shared among many sequences. The analysis of Ferragina et al. [8] suggests blocksize and arity should be related, in particular blocksize b = 1 2 log A N b = 1 2 log A N b=|__(1)/(2)log_(A)N __|b=\left\lfloor\frac{1}{2} \log _{A} N\right\rfloor, that is, b b bb decreases slightly as arity A A AA increases.
    和偏移值(需要更少的位数,但可能并不显著)。这可能不是首选,因为该表可以在多个序列之间共享。Ferragina 等人的分析 [8] 表明块大小和元数应该相关,特别是块大小 b = 1 2 log A N b = 1 2 log A N b=|__(1)/(2)log_(A)N __|b=\left\lfloor\frac{1}{2} \log _{A} N\right\rfloor ,即 b b bb 随着元数 A A AA 的增加而略微减少。
  5. Implement and investigate a Multiary Huffman-Shaped Wavelet Tree (see Mäkinen’s work for details on Huffman-Shaped Wavelet Trees [15]). This may overcome the RRR count table memory consumption while still reducing the tree depth.
    实现并研究多元哈夫曼形状小波树(有关哈夫曼形状小波树的详细信息,请参见 Mäkinen 的工作[15])。这可能会克服 RRR 计数表的内存消耗,同时仍然减少树的深度。
  6. Real world query patterns may show that there are certain blocks which are queried more frequently, so we may only need to keep the RRR count table entries for these blocks in memory at all times. The other RRR count table sections may be stored on disk and loaded into memory on the occasion that they are queried.
    现实世界的查询模式可能显示出某些区块被查询的频率更高,因此我们可能只需要始终将这些区块的 RRR 计数表条目保留在内存中。其他 RRR 计数表部分可以存储在磁盘上,并在被查询时加载到内存中。
  7. Distribute the RRR count table among nodes in a cluster, allowing it to be held in memory as restricted by the cluster as a whole, not a single computer. Then, increasing arity would be an issue of how many nodes are on the cluster. There has been recent research on distributed compressed suffix arrays by Russo et al. [26].
    在集群中的节点之间分配 RRR 计数表,允许它在整个集群的限制下保留在内存中,而不是单个计算机。然后,增加的元数将是集群中节点数量的问题。Russo 等人最近对分布式压缩后缀数组进行了研究[26]。
  8. Investigate Multiary Wavelet Trees which use the bitmap concatenation technique, but encode each node using other binary sequence structures which answer rank queries, such as the rank index by Okanohara and Sadakane [23].
    研究使用位图连接技术的多元小波树,但使用其他二进制序列结构对每个节点进行编码,这些结构可以回答秩查询,例如 Okanohara 和 Sadakane 提出的秩索引[23]。

9 Acknowledgements  9 致谢

I thank my supervisor Simon Puglisi for his patience and guidance, Juha Kärkkäinen (University of Helsinki) for helping me to understand RRR, and Francisco Claude (University of Waterloo) for his advice and explanation of his code. I also thank my brothers Nikolas and James Bowe for their programming advice and motivation. Finally, I thank RMIT University for providing me with a scholarship to complete this paper.
我感谢我的导师西蒙·普格利西(Simon Puglisi)对我的耐心和指导,尤哈·卡尔凯宁(Juha Kärkkäinen)(赫尔辛基大学)帮助我理解 RRR,弗朗西斯科·克劳德(Francisco Claude)(滑铁卢大学)对他的建议和代码解释。我还感谢我的兄弟尼古拉斯和詹姆斯·博威(Nikolas and James Bowe)对我的编程建议和激励。最后,我感谢 RMIT 大学为我提供奖学金以完成这篇论文。

References  参考文献

[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53-86, 2004.
[1] M. I. Abouelhoda, S. Kurtz, 和 E. Ohlebusch. 用增强后缀数组替代后缀树. 离散算法杂志, 2(1):53-86, 2004.

[2] J. Barbay, F. Claude, and G. Navarro. Compact rich-functional binary relation representations. In A. López-Ortiz, editor, LATIN 2010: Theoretical Informatics, volume 6034 of Lecture Notes in Computer Science, pages 170-183. Springer, 2010.
[2] J. Barbay, F. Claude, 和 G. Navarro. 紧凑的丰富功能二元关系表示. 在 A. López-Ortiz, 编辑, LATIN 2010: 理论信息学, 计算机科学讲义系列第 6034 卷, 第 170-183 页. Springer, 2010.

[3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
[3] M. Burrows 和 D. J. Wheeler. 一种块排序无损数据压缩算法。技术报告 124,数字设备公司,加利福尼亚州帕洛阿尔托,1994。

[4] F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 176-187. Springer, 2008.
[4] F. Claude 和 G. Navarro. 任意序列上的实用秩/选择查询. 在第 15 届国际字符串处理与信息检索研讨会(SPIRE)论文集中, LNCS 5280, 第 176-187 页. Springer, 2008.

[5] J. S. Culpepper, G. Navarro, S. J. Puglisi, and A. Turpin. Top-k ranked document search in general text databases. In M. de Berg and U. Meyer, editors, Proceedings of the 18th European Symposium on Algorithms (ESA 2010), to appear 2010.
[5] J. S. Culpepper, G. Navarro, S. J. Puglisi, 和 A. Turpin. 在一般文本数据库中的前 k 个排名文档搜索. 在 M. de Berg 和 U. Meyer, 编辑, 第 18 届欧洲算法研讨会论文集 (ESA 2010), 即将出版 2010.

[6] P. Ferragina, R. Giancarlo, and G. Manzini. The myriad virtues of wavelet trees. Information and Computation, 207(8):849-866, 2009.
[6] P. Ferragina, R. Giancarlo, 和 G. Manzini. 小波树的众多优点. 信息与计算, 207(8):849-866, 2009.

[7] P. Ferragina and G. Manzini. Opportunistic data structures with applications. Proceedings of the 41 st Annual IEEE Symposium on Foundations of Computer Science, pages 390-398, 2000.
[7] P. Ferragina 和 G. Manzini. 具有应用的机会数据结构. 第 41 届年度 IEEE 计算机科学基础研讨会论文集, 第 390-398 页, 2000 年.

[8] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):20, 2007.
[8] P. Ferragina, G. Manzini, V. Mäkinen, 和 G. Navarro. 序列的压缩表示和全文索引. ACM 算法交易, 3(2):20, 2007.

[9] P. Flicek and E. Birney. Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6(11s):S6-S12, October 2009.
[9] P. Flicek 和 E. Birney. 从序列读取中获取信息:对齐和组装的方法。自然方法,6(11s):S6-S12,2009 年 10 月。

[10] A. Golynski, J. I. Munro, and S. S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 368-373, New York, NY, USA, 2006. ACM.
[10] A. Golynski, J. I. Munro, 和 S. S. Rao. 大字母表上的排名/选择操作:文本索引的工具。在 SODA '06:第十七届年度 ACM-SIAM 离散算法研讨会论文集,页 368-373,纽约,NY,美国,2006。ACM。

[11] R. González, S. Grabowski, V. Mäkinen, and G. Navarro. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4 th Workshop on Efficient and Experimental Algorithms (WEA), pages 27-38, Greece, 2005. CTI Press and Ellinika Grammata.
[11] R. González, S. Grabowski, V. Mäkinen, 和 G. Navarro. 排名和选择查询的实际实现. 在第四届高效与实验算法研讨会(WEA)海报论文集,页码 27-38,希腊,2005. CTI Press 和 Ellinika Grammata.

[12] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14 th annual ACM-SIAM symposium on Discrete algorithms, pages 841-850. Society for Industrial and Applied Mathematics, 2003.
[12] R. Grossi, A. Gupta, 和 J. Vitter. 高阶熵压缩文本索引. 在第 14 届年度 ACM-SIAM 离散算法研讨会论文集中,页码 841-850. 工业与应用数学学会,2003 年。

[13] W.-K. Hon, R. Shah, and J. Vitter. Space-efficient framework for topk string retrieval problems. In Foundations of Computer Science, 2009. FOCS '09. 50th Annual IEEE Symposium on, pages 713 -722, October 2009.
[13] W.-K. Hon, R. Shah, 和 J. Vitter. 用于 topk 字符串检索问题的空间高效框架. 在计算机科学基础, 2009. FOCS '09. 第 50 届年度 IEEE 研讨会, 页码 713 -722, 2009 年 10 月.

[14] D. Knuth, J. Morris Jr, and V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
[14] D. Knuth, J. Morris Jr, 和 V. Pratt. 字符串中的快速模式匹配. SIAM 计算杂志, 6(2):323-350, 1977.

[15] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12(1):40-66, 2005.
[15] V. Mäkinen 和 G. Navarro. 基于游程编码的简洁后缀数组。北欧计算杂志, 12(1):40-66, 2005.

[16] V. Mäkinen and G. Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14 th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 4726, pages 214-226. Springer, 2007.
[16] V. Mäkinen 和 G. Navarro. 隐式压缩增强及其在自我索引中的应用. 在第 14 届国际字符串处理与信息检索研讨会(SPIRE)论文集中, LNCS 4726, 第 214-226 页. Springer, 2007.

[17] V. Mäkinen and G. Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3):332-347, 2007.
[17] V. Mäkinen 和 G. Navarro. 排名和选择的重新审视与扩展. 理论计算机科学, 387(3):332-347, 2007.

[18] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
[18] U. Manber 和 G. Myers. 后缀数组:一种在线字符串搜索的新方法。SIAM 计算杂志,22(5):935-948,1993。

[19] M. Marín and G. Navarro. Distributed query processing using suffix arrays. In String Processing and Information Retrieval, volume 2857 of Lecture Notes in Computer Science, pages 311-325. Springer, 2003.
[19] M. Marín 和 G. Navarro. 使用后缀数组的分布式查询处理。在《字符串处理与信息检索》中,计算机科学讲义笔记第 2857 卷,页 311-325。施普林格,2003 年。

[20] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
[20] E. M. McCreight. 一种节省空间的后缀树构建算法. ACM 期刊, 23(2):262-272, 1976.

[21] J. Munro. Tables. In Foundations of Software Technology and Theoretical Computer Science, pages 37-42. Springer, 1996.
[21] J. Munro. 表格. 在《软件技术基础与理论计算机科学》中,第 37-42 页。施普林格,1996 年。

[22] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.
[22] G. Navarro 和 V. Mäkinen. 压缩全文索引. ACM Computing Surveys, 39(1):文章 2, 2007.

[23] D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. Arxiv Computing Research Repository, abs/cs/0610001, 2006.
[23] D. Okanohara 和 K. Sadakane. 实用的熵压缩秩/选择字典。Arxiv 计算研究库,abs/cs/0610001,2006。

[24] S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):1-31, 2007.
[24] S. J. Puglisi, W. F. Smyth, 和 A. Turpin. 后缀数组构造算法的分类. ACM Computing Surveys, 39(2):1-31, 2007.

[25] R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. A C M A C M ACMA C M Transactions on Algorithms, 3(4), 2007.
[25] R. Raman, V. Raman, 和 S. R. Satti. 具有编码 k-元树、前缀和和多重集应用的简洁可索引字典。 A C M A C M ACMA C M 算法学报, 3(4), 2007.

[26] L. Russo, G. Navarro, and A. Oliveira. Parallel and distributed compressed indexes. In A. Amir and L. Parida, editors, Combinatorial Pattern Matching, volume 6129 of Lecture Notes in Computer Science, pages 348-360. Springer, 2010.
[26] L. Russo, G. Navarro, 和 A. Oliveira. 并行和分布式压缩索引. 在 A. Amir 和 L. Parida, 编辑, 组合模式匹配, 计算机科学讲义笔记第 6129 卷, 第 348-360 页. Springer, 2010.

[27] M. Wattenberg. Arc diagrams: Visualizing structure in strings. In Proceedings of the IEEE Symposium on Information Visualization, pages 110-116, Washington, DC, USA, 2002. IEEE Computer Society.
[27] M. Wattenberg. 弧图:在字符串中可视化结构。载于《IEEE 信息可视化研讨会论文集》,第 110-116 页,美国华盛顿特区,2002 年。IEEE 计算机学会。

[28] P. Weiner. Linear pattern matching algorithms. In SWAT '73: Proceedings of the 14 th Annual Symposium on Switching and Automata Theory (swat 1973), pages 1-11, Washington, DC, USA, 1973. IEEE Computer Society.
[28] P. Weiner. 线性模式匹配算法。在 SWAT '73:第 14 届开关与自动机理论年会论文集(swat 1973),第 1-11 页,美国华盛顿特区,1973 年。IEEE 计算机学会。

[29] C.-C. Yu, W.-K. Hon, and B.-F. Wang. Efficient data structures for the orthogonal range successor problem. In Computing and Combinatorics, volume 5609 of Lecture Notes in Computer Science, pages 96-105. Springer, 2009.
[29] C.-C. Yu, W.-K. Hon, 和 B.-F. Wang. 正交范围后继问题的高效数据结构. 在《计算与组合》,计算机科学讲义第 5609 卷,页 96-105. 施普林格, 2009.

  1. 1 1 ^(1){ }^{1} Binary rank queries are also known as popcounts in other literature.
    1 1 ^(1){ }^{1} 二进制秩查询在其他文献中也被称为人口计数。
  2. 2 2 ^(2){ }^{2} Note that we are indexing C C CC by a symbol P [ i ] P [ i ] P[i]P[i], so this may be implemented with a suitable hash function.
    2 2 ^(2){ }^{2} 请注意,我们通过符号 P [ i ] P [ i ] P[i]P[i] C C CC 进行索引,因此这可以通过合适的哈希函数来实现。
  3. 3 3 ^(3){ }^{3} We omit the first superblock, since the first offset address is easy to find, and the sum of ranks before the first superblock is always 0 .
    3 3 ^(3){ }^{3} 我们省略第一个超级块,因为第一个偏移地址很容易找到,并且第一个超级块之前的等级总和始终为 0。
  4. 4 4 ^(4){ }^{4} In our implementation we pre-calculate the first binary rank queries for each symbol and store it with the node.
    4 4 ^(4){ }^{4} 在我们的实现中,我们预先计算每个符号的第一个二进制秩查询,并将其与节点一起存储。
  5. 5 5 ^(5){ }^{5} All source code is available at http://github.com/alexbowe/multiary-wt.
    5 5 ^(5){ }^{5} 所有源代码可在 http://github.com/alexbowe/multiary-wt 获取。

    6 6 ^(6){ }^{6} All raw and processed data is available at http://github.com/alexbowe/multiary-wt, and the graphs are available at http://github.com/alexbowe/wavelet-paper/tree/thesis/.
    6 6 ^(6){ }^{6} 所有原始和处理过的数据可在 http://github.com/alexbowe/multiary-wt 获取,图表可在 http://github.com/alexbowe/wavelet-paper/tree/thesis/ 获取。

    7 7 ^(7){ }^{7} Claude’s compressed data structures library is available at http://libcds.recoded.cl
    7 7 ^(7){ }^{7} Claude 的压缩数据结构库可在 http://libcds.recoded.cl 获取

    8 8 ^(8){ }^{8} The BWT files for each dataset are available at
    8 8 ^(8){ }^{8} 每个数据集的 BWT 文件可在此处获取

    http://bwt-corpus.s3.amazonaws.com/list.html, and are reconstructible using the scripts at http://github.com/alexbowe/bwt-corpus.
    http://bwt-corpus.s3.amazonaws.com/list.html,并且可以使用 http://github.com/alexbowe/bwt-corpus 上的脚本进行重建。
  6. 9 http : / / 9 http : / / ^(9)http:////{ }^{9} \mathrm{http}: / / pizzachili.dcc.uchile.cl
    10 10 ^(10){ }^{10} http://www.gutenberg.org
  7. 11 11 ^(11){ }^{11} The other graphs are available at http://github.com/alexbowe/honours-thesis/.
    11 11 ^(11){ }^{11} 其他图表可在 http://github.com/alexbowe/honours-thesis/ 获取。