这是用户在 2025-1-9 20:53 为 https://app.immersivetranslate.com/pdf-pro/9acd99c9-a3e7-4488-a489-c9801a72403f 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Multiary Wavelet Trees in Practice
多元小波树的实践

Alex Bowe  亚历克斯·博威alex.bowe@rmit.edu.au

Honours Thesis  荣誉论文
Supervisor: Simon Puglisi
主管:西蒙·普格利西

School of Computer Science and Information Technology
计算机科学与信息技术学院
RMIT University  RMIT 大学Melbourne, AUSTRALIA  墨尔本,澳大利亚

November 2010  2010 年 11 月

Abstract  摘要

Self-index data structures for pattern matching based on Suffix Arrays and the Burrows-Wheeler Transform have recently grown popular. The fundamental operation in these self-indexes is the rank query: rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) requests the number of occurrences of symbol c c cc before position i i ii in a string. The wavelet tree is currently the data structure of choice for implementing rank queries. Current wavelet tree implementations encode the Burrows-Wheeler Transform as a hierarchy of binary strings, which they then store as RRR sequences; the RRR structure offers O ( 1 ) O ( 1 ) O(1)O(1) rank queries and zeroth-order entropy compression for binary strings. A generalisation of the RRR tecnique extends wavelet trees to have higher order encoding, that is, increased branching, in theory making traversal faster. To support the implementation of such a multiary wavelet tree, this thesis investigates the generalisation of RRR to sequences over small alphabets. We also analyse the use of concatenated bitmaps to represent sequences over small alphabets, thus allowing continued use of the binary RRR structure. Our results show that multiary wavelet trees are faster than their binary counterparts, but require large amounts of memory in the case of the generalised RRR. We also show binary RRR on concatenated bitmaps to be an effective, practical alternative.
基于后缀数组和 Burrows-Wheeler 变换的自索引数据结构在模式匹配中最近变得流行。这些自索引中的基本操作是排名查询: rank ( i , c ) rank ( i , c ) rank(i,c)\operatorname{rank}(i, c) 请求在字符串中位置 i i ii 之前符号 c c cc 的出现次数。目前,波形树是实现排名查询的首选数据结构。目前的波形树实现将 Burrows-Wheeler 变换编码为二进制字符串的层次结构,然后将其存储为 RRR 序列;RRR 结构为二进制字符串提供 O ( 1 ) O ( 1 ) O(1)O(1) 排名查询和零阶熵压缩。RRR 技术的一个推广扩展了波形树,使其具有更高阶的编码,即增加分支,理论上使遍历更快。为了支持这种多元波形树的实现,本论文研究了 RRR 对小字母表序列的推广。我们还分析了使用连接位图表示小字母表序列,从而允许继续使用二进制 RRR 结构。 我们的结果表明,多元小波树比二元小波树更快,但在广义 RRR 的情况下需要大量内存。我们还展示了在连接位图上使用二元 RRR 是一种有效的、实用的替代方案。

Keywords: data structures, algorithms, string search, compression, indexes
关键词:数据结构,算法,字符串搜索,压缩,索引

Contents  内容

1 Introduction … 3  1 引言 … 3
1.1 Our Contribution … 4
1.1 我们的贡献 … 4

1.2 Roadmap … 4
1.2 路线图 … 4

2 Preliminaries … 6
2 前言 … 6

2.1 Background … 6
2.1 背景 … 6

2.2 Suffix Arrays … 6
2.2 后缀数组 … 6

2.3 Burrows-Wheeler Transform … 7
2.3 Burrows-Wheeler 变换 … 7

2.4 Rank Query … 7
2.4 排名查询 … 7

2.5 Backward Search … 8
2.5 向后搜索 … 8

3 Binary Wavelet Trees and RRR … 11
3 二进制小波树和 RRR … 11

3.1 Binary Wavelet Trees … 11
3.1 二进制小波树 … 11

3.2 RRR … 13
4 Multiary Wavelet Trees and Generalised RRR … 16
4 多元小波树和广义 RRR … 16

4.1 Multiary Wavelet Trees … 16
4.1 多元小波树 … 16

4.2 Variation 1 : Uncompressed … 16
4.2 变体 1 : 未压缩 … 16

4.3 Variation 2: Multi-Binary RRR … 17
4.3 变体 2:多二进制 RRR … 17

4.4 Generalised RRR … 17
4.4 一般化 RRR … 17

5 Experiments … 19
5 实验 … 19

5.1 Hypotheses … 19
5.1 假设 … 19

5.2 Method … 19
5.2 方法 … 19

5.3 Test Data … 20
5.3 测试数据 … 20

6 Results … 21
6 结果 … 21

7 Conclusion … 31
7 结论 … 31

8 Future Work … 31
8 未来工作 … 31

9 Acknowledgements … 33
9 致谢 … 33

1 Introduction  1 引言

As collections of text grow larger, our need to find information in them and infer patterns and rankings increases, too. Suffix arrays, first described by Manber and Myers [18], allow a variety of complex pattern matching and pattern discovery problems to be performed in optimal time. There are many areas where suffix arrays are likely the most appropriate data structure for the task, including:
随着文本集合的增大,我们在其中查找信息和推断模式及排名的需求也在增加。后缀数组,首次由 Manber 和 Myers 描述[18],允许以最佳时间执行各种复杂的模式匹配和模式发现问题。在许多领域,后缀数组可能是执行该任务的最合适数据结构,包括:
  • Genome analysis [1, 9];
    基因组分析 [1, 9];
  • Searching for patterns in non-word data such as images, multimedia signals and DNA [5];
    在非文本数据中寻找模式,例如图像、多媒体信号和 DNA [5];
  • Searching for patterns in oriental languages; as some oriental languages do not have spaces between certain particles, an inverted file would be insufficient;
    在东方语言中寻找模式;由于某些东方语言在某些成分之间没有空格,反向文件将不够用;
  • Pattern discovery and visualisation using arc diagrams, as proposed by Wattenberg [27].
    使用弧形图进行模式发现和可视化,如 Wattenberg [27]所提议的。

Figure 1: Overview of the key data structures and their relationship with each other. When a Suffix Array and a Burrows-Wheeler Transform (BWT) [3] are combined, they form an FM-Index [7]. FM-Indexes use Backward Searches on the BWT to provide fast pattern matching, counting, and substring extraction operations. Backward Search requires rank operations, which are best implemented using a Wavelet Tree (WT) [12]. A WT encodes a string as a hierarchy of bit vectors, which it uses to answer general rank queries using log σ log σ log sigma\log \sigma binary rank queries. Binary rank queries can be answered in O ( 1 ) O ( 1 ) O(1)O(1) time when the bit vector is stored as a RRR sequence [25], which utilise a global table of pre-calculated ranks. The RRR data structure also offers compression of the Wavelet Tree. (we defer a detailed discussion of RRR to Section 3.2)
图 1:关键数据结构及其相互关系的概述。当后缀数组和 Burrows-Wheeler 变换(BWT)[3]结合时,它们形成 FM-索引[7]。FM-索引在 BWT 上使用反向搜索,以提供快速的模式匹配、计数和子字符串提取操作。反向搜索需要秩操作,最佳实现方式是使用波形树(WT)[12]。WT 将字符串编码为位向量的层次结构,利用这些位向量回答一般的秩查询,使用 log σ log σ log sigma\log \sigma 二进制秩查询。当位向量存储为 RRR 序列[25]时,二进制秩查询可以在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内回答,这利用了预计算秩的全局表。RRR 数据结构还提供了波形树的压缩。(我们将在 3.2 节中详细讨论 RRR)
Due to its performance in these important applications, the improvement of suffix arrays has been the focus of intensive research over the past 20 years. Self-indexing is one of these improvements. Self indexes support fast pattern counting (reporting the number of occurrences of a pattern in a text), fast pattern matching (reporting the positions of each pattern occurrence) and extraction of arbitrary substrings of the original text, including the original text
由于在这些重要应用中的表现,后缀数组的改进在过去 20 年中一直是密集研究的重点。自我索引是这些改进之一。自我索引支持快速模式计数(报告模式在文本中出现的次数)、快速模式匹配(报告每个模式出现的位置)以及提取原始文本的任意子字符串,包括原始文本。

itself. These operations in some sense allow us to replace the original text with a self index.
本身。这些操作在某种意义上允许我们用自我索引替换原始文本。
One such self index is known as the ‘FM-Index’, proposed by Ferragina and Manzini [7]. An FM-Index utilises a sparse Suffix Array and the BurrowsWheeler Transform (BWT) [3] (the two leftmost diagrams in Figure 1) of the original text, and supports fast pattern matching operations by using the socalled ‘backward search’ method.
一种这样的自索引被称为“FM-Index”,由 Ferragina 和 Manzini 提出。FM-Index 利用稀疏后缀数组和原始文本的 Burrows-Wheeler 变换(BWT),并通过使用所谓的“向后搜索”方法支持快速模式匹配操作。
Backward searching with a BWT requires a fundamental operation called a ‘rank query’, which counts the number of occurrences of a given symbol up to a given position in the string.
使用 BWT 进行反向搜索需要一个基本操作,称为“排名查询”,它计算在字符串中给定位置之前给定符号的出现次数。
While a naive implementation of the rank query operation might inspect every symbol of the BWT, and do so in O ( N ) O ( N ) O(N)O(N) time where N N NN is the length of the BWT, it is possible to improve this to O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) time where σ σ sigma\sigma is the size of the alphabet. To do this, we construct a Wavelet Tree [12] over the BWT (the third diagram in Figure 1). A Wavelet Tree is constructed by encoding the string as a hierarchy of bit vectors. These bit vectors are then used to answer rank queries by traversing and performing binary rank queries 1 1 ^(1){ }^{1}. A more detailed description of Wavelet Trees appears in Section 3.
虽然对秩查询操作的简单实现可能会检查 BWT 的每个符号,并且在 O ( N ) O ( N ) O(N)O(N) 时间内完成,其中 N N NN 是 BWT 的长度,但可以将其改进为 O ( log σ ) O ( log σ ) O(log sigma)O(\log \sigma) 时间,其中 σ σ sigma\sigma 是字母表的大小。为此,我们在 BWT 上构建一个 Wavelet Tree [12](图 1 中的第三个图)。Wavelet Tree 是通过将字符串编码为位向量的层次结构来构建的。这些位向量随后用于通过遍历和执行二进制秩查询 1 1 ^(1){ }^{1} 来回答秩查询。Wavelet Trees 的更详细描述出现在第 3 节。
Binary rank queries can be performed in O ( 1 ) O ( 1 ) O(1)O(1) time when the bit vectors are stored in a RRR sequence, as proposed by Raman, Raman and Rao [25]. RRR also offers compression of the Wavelet Tree.
当位向量存储在 RRR 序列中时,二进制秩查询可以在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内执行,正如 Raman、Raman 和 Rao [25]所提出的。RRR 还提供了小波树的压缩。
The key motivation behind this project arises from the increasing number of papers, such as Ferragina et al. 2007 [8], Yu et al. 2009 [29], and Barbay et al. 2010 [2], which utilise Multiary Wavelet Trees as a theoretical tool, that is, Wavelet Trees with a branching factor greater than two. However no known implementations of Multiary Wavelet Trees exist. This thesis aims to address this need, and bring theory closer to practice. It was expected that increasing arity would improve the time performance of self-indexes which use a BWT.
该项目的主要动机源于越来越多的论文,例如 Ferragina et al. 2007 [8]、Yu et al. 2009 [29] 和 Barbay et al. 2010 [2],它们将多元小波树作为一种理论工具,即分支因子大于二的小波树。然而,目前尚不存在多元小波树的已知实现。本论文旨在满足这一需求,使理论更接近实践。预计增加元数将改善使用 BWT 的自索引的时间性能。

1.1 Our Contribution  1.1 我们的贡献

The contribution of this paper is the experimental analysis of two types of Multiary Wavelet Tree. The first of these, to our knowledge, is the first implementation of the Generalised RRR structure, which is used to support rank queries on small alphabets, as first suggested by Ferragina et al. [8].
本文的贡献是对两种多元小波树的实验分析。其中第一个,据我们所知,是对广义 RRR 结构的首次实现,该结构用于支持对小字母表的秩查询,正如 Ferragina 等人[8]首次提出的。
We also propose a new data structure, a Multiary Wavelet Tree that utilises Binary RRR using concatenated bitmaps to represent sequences on small alphabets as bit vectors.
我们还提出了一种新的数据结构,多元小波树,它利用二进制 RRR 通过连接位图来表示小字母表上的序列为位向量。
We discover that although Multiary Wavelet Trees are faster, Generalised RRR requires significant memory for the supporting global table in its current form, which, depending on the size of the text, may make these Multiary Wavelet Trees impractical. We show that our new data structure, which continues to use Binary RRR, is a practical way to make rank queries faster. Additionally, we observe that the BWT gives rise to a sparse class list as the arity increases.
我们发现,尽管多元小波树更快,但通用 RRR 在其当前形式下需要大量内存来支持全局表,这可能会使这些多元小波树在文本大小的不同情况下变得不切实际。我们展示了我们新的数据结构,它继续使用二进制 RRR,是一种使秩查询更快的实用方法。此外,我们观察到,随着元数的增加,BWT 产生了一个稀疏的类列表。

1.2 Roadmap  1.2 路线图

The rest of the thesis is organised as follows. Section 2 begins with some notation, followed by background on the problem domain. We then discuss how
论文的其余部分组织如下。第二节首先介绍一些符号,然后是问题领域的背景。接下来我们讨论如何
each data structure is built and used together.
每个数据结构都是一起构建和使用的。

Section 3 provides a description of Binary Wavelet Trees and their use of the RRR structure, followed by a discussion in Section 4 of how our Multiary Wavelet Trees are designed, including some implementation details of interest.
第 3 节提供了二进制小波树及其使用 RRR 结构的描述,接着在第 4 节讨论了我们的多元小波树是如何设计的,包括一些有趣的实现细节。
In Section 5 we describe how we measured the time and space performance for each Multiary Wavelet Tree variation, and provide a rationale of our test dataset. Our results are presented in Section 6, accompanied by a discussion of the apparent trends.
在第 5 节中,我们描述了如何测量每种多元小波树变体的时间和空间性能,并提供了我们的测试数据集的理由。我们的结果在第 6 节中呈现,并伴随对明显趋势的讨论。
The conclusion follows in Section 7. We discuss the practicality of each Multiary Wavelet Tree variation, and in Section 8 we consider how their performance might be improved.
结论在第 7 节中给出。我们讨论了每种多元小波树变体的实用性,在第 8 节中我们考虑如何提高它们的性能。

2 Preliminaries  2 前言

Figure 2: Array representation of ‘mississippi’ string.
图 2:‘mississippi’字符串的数组表示。

Throughout this paper we represent the string we are searching in as S S SS of length | S | = N | S | = N |S|=N|S|=N, and the pattern we are searching for as P . S [ i ] P . S [ i ] P.S[i]P . S[i] represents the symbol located at position i i ii in S S SS, and S [ i . . j ] S [ i . . j ] S[i..j]S[i . . j] represents the substring of S S SS beginning at position i i ii and ending at j j jj inclusive. Strings are one-based, so in Figure 2 S [ 1 ] = 2 S [ 1 ] = 2S[1]=2 S[1]= ’ m ', S [ 3 ] = S [ 3 ] = S[3]=S[3]= ’ s ', and S [ 1 . .3 ] = S [ 1 . .3 ] = S[1..3]=S[1 . .3]= ‘mis’.
在本文中,我们将要搜索的字符串表示为 S S SS ,长度为 | S | = N | S | = N |S|=N|S|=N ,我们要搜索的模式表示为 P . S [ i ] P . S [ i ] P.S[i]P . S[i] ,表示位于 S S SS 中位置 i i ii 的符号,而 S [ i . . j ] S [ i . . j ] S[i..j]S[i . . j] 表示从位置 i i ii 开始并在 j j jj 结束(包括)的 S S SS 的子字符串。字符串是基于 1 的,因此在图 2 S [ 1 ] = 2 S [ 1 ] = 2S[1]=2 S[1]= 中,'m'、's' 和 'mis'。
The i th i th  i^("th ")i^{\text {th }} suffix is thus defined as S [ i . . N ] S [ i . . N ] S[i..N]S[i . . N], so the 1 st 1 st  1^("st ")1^{\text {st }} suffix in Figure 2 is S [ 1 . .12 ] = S [ 1 . .12 ] = S[1..12]=S[1 . .12]= 'mississipi $ $ $\$ ', and the 5 th 5 th  5^("th ")5^{\text {th }} suffix is S [ 5 . .12 ] = S [ 5 . .12 ] = S[5..12]=S[5 . .12]= ‘issippi$’. The i th i th  i^("th ")i^{\text {th }} prefix is defined as S [ 1 . . i ] S [ 1 . . i ] S[1..i]S[1 . . i], so the 5 th 5 th  5^("th ")5^{\text {th }} prefix in Figure 2 is S [ 1 . .5 ] = S [ 1 . .5 ] = S[1..5]=S[1 . .5]= ‘missi’.
因此, i th i th  i^("th ")i^{\text {th }} 后缀被定义为 S [ i . . N ] S [ i . . N ] S[i..N]S[i . . N] ,所以图 2 中的 1 st 1 st  1^("st ")1^{\text {st }} 后缀是 S [ 1 . .12 ] = S [ 1 . .12 ] = S[1..12]=S[1 . .12]= 'mississipi $ $ $\$ ',而 5 th 5 th  5^("th ")5^{\text {th }} 后缀是 S [ 5 . .12 ] = S [ 5 . .12 ] = S[5..12]=S[5 . .12]= 'issippi$'。 i th i th  i^("th ")i^{\text {th }} 前缀被定义为 S [ 1 . . i ] S [ 1 . . i ] S[1..i]S[1 . . i] ,所以图 2 中的 5 th 5 th  5^("th ")5^{\text {th }} 前缀是 S [ 1 . .5 ] = S [ 1 . .5 ] = S[1..5]=S[1 . .5]= 'missi'。
The log log log\log operation is base 2 unless otherwise stated.
除非另有说明, log log log\log 操作是以 2 为基数。

2.1 Background  2.1 背景

In 1970 Knuth, Morris and Pratt (KMP) discovered an algorithm to match patterns in time proportional to the length of the text [14, 20]. If the text is large, then KMP becomes ineffective for ranking and pattern discovery. Moreover, KMP is only useful for exact matches.
在 1970 年,Knuth、Morris 和 Pratt(KMP)发现了一种算法,可以在与文本长度成比例的时间内匹配模式[14, 20]。如果文本很大,那么 KMP 在排名和模式发现方面变得无效。此外,KMP 仅对精确匹配有用。
One alternative to KMP for document ranking is the use of an inverted index, but they must work with keywords and are thus inappropriate for many applications, such as searches on certain oriental languages, and other strings that do not have a clear definition of keywords (MIDI, for example). Suffix arrays are also more efficient than inverted files for searching phrases or partial patterns [ 13 , 19 ] [ 13 , 19 ] [13,19][13,19]. This was originally possible with the suffix tree [20, 28] (a precursor to the suffix array), although suffix trees require three to five times as much space [18].
KMP 的一个替代方案是使用倒排索引,但它们必须与关键词一起工作,因此不适合许多应用,例如某些东方语言的搜索,以及其他没有明确关键词定义的字符串(例如 MIDI)。后缀数组在搜索短语或部分模式时也比倒排文件更高效 [ 13 , 19 ] [ 13 , 19 ] [13,19][13,19] 。这最初是通过后缀树[20, 28](后缀数组的前身)实现的,尽管后缀树需要三到五倍的空间[18]。

2.2 Suffix Arrays  2.2 后缀数组

In its simplest form, a suffix array can be constructed for a string S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N] like so:
在最简单的形式中,可以为字符串 S [ 1 . . N ] S [ 1 . . N ] S[1..N]S[1 . . N] 构建后缀数组,如下所示:
  1. Construct an array of pointers to all suffixes S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S[1..N],S[2..N],dots,S[N..N]S[1 . . N], S[2 . . N], \ldots, S[N . . N].
    构造一个指向所有后缀的指针数组 S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S [ 1 . . N ] , S [ 2 . . N ] , , S [ N . . N ] S[1..N],S[2..N],dots,S[N..N]S[1 . . N], S[2 . . N], \ldots, S[N . . N]
  2. Sort these pointers by the lexicographical (i.e. alphabetical) ordering of their associated suffixes.
    按其相关后缀的字典顺序(即字母顺序)对这些指针进行排序。
Figure 2 shows an example string, ‘mississippi’. The construction of the corresponding suffix array is shown in Figure 3. Construction of suffix arrays is now a well studied problem [24].
图 2 显示了一个示例字符串,‘mississippi’。相应后缀数组的构造如图 3 所示。后缀数组的构造现在是一个研究得很透彻的问题[24]。

Figure 3: Construction of Suffix Array for ‘mississippi’.
图 3:为“mississippi”构建后缀数组。
BWT SA
i 1 2 1 2 12\mathbf{1 2} $
p 1 1 1 1 11\mathbf{1 1} i$
s 8 ippi$
s 5 5 5\mathbf{5} issippi$  密西西比$
m 2 2 2\mathbf{2} ississippi$  密西西比$
$ 1 mississippi$  密西西比$
p 10 pi$
i 9 ppi$
s 7 sippi$
s 4 sissippi$
i 6 ssippi$
i 3 ssissippi$
BWT SA i 12 $ p 11 i$ s 8 ippi$ s 5 issippi$ m 2 ississippi$ $ 1 mississippi$ p 10 pi$ i 9 ppi$ s 7 sippi$ s 4 sissippi$ i 6 ssippi$ i 3 ssissippi$| BWT | SA | | | :---: | :---: | :--- | | i | $\mathbf{1 2}$ | $ | | p | $\mathbf{1 1}$ | i$ | | s | 8 | ippi$ | | s | $\mathbf{5}$ | issippi$ | | m | $\mathbf{2}$ | ississippi$ | | $ | 1 | mississippi$ | | p | 10 | pi$ | | i | 9 | ppi$ | | s | 7 | sippi$ | | s | 4 | sissippi$ | | i | 6 | ssippi$ | | i | 3 | ssissippi$ |
Figure 4: Suffix Array and Burrows-Wheeler Transform for ‘mississippi’ string.
图 4:‘mississippi’字符串的后缀数组和 Burrows-Wheeler 变换。

2.3 Burrows-Wheeler Transform
2.3 巴罗斯-惠勒变换

The Burrows-Wheeler Transform (BWT) is a string of length N N NN defined by the suffix array S A S A SAS A and the original text S S SS. In particular, B W T [ i ] = S [ S A [ i ] 1 ] B W T [ i ] = S [ S A [ i ] 1 ] BWT[i]=S[SA[i]-1]B W T[i]=S[S A[i]-1], and B W T [ 1 ] = B W T [ 1 ] = BWT[1]=B W T[1]= $ $ $\$ ', that is, the i t h i t h i^(th)i^{t h} symbol of the BWT is the symbol prior to the i t h i t h i^(th)i^{t h} suffix in the Suffix Array S A S A SAS A. See Figure 4.
Burrows-Wheeler 变换(BWT)是由后缀数组 S A S A SAS A 和原始文本 S S SS 定义的长度为 N N NN 的字符串。特别地, B W T [ i ] = S [ S A [ i ] 1 ] B W T [ i ] = S [ S A [ i ] 1 ] BWT[i]=S[SA[i]-1]B W T[i]=S[S A[i]-1] ,以及 B W T [ 1 ] = B W T [ 1 ] = BWT[1]=B W T[1]= $ $ $\$ ',也就是说,BWT 的 i t h i t h i^(th)i^{t h} 符号是后缀数组 S A S A SAS A i t h i t h i^(th)i^{t h} 后缀之前的符号。见图 4。
As proposed by Ferragina and Manzini, when a BWT is stored alongside a Suffix Array, it is known as a FM-Index [7], which supports backwards search.
根据 Ferragina 和 Manzini 的提议,当 BWT 与后缀数组一起存储时,它被称为 FM-Index [7],支持向后搜索。

2.4 Rank Query  2.4 排名查询

Munro [21] describes how to do rank queries on binary strings in O ( 1 ) O ( 1 ) O(1)O(1) time using o ( n ) o ( n ) o(n)o(n) bits of extra space. Much of the early work on rank queries focussed on binary strings.
Munro [21] 描述了如何在 O ( 1 ) O ( 1 ) O(1)O(1) 时间内使用 o ( n ) o ( n ) o(n)o(n) 位额外空间对二进制字符串进行排名查询。早期关于排名查询的许多工作集中在二进制字符串上。
A rank query on the string S S SS is defined as rank S ( i , c ) = n rank S ( i , c ) = n rank_(S)(i,c)=n\operatorname{rank}_{S}(i, c)=n, with n n nn being the number of times symbol c c cc appears in the range S [ 1 , i ] S [ 1 , i ] S[1,i]S[1, i]. This paper omits the subscript when the string we are querying is clear from the context. For example in Figure 5 , rank ( 9 , s ) = 3 5 , rank ( 9 , s ) = 3 5,rank(9,s)=35, \operatorname{rank}(9, s)=3. If i 0 i 0 i <= 0i \leq 0 then rank ( i , c ) = 0 rank ( i , c ) = 0 rank(i,c)=0\operatorname{rank}(i, c)=0.
对字符串 S S SS 的排名查询定义为 rank S ( i , c ) = n rank S ( i , c ) = n rank_(S)(i,c)=n\operatorname{rank}_{S}(i, c)=n ,其中 n n nn 是符号 c c cc 在范围 S [ 1 , i ] S [ 1 , i ] S[1,i]S[1, i] 中出现的次数。本文在查询的字符串从上下文中清晰可见时省略下标。例如在图 5 , rank ( 9 , s ) = 3 5 , rank ( 9 , s ) = 3 5,rank(9,s)=35, \operatorname{rank}(9, s)=3 中。如果 i 0 i 0 i <= 0i \leq 0 rank ( i , c ) = 0 rank ( i , c ) = 0 rank(i,c)=0\operatorname{rank}(i, c)=0

Figure 5: Rank query rank ( 9 , s ) = 3 rank ( 9 , s ) = 3 rank(9,s)=3\operatorname{rank}(9, s)=3 on the Burrows-Wheeler Transform of ‘mississippi’.
图 5:在“mississippi”的 Burrows-Wheeler 变换上进行的排名查询 rank ( 9 , s ) = 3 rank ( 9 , s ) = 3 rank(9,s)=3\operatorname{rank}(9, s)=3
Since any pattern P P PP in S S SS is a prefix of a suffix, and because the suffixes are lexicographically ordered, all occurrences of a search pattern P P PP lie in a contiguous portion of the Suffix Array. In earlier implementations, the range that this pattern lies on would be located by using successive binary searches. Backward search utilises the BWT in a series of paired rank queries, improving the query performance considerably [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [4,6,8,10,16,17,19,22][4,6,8,10,16,17,19,22]
由于 S S SS 中的任何模式 P P PP 都是后缀的前缀,并且后缀是按字典顺序排列的,因此搜索模式 P P PP 的所有出现都位于后缀数组的一个连续部分。在早期的实现中,这个模式所在的范围是通过使用连续的二分搜索来定位的。反向搜索利用 BWT 进行一系列配对排名查询,显著提高了查询性能 [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [ 4 , 6 , 8 , 10 , 16 , 17 , 19 , 22 ] [4,6,8,10,16,17,19,22][4,6,8,10,16,17,19,22]
Backward search issues | P | | P | |P||P| pairs of rank queries, where | P | | P | |P||P| denotes the length of the pattern. The paired rank queries are:
反向搜索问题 | P | | P | |P||P| 对排名查询,其中 | P | | P | |P||P| 表示模式的长度。配对的排名查询是:
s = C [ P [ i ] ] + rank ( s 1 , P [ i ] ) + 1 e = C [ P [ i ] ] + rank ( e , P [ i ] ) s = C [ P [ i ] ] + rank ( s 1 , P [ i ] ) + 1 e = C [ P [ i ] ] + rank ( e , P [ i ] ) {:[s^(')=C[P[i]]+rank(s-1","P[i])+1],[e^(')=C[P[i]]+rank(e","P[i])]:}\begin{gathered} s^{\prime}=C[P[i]]+\operatorname{rank}(s-1, P[i])+1 \\ e^{\prime}=C[P[i]]+\operatorname{rank}(e, P[i]) \end{gathered}
Where s s ss denotes the start of the range, initially at s = 1 s = 1 s=1s=1, and e e ee is the end of the range, initially e = N e = N e=Ne=N.
其中 s s ss 表示范围的起始,最初为 s = 1 s = 1 s=1s=1 ,而 e e ee 是范围的结束,最初为 e = N e = N e=Ne=N
In Figure 7 there is a column F F FF, which contains the first symbol for each suffix. Note that the F F FF column is not stored as we store it encoded as C C CC instead.
在图 7 中,有一列 F F FF ,其中包含每个后缀的第一个符号。请注意, F F FF 列没有存储,因为我们将其编码为 C C CC 来存储。

C C CC is an array 2 2 ^(2){ }^{2} containing the count of all symbols in Σ Σ Sigma\Sigma which sort lexicographically before P [ i ] P [ i ] P[i]P[i], where Σ Σ Sigma\Sigma is the alphabet from our original string S S SS, as in Figure 6.
C C CC 是一个数组 2 2 ^(2){ }^{2} ,包含在 Σ Σ Sigma\Sigma 中按字典顺序排在 P [ i ] P [ i ] P[i]P[i] 之前的所有符号的计数,其中 Σ Σ Sigma\Sigma 是我们原始字符串 S S SS 中的字母表,如图 6 所示。
In the first iteration we query the final character of the pattern, so i = | P | i = | P | i=|P|i=|P|. For each iteration, we decrement i i ii until i = 1 i = 1 i=1i=1. This maintains the invariant that S A [ s . . e ] S A [ s . . e ] SA[s..e]S A[s . . e] contains all the suffixes of which P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] is a prefix, and hence all locations of P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] in S S SS. This is illustrated in Figure 8 through to Figure 10. If at any stage e < s e < s e < se<s, then the pattern does not exist in our original string.
在第一次迭代中,我们查询模式的最后一个字符,因此 i = | P | i = | P | i=|P|i=|P| 。对于每次迭代,我们递减 i i ii ,直到 i = 1 i = 1 i=1i=1 。这保持了不变式,即 S A [ s . . e ] S A [ s . . e ] SA[s..e]S A[s . . e] 包含所有以 P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] 为前缀的后缀,因此包含 S S SS P [ i . . | P | ] P [ i . . | P | ] P[i..|P|]P[i . .|P|] 的所有位置。这在图 8 到图 10 中进行了说明。如果在任何阶段 e < s e < s e < se<s ,那么模式在我们的原始字符串中不存在。
An example is given in Figure 7 through to Figure 10, where the pattern ’ i s s i s s issi s s ’ is searched for in the string ‘mississippi’, starting with i = 3 , P [ 3 ] = i = 3 , P [ 3 ] = i=3,P[3]=i=3, P[3]= s s ss '. The working for each rank query is shown below each figure. We represent the current symbol as c c cc to avoid confusion between ’ s ’ and s s ss and s s s^(')s^{\prime}.
图 7 到图 10 中给出了一个示例,其中在字符串‘mississippi’中搜索模式’ i s s i s s issi s s ’,从 i = 3 , P [ 3 ] = i = 3 , P [ 3 ] = i=3,P[3]=i=3, P[3]= s s ss '开始。每个等级查询的工作在每个图形下方显示。我们将当前符号表示为 c c cc ,以避免在’s’和 s s ss s s s^(')s^{\prime} 之间的混淆。

Figure 6: Table C C CC of number of occurrences in F F FF of each symbol which sorts alphabetically before the displayed symbol.
图 6:表 C C CC 中每个符号在 F F FF 中出现的次数,按字母顺序排列在显示符号之前。
F BWT SA
1 - $ i 12 $ s = 1 s = 1 quad s=1\quad s=1
2 i p 11 i$ e = 12
3 i S 8 ippi$
4 i s 5 issippi$  密西西比$
5 i m 2 ississippi$  密西西比$
6 m $ 1 mississippi$  密西西比$
7 p p 10 pi$
8 p i 9 ppi$
9 s s 7 sippi$
10 S s 4 sissippi$
11 S i 6 ssippi$
12 - S i 3 ssissippi$
F BWT SA 1 - $ i 12 $ quad s=1 2 i p 11 i$ e = 12 3 i S 8 ippi$ 4 i s 5 issippi$ 5 i m 2 ississippi$ 6 m $ 1 mississippi$ 7 p p 10 pi$ 8 p i 9 ppi$ 9 s s 7 sippi$ 10 S s 4 sissippi$ 11 S i 6 ssippi$ 12 - S i 3 ssissippi$| | | F | BWT | SA | | | :---: | :---: | :---: | :---: | :---: | :---: | | 1 | - | $ | i | 12 | $ $\quad s=1$ | | 2 | | i | p | 11 | i$ e = 12 | | 3 | | i | S | 8 | ippi$ | | 4 | | i | s | 5 | issippi$ | | 5 | | i | m | 2 | ississippi$ | | 6 | | m | $ | 1 | mississippi$ | | 7 | | p | p | 10 | pi$ | | 8 | | p | i | 9 | ppi$ | | 9 | | s | s | 7 | sippi$ | | 10 | | S | s | 4 | sissippi$ | | 11 | | S | i | 6 | ssippi$ | | 12 | - | S | i | 3 | ssissippi$ |
Figure 7: First stage of backwards search for ‘iss’ on ‘mississippi’ string - before any rank queries have been made.
图 7:在“mississippi”字符串上对“iss”进行反向搜索的第一阶段 - 在进行任何排名查询之前。
  1. Starting from s = 1 s = 1 s=1s=1 and e = 12 e = 12 e=12e=12 as in Figure 7, and c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= ‘s’ where i = 3 i = 3 i=3i=3, we make our first two rank queries:
    s = 1 s = 1 s=1s=1 e = 12 e = 12 e=12e=12 开始,如图 7 所示,以及 c = P [ i ] = c = P [ i ] = c=P[i]=c=P[i]= 的‘s’,在 i = 3 i = 3 i=3i=3 处,我们进行第一次两个排名查询:
s = C [ c ] + rank ( 0 , c ) + 1 = 8 + 0 + 1 = 9 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 s = C [ c ] + rank ( 0 , c ) + 1 = 8 + 0 + 1 = 9 e = C [ c ] + rank ( 12 , c ) = 8 + 4 = 12 {:[s^(')=C[c]+rank(0","c)+1=8+0+1=9],[e^(')=C[c]+rank(12","c)=8+4=12]:}\begin{gathered} s^{\prime}=C[c]+\operatorname{rank}(0, c)+1=8+0+1=9 \\ e^{\prime}=C[c]+\operatorname{rank}(12, c)=8+4=12 \end{gathered}