Abstract 摘要

Motivation : With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to well organize and index many assembled genomes to promote future genomics studies. Burrows–Wheeler Transform (BWT) is an important data structure of genome indexing, which has many fundamental applications; however, it is still non-trivial to construct BWT for large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches cannot well support scalable parallel computing owing to their incremental nature, which is a bottleneck to use modern computers to accelerate BWT construction.
动机:随着高通量测序技术的发展,已组装基因组的数量不断增加。对众多已组装的基因组进行良好的组织和索引对促进未来的基因组学研究至关重要。Burrows-Wheeler Transform(BWT)是一种重要的基因组索引数据结构,它有许多基本应用;然而,为大量基因组,尤其是高度相似或重复的基因组构建 BWT 仍然不是一件容易的事。此外,最先进的方法由于其增量性质,不能很好地支持可扩展的并行计算,这成为利用现代计算机加速 BWT 构建的瓶颈。

Results : We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. DeBWT innovatively represents and organizes the suffixes of input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes the advantage of de Bruijn graph to facilitate the comparison between the suffixes with long common prefix, which breaks the bottleneck of the BWT construction of repetitive genomic sequences. Meanwhile, deBWT also uses the structure of de Bruijn graph for reducing unnecessary comparisons between suffixes. The benchmarking suggests that, deBWT is efficient and scalable to construct BWT for large dataset by parallel computing. It is well-suited to index many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.
成果:我们提出了基于 de Bruijn 分支的 BWT 构造器(deBWT),这是一种新颖的并行 BWT 构造方法。DeBWT 创新性地使用一种新型数据结构 de Bruijn 分支编码来表示和组织输入序列的后缀。这种数据结构利用 de Bruijn 图的优势,便于对具有长公共前缀的后缀进行比较,从而打破了重复基因组序列 BWT 构建的瓶颈。同时,deBWT 还利用 de Bruijn 图的结构减少了后缀间不必要的比较。基准测试表明,deBWT 可以通过并行计算高效、可扩展地构建大型数据集的 BWT。它非常适合使用多核服务器或集群为许多基因组(如人类单个基因组集合)建立索引。

Availability and implementation : deBWT is implemented in C language, the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT
可用性和实现:deBWT 用 C 语言实现,源代码可从 https://github.com/hitbc/deBWT 或 https://github.com/DixianZhu/deBWT 获取。

Contact: ydwang@hit.edu.cn
联系人: ydwang@hit.edu.cn

Supplementary information:  Supplementary data are available at Bioinformatics online.
补充信息:补充数据可在 Bioinformatics online 上获取。

1 Introduction 1 引言

With the rapid development and ubiquitous application of high-throughput sequencing, many genomes have been sequenced in cutting-edge genomics studies. For example, 1000 Genomes ( The 1000 Genomes Project Consortium, 2015 ) and UK10K ( The UK10K Consortium, 2015 ) projects have sequenced many thousands of individual human genomes. Moreover, as the cost of sequencing continuously decreases, e.g. the cost of sequencing a human sample has already been lower than 1000 dollars ( Watson, 2014 ), the number of genomes may explosively increase in the future. Under this circumstance, it is fundamental to well organize and index the large amount of genomes to facilitate future genomics studies.
随着高通量测序技术的快速发展和广泛应用,许多基因组已在前沿基因组学研究中完成测序。例如,"1000 基因组"(The 1000 Genomes Project Consortium, 2015)和 "UK10K"(The UK10K Consortium, 2015)项目已经测序了数千个人类基因组。此外,随着测序成本的不断降低,例如人类样本的测序成本已经低于 1000 美元(Watson,2014 年),基因组的数量在未来可能会呈爆炸式增长。在这种情况下,对大量基因组进行良好的组织和索引是促进未来基因组学研究的基础。

Burrows–Wheeler Transform (BWT; Burrows and Wheeler, 1994 ; Ferragina and Manzini, 2000 ) is a self-indexing data structure having many fundamental applications, such as genome indexing ( Hon et al. , 2004 ; Karkkainen, 2007 ), sequence alignment ( Lam et al. , 2008 ; Li and Durbin, 2009a ; Langmead and Salzberg, 2012 ), genome compression ( Makinen et al. , 2010 ; Cox et al. , 2012 ), genome assembly ( Simpson and Durbin, 2012 ; Li, 2012 ) and sequencing error correction ( Cox et al. , 2011 ). However, the BWT construction of genomic sequence(s) is a non-trivial task. Mainly, the core of BWT construction is to determine the lexicographical order of all the suffixes of the input sequence(s). Because there could be many repetitive sequences within a genome ( Treangen and Salzberg, 2012 ), the cost would be prohibitively high to straightforwardly compare all the suffixes to determine their lexicographical orders. The problem is even more serious for constructing the BWT of many highly similar genomes, such as a large collection of individual human genomes, as there would be many common sequences to make the whole input even more repetitive.
Burrows-Wheeler Transform(BWT;Burrows 和 Wheeler,1994 年;Ferragina 和 Manzini,2000 年)是一种自索引数据结构,有许多基本应用,如基因组索引(Hon 等人,2004 年;Karkkainen,2007 年)、序列比对(Lam 等人,2008 年;Li 和 Durbin,2009a 年;Langmead 和 Salzberg,2012 年)、基因组压缩(Makinen 等人,2010 年;Cox 等人,2012 年)、基因组组装(Makinen et al.,2008;Li 和 Durbin,2009a;Langmead 和 Salzberg,2012)、基因组压缩(Makinen 等人,2010;Cox 等人,2012)、基因组组装(Simpson 和 Durbin,2012;Li,2012)和测序纠错(Cox 等人,2011)。然而,基因组序列的 BWT 构建并非易事。主要而言,BWT 构建的核心是确定输入序列所有后缀的词序。由于基因组中可能有许多重复序列(Treangen 和 Salzberg,2012 年),直接比较所有后缀以确定其词法顺序的成本过高。对于构建许多高度相似基因组的 BWT 来说,这个问题甚至更加严重,比如大量的人类个体基因组,因为会有许多共同的序列使整个输入更加重复。

Efforts have been made to BWT construction. As the BWT of input sequence(s) can be directly derived from the corresponding suffix array (SA), many extant SA construction methods ( Smyth and Turpin, 2007 ) are applicable to this task. However, the memory footprint may be not practical for large genomic sequences, e.g. sequence over tens of Giga basepairs (Gbp), as proposed SA construction methods usually need to store the entire SA in memory. Although there are also proposed SA construction methods having smaller memory footprints ( Crauser and Ferragina, 2008 ; Nong et al. , 2014 ), however, they are at the expense of speed, as they need to use external memory.
人们一直在努力构建 BWT。由于输入序列的 BWT 可以直接从相应的后缀数组(SA)中导出,许多现存的后缀数组构建方法(Smyth 和 Turpin,2007 年)都适用于这项任务。然而,对于大型基因组序列,例如超过数十亿基对(Gbp)的序列,内存占用可能并不实用,因为拟议的后缀阵列构建方法通常需要在内存中存储整个后缀阵列。虽然也有人提出了内存占用更小的 SA 构建方法(Crauser 和 Ferragina,2008 年;Nong 等人,2014 年),但这些方法需要使用外部内存,因此牺牲了速度。

Most of the state-of-the-art BWT construction methods take the advantage of the incremental construction approach, which is on the basis of the property that the relative lexicographical order of a set of sorted suffixes will not be changed by adding new suffixes. ( Hon et al. , 2007 ) proposed the first compressed suffix array (CSA) construction method using this property. Mainly, it logarithmically partitions the input sequence into blocks, and incrementally builds the CSA from shortest to longest suffixes in three steps: (i) construct the SA for a new block of suffixes; (ii) sequentially insert each of the suffixes within the new block into the CSA of the old blocks, based on the property that the relative order of the suffixes within the old blocks do not change, and the CSA values monotonically increase for the suffixes having the same initial character; (iii) merge the new and old blocks to update the CSA. There are other proposed BWT construction methods in this incremental blockwise approach, which have various implementations. ( Ferragina et al. , 2012 ) proposed a BWT construction method, bwt-disk, similar to ( Hon et al. , 2007 ), which also logarithmically partitions the input sequence. But it sorts each new block with a modified DC3 algorithm, and takes advantage of the last-first mapping (LF-mapping) property of BWT ( Ferragina and Manzini, 2005 ) to merge new and old blocks. ( Liu et al. , 2014b ) also proposed a BWT construction method, ParaBWT, in a similar manner. It uses a longest common prefix table to facilitate the sorting of newly added suffixes, but also merges the new and old blocks based on LF-mapping. The main contribution of this method is that it implements parallelization for the sorting of newly added blocks, which is beneficial for processing large input sequences. Other than constructing BWT for one or more large sequences, this approach is also used for indexing large collections of sequencing reads. Bauer et al. (2013) proposed BCR, an algorithm for constructing the BWT of a large set of reads. It uses a specific partition of SA, i.e. partitioning all the suffixes into blocks by their positions on the corresponding sequencing reads. With this partition, the markers (denoted as specific characters) of the reads can be fully used for improving the efficiency of the sorting and merging of blocks. ( Li, 2014 ) also proposed a similar method, RopeBWT2, with improved ability of handling the sequences in various lengths.
大多数最先进的 BWT 构建方法都采用了增量构建方法,该方法基于这样一个特性,即一组已排序后缀的相对词序不会因为添加新后缀而改变。(Hon 等人,2007 年)首次提出了利用这一特性的压缩后缀数组(CSA)构建方法。该方法主要是将输入序列按对数分割成块,然后分三步从最短后缀到最长后缀逐步构建 CSA:(i) 构建新后缀块的 CSA;(ii) 根据旧后缀块中后缀的相对顺序不变,且具有相同初始字符的后缀的 CSA 值单调递增的特性,将新后缀块中的每个后缀依次插入旧后缀块的 CSA 中;(iii) 合并新旧后缀块,更新 CSA。在这种增量式顺时针方法中,还有其他一些拟议的 BWT 构建方法,它们有不同的实现方式。(Ferragina 等人,2012)提出了一种 BWT 构建方法 bwt-disk,与(Hon 等人,2007)类似,也是对输入序列进行对数分割。但它采用改进的 DC3 算法对每个新块进行排序,并利用 BWT 的后先映射(LF-mapping)特性(Ferragina 和 Manzini,2005 年)合并新旧块。(Liu等人,2014b)也以类似的方式提出了一种BWT构建方法--ParaBWT。它使用一个最长公共前缀表来方便对新添加的后缀进行排序,同时也基于 LF 映射来合并新旧词块。 这种方法的主要贡献在于,它实现了新添加区块排序的并行化,有利于处理大型输入序列。除了为一个或多个大型序列构建 BWT 之外,这种方法还可用于为大型测序读取集合建立索引。Bauer 等人(2013 年)提出了 BCR 算法,用于构建大型读取集合的 BWT。它使用了一种特定的 SA 分区,即按照后缀在相应测序读数上的位置将所有后缀划分成块。通过这种划分,可以充分利用读数的标记(表示为特定字符)来提高块的排序和合并效率。(Li,2014)也提出了一种类似的方法 RopeBWT2,并改进了处理不同长度序列的能力。

Besides the blockwise incremental approach, there are also other approaches proposed. ( Karkkainen, 2007 ) proposed a method that constructs BWT in a different blockwise manner. It samples a set of suffixes as splitters to bin all the suffixes into various blocks, and for each of the blocks, the proposed method addresses all the suffixes with difference cover sample (DCS). ( Liu et al. , 2014a ) proposed a graphics processing unit (GPU-based) BWT construction method, CX1, for indexing a large set of short reads. The main idea of the method is to bin all the suffixes by their initial k-mers, and address all the bins with GPU-based radix sort. This method is mainly designed for reads with limited length, as the radix sort relies on the auxiliary characters attached to the reads.
除了顺时针递增法,还有其他一些方法。(Karkkainen, 2007)提出了一种以不同的顺时针方式构建 BWT 的方法。它将一组后缀作为分割器进行采样,将所有后缀分成不同的区块,对于每个区块,提出的方法用差分覆盖采样(DCS)处理所有后缀。( Liu 等人,2014a) 提出了一种基于图形处理器(GPU)的 BWT 构建方法 CX1,用于为大量短读建立索引。该方法的主要思路是将所有后缀按其初始 k-mers 进行分选,并使用基于 GPU 的 radix 排序对所有分选进行处理。这种方法主要针对长度有限的读取数据,因为弧度排序依赖于读取数据所附的辅助字符。

The major advantage of the incremental blockwise approach is that, based on the LF-mapping property, it provides an effective way to compare suffixes with long common prefix, which is critical to the BWT construction of large repetitive genomes. However, owing to the incremental nature, this approach is not suitable for parallel computing. Considering the rapid increase of assembled genomes, the input sequence will be much larger than before. In this situation, processing the large sequence with parallel computing is favorable, especially that modern servers and clusters have much more CPU cores and RAM than before. Recent studies have made the efforts to BWT construction with parallel computing. ParaBWT implemented the parallel BWT construction for large sequence; however, the results demonstrated that it is not scalable, i.e. the speedup will saturate when a couple of threads (e.g. eight threads) are used. This is mainly owing to the bottleneck of the incremental blockwise approach. As a non-incremental approach, CX1 is scalable; however, it has limitation on the length of input sequences.
增量式顺时针方法的主要优点是,基于 LF 映射特性,它提供了一种有效的方法来比较具有长公共前缀的后缀,这对于大型重复基因组的 BWT 构建至关重要。然而,由于其增量性质,这种方法并不适合并行计算。考虑到已组装基因组的快速增长,输入序列将比以前大得多。在这种情况下,利用并行计算处理大序列是有利的,尤其是现代服务器和集群的 CPU 内核和内存都比以前多得多。最近的一些研究为利用并行计算构建 BWT 做出了努力。ParaBWT 实现了大序列的并行 BWT 构建,但结果表明它不具有可扩展性,即当使用几个线程(如 8 个线程)时,速度提升将达到饱和。这主要是由于增量顺时针方法的瓶颈造成的。作为一种非递增方法,CX1 是可扩展的,但它对输入序列的长度有限制。

Herein, we propose de Bruijn branch-based BWT constructor (deBWT), a novel scalable parallel BWT construction method, which draws support from de Bruijn graph (dBG). The relationship between dBG and suffix trie was explored in previous studies ( Marcus et al. , 2014 ); however, it has still not been fully used in BWT construction. The main contribution of deBWT is to represent and organize the suffixes of input sequence with the property of dBG that all the copies of the same k-mer within the input sequence(s) collapse to the same vertex of the dBG of the sequence(s). The critical point of deBWT is to represent the suffixes with a novel data structure, de Bruijn branch encoding, which is derived from the unipaths of the dBG of the input sequence. This data structure facilitates the comparison between the suffixes with long common prefix. Moreover, deBWT partitions the whole BWT into blocks by their initial k-mers, and uses the property of dBG to avoid unnecessary sorting for some of the blocks, i.e. the BWT characters of some blocks can be derived in constant time based on the topology of the graph.
在此,我们提出了基于 de Bruijn 分支的 BWT 构建器(deBWT),这是一种新颖的可扩展并行 BWT 构建方法,它从 de Bruijn 图(dBG)中获得支持。之前的研究(Marcus 等人,2014 年)探讨了 dBG 与后缀三元组之间的关系;然而,它在 BWT 构建中仍未得到充分应用。deBWT 的主要贡献在于利用 dBG 的特性来表示和组织输入序列的后缀,即输入序列中相同 k-mer 的所有副本都塌缩到序列 dBG 的相同顶点上。deBWT 的关键点在于用一种新颖的数据结构--de Bruijn 分支编码--来表示后缀,这种数据结构来自输入序列 dBG 的单路径。这种数据结构便于比较具有长共同前缀的后缀。此外,deBWT 还能将整个 BWT 按其初始 k-mers 分割成块,并利用 dBG 的特性避免对某些块进行不必要的排序,即某些块的 BWT 字符可根据图的拓扑结构在恒定时间内导出。

We benchmarked deBWT with various datasets, and the result suggests that it has fast speed and good scalability to multiple threads. Especially, deBWT is well-suited to the BWT construction of a collection of highly similar genomic sequences, such as multiple human genomes, which may have wide application in the future genomics studies.
我们用各种数据集对 deBWT 进行了基准测试,结果表明它具有快速的速度和良好的多线程可扩展性。特别是,deBWT 非常适合构建高度相似的基因组序列集合(如多个人类基因组)的 BWT,这可能会在未来的基因组学研究中得到广泛应用。

2 Methods 2 种方法

2.1 Preliminary 2.1 初步情况

Let a DNA sequence, G , be a sequence over the alphabet Σ=A,C,G,T having G characters. Further, we assume the sequence to be indexed is S=G$ , where $ is an auxiliary character, and the lexicographical order of the alphabet of S is A<C<G<T<$ . Moreover, Si, i=0,,G denotes the i-th character of S , and S[i,j] denotes the substring of S starting at Si and ending at Sj .
假设 DNA 序列 G 是字母表 Σ=A,C,G,T 上的一个序列,具有 G 个字符。此外,我们假设要索引的序列是 S=G$ ,其中 $ 是一个辅助字符,而 S 字母表的词序是 A<C<G<T<$ 。此外, Si, i=0,,G 表示 S 的第 i 个字符, S[i,j] 表示 SSi 开始到 Sj 结束的子串。

A suffix of S is a substring S[i,G] , i=0,,G , and the SA of S is a function that SAi=j , i=0,,G , where j is the starting position of the i-th lexicographically smallest suffix of S . The BWT of S , BS , is the permutation of the characters of S that, BSi=S[SAi-1] , if SAi0 , and BSi=$ , otherwise.
S 的后缀是子串 S[i,G] , i=0,,G , 而 S 的 SA 是函数 SAi=j , i=0,,G , 其中 jS 的第 i 个词性最小后缀的起始位置。 S , BS , 的 BWT 是 S 字符的置换,如果 SAi0 , 则为 BSi=S[SAi-1] , 否则为 BSi=$

The dBG of G , DG , is a directed graph, where the vertices consist of all the k-mers of G . Each of the vertices is denoted as KMi , i=1,,DG , where DG is the total number of distinct k-mers. For any pair of vertices of DG , KMi, KMj , there is a directed edge KMiKMj , only if KMi and KMj have a k-1 overlapping, i.e. KMi1,k-1=KMj[0,k-2] . With this definition, a set of maximal non-branched paths can be derived from DG . Here, a maximal non-branch path indicates a path that meets the following conditions: (i) for the first vertex, the in-degree is 0 or >1, and the out-degree is 1; (ii) for the last vertex, the out-degree is 0 or >1, and the in-degree 1; (iii) for all the other internal vertices, the in- and out-degrees are exactly 1 ( Tomescu and Medvedev, 2016 ). Such paths are usually termed as ‘unipaths’ or ‘unitigs’, which are commonly used in the de novo assembly of genome ( Gnerre et al. , 2011 ; Zimin et al. , 2013 ; Tomescu and Medvedev, 2016 ). We used the term ‘unipath’ in the following sections. For the convenience of discussion, we assign each of the unipaths of DG an identity, Uj, j=1,,UG , where UG is the total number of the unipaths.
G , DG , 的 dBG 是一个有向图,其中顶点由 G 的所有 k-mers 组成。每个顶点都表示为 KMi , i=1,,DG ,其中 DG 是不同 k 分子的总数。对于 DG , KMi, KMj 的任意一对顶点,只有当 KMiKMj 有 k-1 重合,即 KMi1,k-1=KMj[0,k-2] 时,才有有向边 KMiKMj 。根据这一定义,可以从 DG 推导出一组最大非分支路径。 这里,最大非分支路径表示满足以下条件的路径:(i) 对于第一个顶点,内度为 0 或 >1,外度为 1;(ii) 对于最后一个顶点,外度为 0 或 >1,内度为 1;(iii) 对于所有其他内部顶点,内度和外度正好为 1 ( Tomescu 和 Medvedev,2016)。这种路径通常被称为 "unipaths "或 "unitigs",常用于基因组的从头组装(Gnerre 等人,2011;Zimin 等人,2013;Tomescu 和 Medvedev,2016)。我们在下文中使用了 "unipath "一词。为了讨论方便,我们给 DG 的每一条单路径分配了一个身份,即 Uj, j=1,,UG ,其中 UG 是单路径的总数。

In the following subsections, we present the unipath-based BWT construction approach at first (Section 2.2), then the overview of deBWT (Section 2.3) and more detailed information about the implementation of the various steps of the method (Section 2.4-2.7).
在下面的小节中,我们将首先介绍基于单路径的 BWT 构建方法(第 2.2 节),然后介绍 deBWT 的概述(第 2.3 节)以及有关该方法各步骤实现的详细信息(第 2.4-2.7 节)。

2.2 Unipath-based BWT construction
2.2 基于统一路径的 BWT 构建

2.2.1 The unipath representation of suffix
2.2.1 后缀的单路径表示法

The DNA sequence G can be represented by a specific walk on the dBG DG , i.e. it equals to the sequence of collapsing a specific ordered list of the vertices of DG , KM0G,KM1G, ,KMG-kG , where each KMiG, i=0,,G-k is the corresponding k-mer starting at position i of G , and each of the two neighboring vertices within the list is a specific edge of DG . With this observation, we have the following lemma.
DNA 序列 G 可以用在 dBG DG 上的特定行走来表示,即它等同于折叠 DG , KM0G,KM1G, ,KMG-kG 顶点的特定有序列表的序列,其中每个 KMiG, i=0,,G-k 都是以 G 的位置 i 为起点的相应 k-聚合体,列表中相邻的两个顶点都是 DG 的特定边。有了这个观察结果,我们就有了下面的 Lemma。
 

Lemma 1 例题 1

a DNA sequence can be represented by an ordered list of the unipaths of the dBG.
DNA 序列可以用 dBG 的单路径有序列表来表示。

This lemma is easy to justify by collapsing all such edges, KMiGKMi+1G , within the ordered list KM0G,KM1G,,KMG-kG , where KMiG and KMi+1G are respectively single-out and single-in vertices. Thus, the ordered list can be re-written as U1G, U2G,,UUGG , where UG is the total number of the unipaths representing G , each UiG,i=1,,UG represents a unipath. It is also worth noting that, some of the identities within the list may be same to each other, as a unipath could have multiple copies. For each UiG , we further define its starting position on G as a Unipath Change Point (UCP), UCPiG . An illustration is in Supplementary Figure 1 . Furthermore, we have the following corollary.
将所有这样的边 KMiGKMi+1G 整理到有序列表 KM0G,KM1G,,KMG-kG 中,其中 KMiGKMi+1G 分别是单出顶点和单入顶点,就很容易证明这个 Lemma 的正确性。因此,有序列表可以改写为 U1G, U2G,,UUGG ,其中 UG 是代表 G 的单路径的总数,每个 UiG,i=1,,UG 代表一条单路径。值得注意的是,列表中的某些标识可能彼此相同,因为一条单路径可能有多个副本。对于每个 UiG ,我们进一步将其在 G 上的起始位置定义为单一路径变化点(UCP),即 UCPiG 。图 1 是一个示例。此外,我们还有以下推论。
 

Corollary 1 推论 1

Each of the suffixes can be represented as an ordered list of unipaths, or a substring of a specific unipath appending an ordered list of unipaths.
每个后缀都可以表示为单链路的有序列表,或特定单链路附加单链路有序列表的子串。

Fig. 1. 图 1.
 The unipath-based comparison between two suffixes with the same initial k-mer. ( a ) Because all the copies of the same k-mers collapse to the same vertex of the dBG, two suffixes, SufiG and SufjG with the same initial k-mer must link to the same offset of the same unipath (the copies of the unipaths on the DNA sequence are marked by segments with various colors). Thus, the lexicographical order of the two suffixes cannot be determined until the comparison reaches the end of the unipath, as all the corresponding characters of the two suffixes are same to each other. ( b ) When the comparison goes to new unipaths from the finished (same) unipath, the lexicographical order can be determined only if the two suffixes have two different unipaths on the corresponding positions of their unipath representation, otherwise, more unipaths are needed. In this case, both of the two suffixes have the same unipath (the red unipath) successive to the first unipath (the blue unipath), so that the comparison continues to the third unipaths. For the third unipath, the lexicographical order can be determined as the two suffixes goes to two difference branches (green and purple, respectively) at the end of the second unipath. ( c ) Owing to the property of dBG, two different unipaths must be different to each other at their first k-mers. Furthermore, if two different k-mers have the same precursor, their first (k-1) characters must be same to the last (k-1) characters of their precursor (the gray segments in the figure), i.e. the two branching k-mers are only different at their k-th character (the blocks marked as green and purple in the figure). In this situation, it only needs to compare the k-th characters of two unipaths to determine if they are the same unipath. ( d ) With this property, the de Bruijn encoding, dEG , is defined as a string concatenating all such characters, Gx+k,x=0,…,G-k-1 , along the DNA sequence, G , where the k-mer, KMxG , at the position x of G is a copy of a multiple-out vertex. Each of the Gx+k s is also termed as a branching character (marked as colored blocks in the figure). And for a position j of G , ϕj is defined as the position of Brj on dEG , where Brj is the branching character downstream and closest to the position j in S .
Open in new tabDownload slide
在新标签中打开 下载幻灯片

The unipath-based comparison between two suffixes with the same initial k-mer. ( a ) Because all the copies of the same k-mers collapse to the same vertex of the dBG, two suffixes, SufiG and SufjG with the same initial k-mer must link to the same offset of the same unipath (the copies of the unipaths on the DNA sequence are marked by segments with various colors). Thus, the lexicographical order of the two suffixes cannot be determined until the comparison reaches the end of the unipath, as all the corresponding characters of the two suffixes are same to each other. ( b ) When the comparison goes to new unipaths from the finished (same) unipath, the lexicographical order can be determined only if the two suffixes have two different unipaths on the corresponding positions of their unipath representation, otherwise, more unipaths are needed. In this case, both of the two suffixes have the same unipath (the red unipath) successive to the first unipath (the blue unipath), so that the comparison continues to the third unipaths. For the third unipath, the lexicographical order can be determined as the two suffixes goes to two difference branches (green and purple, respectively) at the end of the second unipath. ( c ) Owing to the property of dBG, two different unipaths must be different to each other at their first k-mers. Furthermore, if two different k-mers have the same precursor, their first (k-1) characters must be same to the last (k-1) characters of their precursor (the gray segments in the figure), i.e. the two branching k-mers are only different at their k-th character (the blocks marked as green and purple in the figure). In this situation, it only needs to compare the k-th characters of two unipaths to determine if they are the same unipath. ( d ) With this property, the de Bruijn encoding, dEG , is defined as a string concatenating all such characters, Gx+k,x=0,,G-k-1 , along the DNA sequence, G , where the k-mer, KMxG , at the position x of G is a copy of a multiple-out vertex. Each of the Gx+k s is also termed as a branching character (marked as colored blocks in the figure). And for a position j of G , ϕj is defined as the position of Brj on dEG , where Brj is the branching character downstream and closest to the position j in S .
具有相同初始 k-聚合物的两个后缀之间基于单路径的比较。( a ) 由于相同 k-聚合物的所有拷贝都会折叠到 dBG 的相同顶点,因此具有相同初始 k-聚合物的两个后缀 SufiGSufjG 必须链接到相同单链路的相同偏移量(DNA 序列上的单链路拷贝用不同颜色的片段标记)。因此,由于两个后缀的所有对应字符都是相同的,因此在比较到单路径末尾之前,无法确定两个后缀的词序。( b ) 当比较从已完成(相同)的单链路径进入新的单链路径时,只有当两个后缀在其单链路径表示的相应位置上有两个不同的单链路径时,才能确定其词法顺序,否则就需要更多的单链路径。在本例中,两个后缀都有相同的单链路(红色单链路),与第一个单链路(蓝色单链路)相继,因此比较继续到第三个单链路。对于第三条单向路径,由于两个后缀分别进入第二条单向路径末尾的两个不同分支(绿色和紫色),因此可以确定其词性顺序。( c ) 由于 dBG 的特性,两条不同的单链路在它们的第一个 k 元素上一定是不同的。此外,如果两个不同的 k-mers 有相同的前体,那么它们的前(k-1)个字符必须与其前体的后(k-1)个字符相同(图中的灰色段),即两个分支 k-mers 只在其第 k 个字符处不同(图中标记为绿色和紫色的块)。在这种情况下,只需比较两条单链路的第 k 个字符,就能确定它们是否是同一条单链路。 ( d ) 根据这一特性,德布鲁因编码 dEG 被定义为沿 DNA 序列 G 连接所有此类字符 Gx+k,x=0,,G-k-1 的字符串,其中 Gx 位置上的 k-mer KMxG 是多出顶点的副本。每个 Gx+k s 也被称为分支字符(图中标记为彩色块)。而对于 Gj 位置, ϕj 被定义为 BrjdEG 上的位置,其中 Brj 是分支字符的下游,最靠近 S 中的 j 位置。

Each of the suffixes, S[i,G] , can be directly represented by the ordered list KMiG,KMi+1G, ,KMG-kG , where KMiG is the corresponding k-mer at position i. If position i is a UCP, the ordered list of k-mers can be re-written as the ordered list of unipaths, UjG,Uj+1G,,UUGG ; otherwise, it can be re-written as Si,UCPj+1G-1,Uj+1G,,UUGG , where Uj+1G is the UCP closest to position i downstream, and Si,UCPj+1G-1 is the substring of the unipath that KMiG is within ( Supplementary Fig. 1 ). We term this list as the unipath representation of the suffix.
每个后缀 S[i,G] 可以直接用有序列表 KMiG,KMi+1G, ,KMG-kG 表示,其中 KMiG 是位置 i 上对应的 k-聚合物。如果位置 i 是一个 UCP,k-聚合物的有序列表可以重写为单链路的有序列表 UjG,Uj+1G,,UUGG ;否则,可以重写为 Si,UCPj+1G-1,Uj+1G,,UUGG ,其中 Uj+1G 是最靠近位置 i 下游的 UCP,而 Si,UCPj+1G-1KMiG 所在单链路的子串(补充图 1)。我们把这个列表称为后缀的单路径表示。

2.2.2 The unipath-based comparison between two suffixes
2.2.2 基于统一路径的两后缀比较

Given two suffixes of S , SufiS=S[i,G] and SufjS=S[j,G] , where ij ,  i<G-k and j<G-k , the comparison between the two suffixes can be deduced as the following two situations.
给定两个后缀分别为 SSufiS=S[i,G]SufjS=S[j,G] ,其中 ij i<G-kj<G-k ,两个后缀之间的比较可以推导出以下两种情况。

First, considering the two ordered lists of k-mers, KMiG,KMi+1G,,KMG-kG and KMjG,KMj+1G, ,KMG-kG , if KMiGKMjG , the lexicographical order of the two suffixes can be easily determined by their initial k-mers.
首先,考虑两个有序的 k-mers 列表 KMiG,KMi+1G,,KMG-kGKMjG,KMj+1G, ,KMG-kG ,如果是 KMiGKMjG ,则两个后缀的词序可以很容易地通过它们的初始 k-mers 确定。

Secondly, if KMiG=KMjG , the comparison is more complicated. In this situation, the two suffixes are two walks on DG starting at the same vertex, as all the copies of the same k-mers collapse to the same vertex of the dBG. Thus, Si,UCPi+1G-1=Sj,UCPj+1G-1 , as the two strings are the same substring of the corresponding unipath. An illustration is in Figure 1a .
其次,如果 KMiG=KMjG ,比较就更复杂了。在这种情况下,两个后缀是在 DG 上从同一个顶点开始的两次行走,因为相同 k-mers 的所有副本都塌缩到了 dBG 的同一个顶点。因此, Si,UCPi+1G-1=Sj,UCPj+1G-1 ,因为这两个字符串是相应单路径的同一子串。图 1a 举例说明了这一点。

Then it becomes an iterative comparison between the aligned unipaths, i.e. if the two unipaths, Ui+1G and Uj+1G , are different, the lexicographical order can be determined, otherwise we need to compare the following unipaths until we meet two different unipaths ( Fig. 1b ).
如果 Ui+1GUj+1G 这两条单向路径不同,则可以确定其词序,否则就需要比较下面的单向路径,直到遇到两条不同的单向路径为止(图 1b)。

Considering the property of dBG, if the two unipaths, Ui+lG and Uj+lG , are identical to each other, the starting vertices of Ui+l+1G and Uj+l+1G must be two vertices with the same precursor; thus, the first k-1 characters of the two vertices must be same, i.e. the comparison can be done by only checking the k-th characters of the two unipaths, i.e. GUCPi+l+1G+k-1 and GUCPj+l+1G+k-1 ( Fig. 1c ). Moreover, if the last vertex of Ui+lG ( Uj+lG ) is a single-out vertex, the comparison can be also omitted because the next unipaths must be the same.
考虑到 dBG 的特性,如果 Ui+lGUj+lG 这两条单向路径完全相同,那么 Ui+l+1GUj+l+1G 的起始顶点一定是两个具有相同前体的顶点,因此这两个顶点的前 k-1 个字符一定相同,即只需检查两条单向路径的第 k 个字符,即 GUCPi+l+1G+k-1GUCPj+l+1G+k-1 即可完成比较(图 1c)。此外,如果 Ui+lG ( Uj+lG ) 的最后一个顶点是单出顶点,也可以省略比较,因为接下来的单路径必须是相同的。

With these observations, we designed a novel data structure named as de Bruijn branch encoding ( Fig. 1d ). The de Bruijn branch encoding of G , dEG , is defined as the concatenation of all such characters Gi+k, i=0,,G-k-1 meeting the condition that KMiG corresponds to a multiple-out vertex. Each of the characters of dEG is also called a branching character. Then, for each of the SufjT , j<G-k , we define its projection suffix on dEG as dEGϕj,dEG-1 , where ϕj is the dEG coordinate of the first branching character after the position j in S , and dEG is the length of dEG .
根据这些观察结果,我们设计了一种名为 de Bruijn 分支编码的新型数据结构(图 1d)。 G , dEG 的 de Bruijn 分支编码被定义为所有满足 KMiG 对应多出顶点条件的字符 Gi+k, i=0,,G-k-1 的连接。 dEG 的每个字符也称为分支字符。然后,对于 SufjT , j<G-k 中的每个字符,我们将其在 dEG 上的投影后缀定义为 dEGϕj,dEG-1 ,其中 ϕjS 中 j 位置后第一个分支字符的 dEG 坐标,而 dEGdEG 的长度。

Lemma 2: For two suffixes at least k bp long, their lexicographical order can be determined by comparing their projection suffixes defined by the de Bruijn branch encoding of the DNA sequence, if the initial k-mers of the two suffixes are identical.
定理 2:对于长度至少为 k bp 的两个后缀,如果两个后缀的初始 k-mers 相同,则可以通过比较 DNA 序列的 de Bruijn 分支编码所定义的投影后缀来确定它们的词典顺序。

This lemma is easy to justify with the observations mentioned above, and it provides a cost-effective solution of the comparison between two suffixes with long common prefix.
根据上述观察结果,这个 Lemma 很容易得到证明,而且它为具有长共同前缀的两个后缀之间的比较提供了一个经济有效的解决方案。

2.2.3 The k-mer partition of BWT
2.2.3 BWT 的 k-mer 分区

According to the definition of SA, the suffixes starting with identical k-mers (suppose it as KMi ) constitute a continuous block in SA, as a suffix with a different initial k-mer (e.g. KMj ) must be larger or smaller than all the suffixes starting with KMi . Thus, given DG , the BWT of S can be partitioned into DG+k parts. DG of them correspond to the DG k-mers of DG , i.e. each of the DG parts involves all the suffixes with a specific k-mers of DG ; and each of the remaining k parts corresponds to a specific suffix whose starting position is less than k bp previous to the end of S . This partition (called k-mer partition of BWT) can be constructed in two steps: (i) sort all the k-mers of DG and the last k suffixes of S by their lexicographical order; (ii) bin all the  >k bp long suffixes of S into the corresponding parts by their initial k-mers. After the two steps, the BWT construction becomes DG sub-problems, i.e. separately determining the lexicographical order of the suffixes within each of the DG , as the lexicographical order of the suffixes with different k-mers have been implicitly determined during the construction of the k-mer partition.
根据 SA 的定义,以相同 k-mer 开始的后缀(假设为 KMi )在 SA 中构成一个连续的块,因为具有不同初始 k-mer 的后缀(例如 KMj )必须大于或小于所有以 KMi 开始的后缀。因此,给定 DG 后, S 的 BWT 可以划分为 DG+k 部分。其中的 DG 部分对应于 DGDG k-mers,即 DG 部分中的每一部分都涉及 DG 中具有特定 k-mers 的所有后缀;而剩下的 k 部分中的每一部分都对应于起始位置小于 S 末尾前 k bp 的特定后缀。这个分区(称为 BWT 的 k-mer 分区)可分两步构建:(i) 将 DG 的所有 k-mers 和 S 的最后 k 个后缀按其词性顺序排序;(ii) 将 S 的所有 >k bp 长后缀按其初始 k-mers 分为相应的部分。经过这两步后,BWT 的构建就变成了 DG 个子问题,即分别确定 DG 中每个后缀的词法顺序,因为在构建 k-mer 分区时已经隐含地确定了不同 k-mer 的后缀的词法顺序。

Thus, for each of the DG parts corresponding to various initial k-mers, the task is the comparison of a series of suffixes with the same initial k-mer, which can be implemented with the help of the de Bruijn branch encoding dEG . Moreover, the problem can be further reduced with the following lemma.
因此,对于对应于不同初始 k-mer 的 DG 部分,任务是比较一系列具有相同初始 k-mer 的后缀,这可以借助 de Bruijn 分支编码 dEG 来实现。此外,这个问题还可以通过下面的 Lemma 进一步简化。

Lemma 3: If a vertex KMi of DG ( i=1,,DG ) is single-in, the corresponding BWT part consists of KMi same characters, where KMi is the number of the copies of KMi in G , only except if the part involves the first suffix.
定理 3:如果 DG ( i=1,,DG ) 的顶点 KMi 是单入的,则相应的 BWT 部分由 KMi 个相同的字符组成,其中 KMiKMiG 中的副本数,除非该部分涉及第一个后缀。

This lemma is obtained with the observation that if a vertex of the dBG is single-in, all the suffixes with this k-mer must have the same previous character, i.e. the BWT of this part is purely KMi copies of the first character of the precursor of KMi . This is except for the parts involving the first suffix, as the BWT character of the first suffix is $. With this lemma, only the parts labeled by multiple-in vertices need to be sorted. Thus, there are at most UG sub-problems, as such vertices must be the initial k-mers of the unipaths.
这个 Lemma 是通过观察得到的,即如果 dBG 的一个顶点是单入顶点,那么所有带有这个 k-mer 的后缀都必须具有相同的前一个字符,即这个部分的 BWT 纯粹是 KMi 前体的第一个字符的 KMi 副本。除了涉及第一个后缀的部分,因为第一个后缀的 BWT 字符是 $。有了这个 Lemma,只需对多入顶点标记的部分进行排序。因此,最多有 UG 个子问题,因为这些顶点必须是单链路的初始 k-mers 。

2.3 Overview of the deBWT method
2.3 deBWT 方法概述

As in many cases, the input is not only one, but a set of DNA sequences, such as a set of genomes, chromosomes or assembled contigs, we re-define the sequence to be indexed as S=G1#G2##GND$ , where ND is the number of input DNA sequences, and each of Gi,i=1,, ND is a specific sequence. # is another auxiliary character, and the alphabet becomes A<C<G<T<#<$ .
由于在很多情况下,输入的不只是一个,而是一组 DNA 序列,如一组基因组、染色体或组装等位基因,因此我们将序列重新定义为 S=G1#G2##GND$ ,其中 ND 是输入 DNA 序列的数量,而 Gi,i=1,, ND 中的每一个都是一个特定的序列。 # 是另一个辅助字符,字母表变为 A<C<G<T<#<$

DeBWT constructs the BWT of S in the following three major phases.
DeBWT 分以下三个主要阶段构建 S 的 BWT。

  • dBG building and analysis: deBWT builds a dBG of all the involved sequences G1,G2,,GND with a user-defined parameter k, sort the k-mers to build the k-mer partition of the BWT, recognize all the unipaths as well as the multiple-out and multiple-in vertices and solves all the parts corresponding to single-in vertices.
    dBG 构建和分析:deBWT 用用户定义的参数 k 对所有涉及的序列 G1,G2,,GND 构建 dBG,对 k 单链体进行排序以构建 BWT 的 k 单链体分区,识别所有单链路以及多出和多入顶点,并解决与单入顶点相对应的所有部分。

  • de Bruijn branch encoding generation: deBWT computes the de Bruijn branch encoding of S ( dES ), bins all the suffixes of S corresponding to the unsolved parts of BWT and computes their projection suffixes.
    de Bruijn 分支编码生成:deBWT 计算 S ( dES ) 的 de Bruijn 分支编码,将 S 中对应于 BWT 未解决部分的所有后缀进行分类,并计算其投影后缀。

  • BWT construction with projection suffixes: for each of the unsolved BWT parts, deBWT builds the SA of the involved projection suffixes to determine the BWT characters of the part.
    使用投影后缀构建 BWT:对于每个未解决的 BWT 部分,deBWT 都会构建相关投影后缀的 SA,以确定该部分的 BWT 字符。

A schematic illustration of the method is in Figure 2 .
图 2 是该方法的示意图。

Fig. 2. 图 2.
 A schematic illustration of the deBWT method. ( a ) DeBWT initially builds a dBG of the input sequence(s) with a user-defined parameter, k, which determines the size of the vertices. The dBG is then analyzed to build the k-mer partition of the BWT and recognize all the unipaths (the colored bars in the figure indicate the copies of various unipaths of the input sequence). With the unipaths, all the multiple-in and multiple-out vertices are indexed by a hash table-based data structure, de Bruijn branch index. Moreover, all the multiple-in vertices are marked. In this case, the red block indicates the first k-mer of the ‘red’ unipath of the dBG which is a multiple-in vertex, and the grey and the white blocks respectively indicate other multiple-in and -out k-mers. ( b ) DeBWT scans the input sequence(s) to recognize the branching characters with de Bruijn branch index (marked as colored reverse rectangles above the input sequence) and generate the de Bruijn branch encoding. Meanwhile, the suffixes with initial k-mers corresponding to multiple-in vertices, i.e. the suffixes belonging to the unsolved parts of the BWT, are also recognized with the index (marked as colored rectangles below the input sequence). Furthermore, for each of the suffixes within the unsolved parts, deBWT calculates its ϕ· value to determine the corresponding projection suffix and also recorded it into the de Bruijn branch index. ( c ) With de Bruijn branch index, deBWT addresses all the unsolved BWT parts by sorting the projection suffixes
Open in new tabDownload slide
在新标签中打开 下载幻灯片

A schematic illustration of the deBWT method. ( a ) DeBWT initially builds a dBG of the input sequence(s) with a user-defined parameter, k, which determines the size of the vertices. The dBG is then analyzed to build the k-mer partition of the BWT and recognize all the unipaths (the colored bars in the figure indicate the copies of various unipaths of the input sequence). With the unipaths, all the multiple-in and multiple-out vertices are indexed by a hash table-based data structure, de Bruijn branch index. Moreover, all the multiple-in vertices are marked. In this case, the red block indicates the first k-mer of the ‘red’ unipath of the dBG which is a multiple-in vertex, and the grey and the white blocks respectively indicate other multiple-in and -out k-mers. ( b ) DeBWT scans the input sequence(s) to recognize the branching characters with de Bruijn branch index (marked as colored reverse rectangles above the input sequence) and generate the de Bruijn branch encoding. Meanwhile, the suffixes with initial k-mers corresponding to multiple-in vertices, i.e. the suffixes belonging to the unsolved parts of the BWT, are also recognized with the index (marked as colored rectangles below the input sequence). Furthermore, for each of the suffixes within the unsolved parts, deBWT calculates its ϕ· value to determine the corresponding projection suffix and also recorded it into the de Bruijn branch index. ( c ) With de Bruijn branch index, deBWT addresses all the unsolved BWT parts by sorting the projection suffixes
deBWT 方法示意图。( a ) DeBWT 最初用用户定义的参数 k(决定顶点的大小)构建输入序列的 dBG。然后对 dBG 进行分析,建立 BWT 的 k-mer 分区,并识别所有单链路(图中彩色条表示输入序列中各种单链路的副本)。有了单路径,所有多入和多出顶点都会被一个基于哈希表的数据结构 de Bruijn 分支索引所索引。此外,所有多入顶点都有标记。在这种情况下,红色块表示 dBG 中 "红色 "单链路的第一个 k-分子,它是多入顶点,灰色块和白色块分别表示其他多入和多出 k-分子。( b ) DeBWT 扫描输入序列,识别具有 de Bruijn 分支索引的分支字符(输入序列上方的彩色反向矩形标记),并生成 de Bruijn 分支编码。与此同时,还能用索引识别出与多入顶点相对应的初始 k-mers 后缀,即属于 BWT 未解决部分的后缀(在输入序列下方标记为彩色矩形)。此外,对于未解决部分中的每个后缀,deBWT 都会计算其 ϕ· 值,以确定相应的投影后缀,并将其记录到 de Bruijn 分支索引中。( c ) 利用 de Bruijn 分支索引,deBWT 通过对投影后缀进行排序,处理所有未解决的 BWT 部分。

2.4 dBG building and analysis
2.4 dBG 的构建与分析

It needs all the k-mers as well as the numbers of their copies in G1,G2,,GND to build the k-mer partition and solve the single-in parts, i.e. a k-mer counting task. In the current version of deBWT, Jellyfish ( Marçais and Kingsford, 2011 ) is used. As both of the multiple-in and multiple-out vertices are needed in later steps, deBWT counts the edges of the dBG, i.e. the (k + 1)-mers, instead of the k-mers.
它需要 G1,G2,,GND 中的所有 k 分子及其副本的数量来建立 k 分子分区并解决单入部分,即 k 分子计数任务。目前的 deBWT 版本使用的是 Jellyfish(Marçais 和 Kingsford,2011 年)。由于后面的步骤都需要多入和多出顶点,因此 deBWT 计算的是 dBG 的边,即 (k + 1)-mers 而不是 k-聚合物。

To build the k-mer partition of the BWT, deBWT sorts all the (k + 1)-mers by their lexicographic order, and respectively merges all the (k + 1)-mers with identical first k bp prefixes into k-mers to build the partition. For each of the k-mers, its number of copies is calculated by directly summing up the corresponding (k + 1)-mers. Meanwhile, the multiple-out vertices can also be recognized during the merging, and deBWT records all the multiple-out vertices for building the de Bruijn branch encoding in the later step.
为了建立 BWT 的 k-聚合体分区,deBWT 将所有 (k + 1)-mers 按其词序排序,并分别将前 k 个 bp 前缀相同的所有 (k + 1)-mers 合并成 k-聚合体,以建立分区。对于每一个 k-mers,其副本数是通过直接求和相应的 (k + 1)-mers 来计算的。同时,在合并过程中还可以识别出多重出顶点,deBWT 会记录所有多重出顶点,以便在后一步建立 de Bruijn 分支编码。

The multiple-in vertices is then recognized based on the sorted (k + 1)-mer list. That is, deBWT partitions the sorted (k + 1)-mer list into four ordered lists, each corresponding to a specific initial character, A, C, G or T. Then deBWT recognizes all the multiple-in vertices by a four-way merging on the lists. During the merging, two tasks are done simultaneously, i.e. if a vertex is recognized as single-in, deBWT assigns it the first character of the corresponding (k + 1)-mer and the number of copies to solve the BWT part; otherwise, deBWT records the k-mer in another data structure to generate the ϕ· function in the later step.
然后,根据排序后的 (k + 1)-mer 列表识别多入顶点。也就是说,deBWT 将排序后的 (k + 1)-mer 列表分成四个有序列表,每个列表对应一个特定的初始字符,即 A、C、G 或 T,然后 deBWT 通过对列表进行四向合并来识别所有多入顶点。在合并过程中,有两项任务同时进行,即如果一个顶点被识别为单入顶点,deBWT 将为其分配相应 (k + 1)-mer 的首字符和副本数,以解决 BWT 部分;否则,deBWT 将在另一个数据结构中记录 k-mer,以便在后一步生成 ϕ· 函数。

It is worth noting that, owing to the existence of the auxiliary character #, deBWT recognizes all the k-mers that have at least one copy previous to and next to # as multiple-out and multiple-in k-mers, respectively.
值得注意的是,由于辅助字符 # 的存在,deBWT 将所有在 # 之前和 # 之后至少有一个副本的 k-mers 分别识别为多出 k-mers 和多入 k-mers。

The parallelization of this step is straightforward. The k-mer counting can be directly parallelized ( Marçais and Kingsford, 2011 ). There are many feasible parallel integer sorting approaches for sorting the (k + 1)-mers and we used a simple approach, i.e. a radix sort is implemented to partition all the (k + 1)-mers into blocks, and each of the blocks is further processed in parallel by integer quick sort. The merging of the k-mer lists is not executed in parallel; however, the cost is linear to the number of (k + 1)-mers and not expensive.
这一步骤的并行化非常简单。k-mer 计数可以直接并行化(Marçais 和 Kingsford,2011 年)。有许多可行的并行整数排序方法来对 (k + 1)-mers 进行排序,而我们使用了一种简单的方法,即使用弧度排序将所有 (k + 1)-mers 划分为多个区块,然后通过整数快速排序对每个区块进行并行处理。k-mer 列表的合并并不是并行执行的;不过,合并成本与 (k + 1)-mers 的数量成线性关系,并不昂贵。

2.5 The generation of de Bruijn branch encoding and projection suffixes
2.5 生成 de Bruijn 分支编码和投影后缀

DeBWT builds a hash table-based data structure, de Bruijn branch index, to index all multiple-in and multiple-out k-mers ( Supplementary Fig. 2 ). With this index, the de Bruijn branch encoding and the ϕ· function are simultaneously generated by scanning S one time.
DeBWT 建立了一个基于哈希表的数据结构,即 de Bruijn 分支索引,用于索引所有多入和多出 k-mers (附图 2)。有了这个索引,只需扫描 S 一次,就能同时生成 de Bruijn 分支编码和 ϕ· 函数。

DeBWT initially allocates an empty string as dES , a counter CdES recording the length of dES and a linear table, PdES , which records the positions and the ϕ· values for all the copies of the multiple-in k-mers ( Supplementary Fig. 2 ). Each of the multiple-in k-mers occupies a series of cells of PdES as a sub-table for their own copies. Each of the sub-table can be accessed with a specific pointer.
DeBWT 最初分配一个空字符串作为 dES ,一个计数器 CdES 记录 dES 的长度,一个线性表 PdES 记录所有多入 k-mers 副本的位置和 ϕ· 值(附图 2)。每个 "多入 k-聚合物 "都占用 PdES 的一系列单元格,作为各自副本的子表。每个子表都可以通过特定的指针访问。

DeBWT then scans S from upstream to downstream to check each of the k-mers. For a position i of S , if the corresponding k-mer, KMiS , is a multiple-in k-mer, deBWT records the character Si-1 and the value CdES+1 into the corresponding sub-table of PdES ; if KMiS is a multiple-out k-mer, deBWT appends the branching character Si+k to dES and updates CdES . Here, Si-1 is the BWT character of KMiS , and CdES+1 is the ϕi value, as the next branching character must be the first character of the projection suffix of KMiS , and CdES+1 is its position on dES .
然后,DeBWT 从上游向下游扫描 S 以检查每个 k-聚合物。对于 Si 位置,如果相应的 k-聚合体 KMiS 是多入 k-聚合体,deBWT 会将字符 Si-1 和值 CdES+1 记录到 PdES 的相应子表中;如果 KMiS 是多出 k-聚合体,deBWT 会将分支字符 Si+k 追加到 dES 并更新 CdES 。这里, Si-1KMiS 的 BWT 字符, CdES+1ϕi 的值,因为下一个分支字符必须是 KMiS 的投影后缀的第一个字符, CdES+1 是它在 dES 上的位置。

The parallelization of this step is implemented as follows. DeBWT divides S into P segments, and assigns each of the segments to a specific thread to generate local dES and ϕ· values. The value P is equal to the number of threads. It is worth noting that all the threads share the same hash table and PdES data structures with read-write locks, but have their own CdES and dES . When all the threads accomplish their own tasks, deBWT appends all the local dES s and updates the local ϕ· values generated by thread p1, 2, ,P by the following operation: assume the length of the dES of the j-th segments is ljdE,j1, 2, ,P , for each of multiple-in k-mers, KMiS , within the p-th segments ( p>0 ), the corresponding ϕ· value, ϕi , is updated as ϕi+j=0p-1ljdE .
这一步的并行化实现如下。DeBWT 将 S 分成 P 段,并将每个段分配给特定的线程,以生成本地 dESϕ· 值。值 P 等于线程数。值得注意的是,所有线程共享同一个哈希表和带有读写锁的 PdES 数据结构,但有各自的 CdESdES 。当所有线程完成各自的任务后,deBWT 会追加所有本地 dES s,并通过以下操作更新线程 p1, 2, ,P 生成的本地 ϕ· 值:假定第 j 段的 dES 长度为 ljdE,j1, 2, ,P ,对于第 p 段( p>0 )中的每一个多进 k 个单词 KMiS ,更新相应的 ϕ·ϕiϕi+j=0p-1ljdE

2.6 BWT construction with projection suffixes
2.6 带投影后缀的 BWT 结构

For each of the BWT parts corresponding to the multiple-in k-mers, deBWT constructs the SA of the projection suffixes with the dES and the ϕ· function. The SA is built by straightforwardly quick-sorting the involved projection suffixes. As all the unsolved parts are independent, it is also easy to accomplish the tasks in parallel.
对于与多入 k-mers 对应的每个 BWT 部分,deBWT 使用 dESϕ· 函数构建投影后缀的 SA。SA 是通过直接对涉及的投影后缀进行快速排序来构建的。由于所有未解决的部分都是独立的,因此也很容易并行地完成任务。

We did a modification on the recursive process of the original quick-sort method to improve the efficiency. That is, for a specific sub-array of suffixes transferred into the recursive function, deBWT checks whether all the suffixes have the same BWT characters at first. If this is the case, deBWT marks the sub-array as sorted, as the precise lexicographical order of these suffixes is not necessary for the BWT construction; otherwise, deBWT calls the original recursive function to further sort the sub-array. This modification can also be seen as an extension of Lemma 3.
为了提高效率,我们对原始快速排序方法的递归过程进行了修改。也就是说,对于转入递归函数的特定后缀子数组,deBWT 会首先检查所有后缀是否都有相同的 BWT 字符。如果是这样,deBWT 会将该子数组标记为已排序,因为 BWT 结构并不需要这些后缀的精确词序;否则,deBWT 会调用原始递归函数对子数组进行进一步排序。这一修改也可以看作是对 Lemma 3 的扩展。

2.7 Additional processing
2.7 额外处理

After all the operations mentioned above, there are still ND×k unsolved BWT characters, each corresponding to one of the suffixes, which start less than k positions before # or $. These suffixes are initially set aside, and deBWT builds the SA of such suffixes stand-alone to fill the BWT string. As there are only few such suffixes, this sorting is implemented by directly comparing the original sequences of the suffixes.
经过上述所有操作后,仍有 ND×k 个未解决的 BWT 字符,每个字符对应一个后缀,这些后缀的起始位置小于 # 或 $ 之前的 k 个位置。这些后缀最初会被搁置,deBWT 会单独建立这些后缀的 SA 来填充 BWT 字符串。由于此类后缀数量很少,因此这种排序是通过直接比较后缀的原始序列来实现的。

3 Results 3 项成果

We benchmarked deBWT with three datasets mimicking various real application scenarios. (i) A dataset consists of 10 in silico human genomes (totally 30.9 Gbp). Each of the genomes is generated by integrating the variants of a specific sample from 1000 Genomes ( The 1000 Genomes Project Consortium, 2015 ) into the human reference genome GRCh37/Hg19. This dataset mimics the indexing of multiple individual human genomes, which has many applications in genomic studies. (ii) A dataset consists of a set of simulated contigs. As long read sequencing technologies, such as Single Molecular Real-Time sequencing, have improved the contig N50 of human genome assembly to >10 million bp ( http://www.pacb.com/blog/toward-platinum-genomes-pacbio-releases-a-new-higher-quality-chm1-assembly-to-ncbi/ ), we randomly extracted 3000 sequences (each is about 10M bp long, totally 30.2 Gbp) from the 10 in silico human genomes with an in-house script, which is revised from Wgsim simulator ( Li et al. , 2009b ; https://github.com/lh3/wgsim ). (iii) A dataset consists of eight primate genomes including gibbon, gorilla, orangutan, rhesus, baboon, chimp, bonobo and human (downloaded from: http://hgdownload.soe.ucsc.edu/downloads.html ). This dataset assessed the ability of deBWT to index more diverse genomes.
我们用三个模拟各种实际应用场景的数据集对 deBWT 进行了基准测试。(i) 一个数据集由 10 个硅人类基因组(总计 30.9 Gbp)组成。每个基因组都是将来自 1000 基因组(The 1000 Genomes Project Consortium, 2015)的特定样本的变异整合到人类参考基因组 GRCh37/Hg19 中生成的。该数据集模拟了多个人类个体基因组的索引,在基因组研究中有许多应用。(ii) 数据集由一组模拟等位基因组成。由于单分子实时测序等长读测序技术已将人类基因组组装的等位基因 N50 提高到大于 1000 万 bp ( http://www.pacb.com/blog/toward-platinum-genomes-pacbio-releases-a-new-higher-quality-chm1-assembly-to-ncbi/ ),我们从 10 个人类基因组中随机提取了 3000 个序列(每个序列长约 1000 万 bp,共计 30.(iii) 一个由长臂猿、大猩猩、猩猩、恒河猴、狒狒、黑猩猩、倭黑猩猩和人类等 8 个灵长类动物基因组组成的数据集(下载地址: http://hgdownload.soe.ucsc.edu/downloads.html )。该数据集评估了 deBWT 索引更多样化基因组的能力。

The benchmark was implemented on a server with four Intel Xeon E4820 CPUs (32 cores in total) at 2.00 GHz and 1 Terabytes RAM, running Linux Ubuntu 14.04. DeBWT uses Jellyfish (version 2.1.4; Marçais and Kingsford, 2011 ) for implementing the k-mer counting of the input sequences (the parameter k is configured as 31). Two recently published methods, RopeBWT2 ( Li, 2014 ) and ParaBWT (version 1.0.8-binary-x86_64) ( Liu et al. , 2014b ), were also performed on the same datasets for comparison.
该基准测试是在一台服务器上进行的,该服务器配备四颗英特尔至强 E4820 CPU(共 32 核),主频为 2.00 GHz,内存为 1 TB,运行 Linux Ubuntu 14.04。DeBWT 使用 Jellyfish(版本 2.1.4;Marçais 和 Kingsford,2011 年)对输入序列进行 k-mer 计数(参数 k 设置为 31)。最近发表的两种方法 RopeBWT2 ( Li, 2014 ) 和 ParaBWT (version 1.0.8-binary-x86_64) ( Liu et al. , 2014b ) 也在相同的数据集上进行了比较。

At first, we tested the performance of deBWT with 32 threads, i.e. running with all the 32 CPU cores of the server. ParaBWT was also run with 32 threads, but RopeBWT2 was run with its default setting, as it does not support parallel computing. The elapsed time ( Table 1 ) indicates that deBWT and ParaBWT have comparable speed, while RopeBWT2 is slower, likely owing to the fact that it does not support parallel computing and the algorithm could not be suited to long sequences. We further investigate the processing of deBWT, and found that the speed of deBWT was largely slowed down by Jellyfish owing to the format of its output file. The default output of Jellyfish is a binary file in an unpublished format. As the details about the format is unknown for us, we used the ‘dump’ command of Jellyfish to convert the output file into text file, and then converted the text file into binary file in our own format as the input of further steps. This file conversion costs a couple of hours for all the three datasets, i.e. about 60–70% of the total running time. The time cost would be much reduced if the output format of jellyfish was available, or if other k-mer counting tools with similar performance and readable output format were used. Deducting the time of file conversion, deBWT is much faster than the other two methods.
首先,我们用 32 个线程测试了 deBWT 的性能,即使用服务器的全部 32 个 CPU 内核运行。ParaBWT 也使用 32 个线程运行,但由于 RopeBWT2 不支持并行计算,因此使用其默认设置运行。运行时间(表 1)表明,deBWT 和 ParaBWT 的速度相当,而 RopeBWT2 的速度较慢,这可能是由于它不支持并行计算,算法不适合长序列。我们进一步研究了 deBWT 的处理过程,发现由于输出文件的格式问题,deBWT 的速度在很大程度上被 Jellyfish 拖慢了。Jellyfish 的默认输出是一个未公开格式的二进制文件。由于我们不知道格式的详细信息,因此我们使用 Jellyfish 的 "dump "命令将输出文件转换成文本文件,然后将文本文件转换成我们自己格式的二进制文件,作为进一步步骤的输入。对所有三个数据集来说,文件转换花费了几个小时,约占总运行时间的 60-70%。如果水母的输出格式可用,或者使用其他具有类似性能和可读输出格式的 k-mer 计数工具,时间成本将大大降低。扣除文件转换时间,deBWT 比其他两种方法快得多。

Table 1.

Running Time with 32 CPU cores (in minutes)


表 1.32 个 CPU 内核的运行时间(分钟)
Methods 方法Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
deBWT134129330
deBWT (no conversion) deBWT (无转换)4856100
ParaBWT241262180
RopeBWT2169422471546
MethodsHuman genomesHuman contigsPrimate genomes
deBWT134129330
deBWT (no conversion)4856100
ParaBWT241262180
RopeBWT2169422471546

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出格式转换的时间。

Table 1.

Running Time with 32 CPU cores (in minutes)


表 1.32 个 CPU 内核的运行时间(分钟)
Methods 方法Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
deBWT134129330
deBWT (no conversion) deBWT (无转换)4856100
ParaBWT241262180
RopeBWT2169422471546
MethodsHuman genomesHuman contigsPrimate genomes
deBWT134129330
deBWT (no conversion)4856100
ParaBWT241262180
RopeBWT2169422471546

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出格式转换的时间。

We investigated the time cost of the various steps of deBWT ( Table 2 ). Mainly, two issues are observed.
我们研究了 deBWT 各个步骤的时间成本(表 2)。主要发现了两个问题。

Table 2.

The time of the various steps of deBWT (in minutes)


表 2.deBWT 各步骤的时间(分钟)
Steps 步骤Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
Phase1: dBG building and analysis
第 1 阶段: dBG 建设与分析
k-mer counting k 元计数161626
File conversion 文件转换8774229
k-mer sorting k 元排序338
dBG analysis dBG 分析8719
Phase2: The generation of de Bruijn branch encoding and projection suffixes
第 2 阶段:生成 de Bruijn 分支编码和投影后缀
de Bruijn branch encoding and ϕ· values generation
de Bruijn 分支编码和 ϕ· 值生成
91216
Phase3: BWT construction with projection suffixes
第 3 阶段:使用投影后缀构建 BWT
Projection suffixes sorting
投影后缀排序
41210
Additional processing 附加处理
Additional processing 附加处理7622
Supplement
k-mer counting with KMC2 使用 KMC2 进行 k-聚合物计数7912
StepsHuman genomesHuman contigsPrimate genomes
Phase1: dBG building and analysis
k-mer counting161626
File conversion8774229
k-mer sorting338
dBG analysis8719
Phase2: The generation of de Bruijn branch encoding and projection suffixes
de Bruijn branch encoding and ϕ· values generation 91216
Phase3: BWT construction with projection suffixes
Projection suffixes sorting41210
Additional processing
Additional processing7622
Supplement
k-mer counting with KMC27912
Table 2.

The time of the various steps of deBWT (in minutes)


表 2.deBWT 各步骤的时间(分钟)
Steps 步骤Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
Phase1: dBG building and analysis
第 1 阶段: dBG 建设与分析
k-mer counting k 元计数161626
File conversion 文件转换8774229
k-mer sorting k 元排序338
dBG analysis dBG 分析8719
Phase2: The generation of de Bruijn branch encoding and projection suffixes
第 2 阶段:生成 de Bruijn 分支编码和投影后缀
de Bruijn branch encoding and ϕ· values generation
de Bruijn 分支编码和 ϕ· 值生成
91216
Phase3: BWT construction with projection suffixes
第 3 阶段:使用投影后缀构建 BWT
Projection suffixes sorting
投影后缀排序
41210
Additional processing 附加处理
Additional processing 附加处理7622
Supplement
k-mer counting with KMC2 使用 KMC2 进行 k-聚合物计数7912
StepsHuman genomesHuman contigsPrimate genomes
Phase1: dBG building and analysis
k-mer counting161626
File conversion8774229
k-mer sorting338
dBG analysis8719
Phase2: The generation of de Bruijn branch encoding and projection suffixes
de Bruijn branch encoding and ϕ· values generation 91216
Phase3: BWT construction with projection suffixes
Projection suffixes sorting41210
Additional processing
Additional processing7622
Supplement
k-mer counting with KMC27912

First, most of the core steps of deBWT, i.e. k-mer counting, k-mer sorting, de Bruijn branch encoding and ϕ· values generation and projection suffixes sorting, are efficient. This is because of a couple of reasons. (i) The de Bruijn branch code greatly reduces the cost of sorting suffixes with long common prefixes. We investigated the lengths of the generated de Bruijn branch code, and found that, for both of the genomes and the contigs datasets, their lengths are respectively one order shorter than those of the original input sequences. Under this circumstance, the comparison between projection suffixes is much less expensive than that of the original suffixes. Furthermore, the k-mer partition of BWT also helps to reduce many unnecessary comparison operations. (ii) The designs of these steps are suitable for parallel computing, which can fully use the multiple CPU cores. It is worth noting that, besides the parallel implementation of the core steps of deBWT, the state-of-the-art k-mer counting tool also has good parallelization. As k-mer counting is still an open problem with wide application, there are a few choices for this step. We also tried a newer published tool, KMC2 ( Deorowicz et al. , 2015 ), and obtained even faster speed ( Table 2 ). However, KMC2 also outputs a binary file, which is hard to directly interpret. It would be beneficial if the state-of-the-art k-mer counting tools have an easy-to-interpret output file. (iii) Besides the parallelism, multiple steps, i.e. k-mer counting, the radix sort of k-mers, de Bruijn branch encoding and ϕ· values generation, have quasi linear time complexity.
首先,deBWT 的大多数核心步骤,即 k-mer 计数、k-mer 排序、de Bruijn 分支编码和 ϕ· 值生成以及投影后缀排序,都是高效的。这有几个原因(i) de Bruijn 分支编码大大降低了对具有长公共前缀的后缀进行排序的成本。我们研究了生成的 de Bruijn 分支代码的长度,发现对于基因组和 contigs 数据集,它们的长度分别比原始输入序列的长度短一个数量级。在这种情况下,对投影后缀进行比较的成本要比对原始后缀进行比较的成本低得多。此外,BWT 的 k-mer 分割也有助于减少许多不必要的比较操作。(ii) 这些步骤的设计适合并行计算,可以充分利用多个 CPU 内核。值得注意的是,除了 deBWT 核心步骤的并行化实现,最先进的 k-mer 计数工具也具有良好的并行化能力。由于 k-mer计数仍是一个应用广泛的开放性问题,因此这一步骤有多种选择。我们还尝试了一种最新发布的工具 KMC2(Deorowicz 等人,2015 年),其速度更快(表 2)。不过,KMC2 输出的也是二进制文件,很难直接解释。如果最先进的 k-mer计数工具能有一个易于解读的输出文件,那将是非常有益的。(iii) 除了并行性,多个步骤,即 k-mer计数、k-mer的弧度排序、de Bruijn分支编码和 ϕ· 值生成,都具有准线性时间复杂性。

Second, the I/O operation is the main issue slowing down the method. Other than the file conversion step mentioned above, there are also many I/O operations in the ‘de Bruijn graph analysis’ and the ‘additional processing’ steps. That is, in the dBG analysis step, deBWT needs to merge the files recording the four ordered lists of k-mers to recognize the multiple-in k-mers; and in the additional processing step, deBWT needs to convert the large sequences to be indexed from text (fasta format) into binary format and merge various BWT parts. Although these operations theoretically have low time complexity, they also depend on the performance of the file system of the computer as well as the implementation of the program. It is also an important future work for us to further optimize these operations.
其次,I/O 操作是拖慢该方法的主要问题。除了上面提到的文件转换步骤,在 "德布鲁因图分析 "和 "附加处理 "步骤中也有很多 I/O 操作。也就是说,在 dBG 分析步骤中,deBWT 需要合并记录四个有序 k-mers 列表的文件,以识别多入 k-mers;而在附加处理步骤中,deBWT 需要将需要索引的大序列从文本(fasta 格式)转换为二进制格式,并合并 BWT 的各个部分。虽然这些操作理论上时间复杂度较低,但也取决于计算机文件系统的性能以及程序的实现。进一步优化这些操作也是我们未来的一项重要工作。

Other than the two issues mentioned above, the time cost of the projection suffixes sorting step is especially critical, as it is the core step to handle the long repetitions within the input sequence(s). The total time cost of this step is j=1UGtUjG , where tUjG is the time for solving the j-th unsolved part of the BWT. As each of the unsolved parts are handled by quicksort, the time cost can be represented as follows ( Bentley and Sedgewick, 1997 ; Karkkainen, 2007 ): tUjG=ONjlogNj+i=1NjDPPSi , where Nj is the number of the projection suffixes involved in the unsolved part j, and DPPSi is the length of the distinguishing prefix of the i-th projection suffix (denoted as PSi ) of the part. Here, the distinguishing prefix of PSi is the shortest prefix of PSi , which is necessary to determine the BWT characters of the corresponding part. For the highly similar input sequences,  G1 , G2 , …, GND , the upper bound of DPPSi is OdEGM in theory, owing to the existence of long repetitions, where GM is the input sequence having the longest de Bruijn branch encoding, and dEGM is the length of the corresponding de Bruijn branch encoding. For example, two sequences Gi and Gj could be almost the same; thus, the length of distinguishing prefix could be close to the length of the de Bruijn branch encoding of the sequences. As the total number of the branching characters of GM , the value dEGM does not only depend on how many unipaths GM has, but also how many copies of the unipaths there are.
除上述两个问题外,投影后缀排序步骤的时间成本尤为关键,因为它是处理输入序列中长重复的核心步骤。该步骤的总时间成本为 j=1UGtUjG ,其中 tUjG 是解决 BWT 第 j 个未解决部分的时间。由于每个未解决部分都由 quicksort 处理,时间成本可表示如下(Bentley 和 Sedgewick,1997;Karkkainen,2007): tUjG=ONjlogNj+i=1NjDPPSi ,其中 Nj 是未解决部分 j 所涉及的投影后缀的数量, DPPSi 是该部分第 i 个投影后缀的区分前缀的长度(表示为 PSi )。这里, PSi 的区分前缀是 PSi 的最短前缀,这是确定相应部分的 BWT 字符所必需的。对于高度相似的输入序列  G1G2 、......、 GND ,由于存在长重复,理论上 DPPSi 的上界是 OdEGM ,其中 GM 是具有最长 de Bruijn 分支编码的输入序列,而 dEGM 是相应 de Bruijn 分支编码的长度。例如,两个序列 GiGj 可能几乎相同,因此,区分前缀的长度可能与序列的 de Bruijn 分支编码长度接近。作为 GM 的分支字符总数, dEGM 的值不仅取决于 GM 有多少个单链路,还取决于有多少个单链路副本。

Although the theoretical upper bound is large, however, DPPSi also greatly depends on the distributions of genomic variations as well as the repetitiveness of the input sequences, which could make it lower in practice. To more precisely investigate the time cost, we assessed the DPj¯ values and the DPjM values of the most repetitive dataset in the benchmark (i.e. the 10 human genomes dataset), where DPj¯ and DPjM are respectively the mean and maximal length of the distinguishing prefix of the projection suffixes within the j-th unsolved part. A series of quantiles of DPj¯ and DPjM values of the 10 human genomes dataset are shown in Table 3 . These quantiles indicate that, for most of the blocks, the distinguishing prefixes are short, e.g. the 0.90 quantile of DPjM is 11 872, indicating that for 90% of the unsolved BWT parts, the max length of the distinguishing prefixes is shorter than 11 872 characters. This is not expensive to determine their lexicographical orders by a straightforward comparison. Thus, the overall cost of this step is not high, although there is still a small proportion (<0.1%) of BWT parts having long distinguishing prefixes (average value is > 298k).
虽然理论上限很大,但 DPPSi 也在很大程度上取决于基因组变异的分布以及输入序列的重复性,这可能会使其在实际应用中更低。为了更精确地研究时间成本,我们评估了基准中重复性最高的数据集(即 10 个人类基因组数据集)的 DPj¯ 值和 DPjM 值,其中 DPj¯DPjM 分别是第 j 个未解决部分中投影后缀的区分前缀的平均长度和最大长度。表 3 列出了 10 个人类基因组数据集中 DPj¯DPjM 值的一系列量化值。这些量化值表明,对于大多数区块来说,区分前缀都很短,例如 DPjM 的 0.90 量化值为 11 872,表明对于 90% 的未解决 BWT 部分来说,区分前缀的最大长度短于 11 872 个字符。通过直接比较来确定它们的词序并不昂贵。因此,尽管仍有一小部分(<0.1%)BWT 部件具有较长的区分前缀(平均值大于 298k),但这一步的总体成本并不高。

Table 3.

Quantiles of DPj¯ and DPjM values of the 10 human genomes dataset


表 3.10 个人类基因组数据集的 DPj¯DPjM 值的数量级
Quantiles 定量0.500.900.950.990.9990.9999
DPj¯107588238295 019298 598515 006
DPjM176011 872238 3681 925 6003 232 8323 387 040
Quantiles0.500.900.950.990.9990.9999
DPj¯107588238295 019298 598515 006
DPjM176011 872238 3681 925 6003 232 8323 387 040
Table 3.

Quantiles of DPj¯ and DPjM values of the 10 human genomes dataset


表 3.10 个人类基因组数据集的 DPj¯DPjM 值的数量级
Quantiles 定量0.500.900.950.990.9990.9999
DPj¯107588238295 019298 598515 006
DPjM176011 872238 3681 925 6003 232 8323 387 040
Quantiles0.500.900.950.990.9990.9999
DPj¯107588238295 019298 598515 006
DPjM176011 872238 3681 925 6003 232 8323 387 040

We also run deBWT with 8, 16, 24 threads to investigate its scalability. The results ( Table 4 ) suggest that deBWT can gradually speedup with the increase of threads, i.e. it has good scalability. However, the speed of ParaBWT is nearly the same with the various settings on threads. This is likely owing to the incremental nature of the ParaBWT method, which may limit its performance on modern servers and clusters. The time of the various steps of deBWT with various numbers of threads is in Figure 3 . It indicates that the two core steps, de Bruijn branch encoding and ϕ· values generation and projection suffixes sorting (steps 4 and 5 in the figure), are most scalable steps, i.e. they speedup with the increasing number of threads. This property is beneficial for implementing the method with more computational resources.
我们还使用 8、16、24 个线程运行 deBWT,以研究其可扩展性。结果(表 4)表明,deBWT 可以随着线程数的增加而逐渐加速,即具有良好的可扩展性。然而,在不同的线程设置下,ParaBWT 的速度几乎相同。这可能是由于 ParaBWT 方法的增量性质,这可能会限制其在现代服务器和集群上的性能。不同线程数下 deBWT 各步骤的时间见图 3。从图中可以看出,de Bruijn 分支编码和 ϕ· 值生成以及投影后缀排序这两个核心步骤(图中的第 4 步和第 5 步)是可扩展性最强的步骤,即随着线程数的增加而加速。这一特性有利于使用更多的计算资源来实现该方法。

Fig. 3. 图 3.
 Time consumption of the various steps of deBWT. The bars respectively indicate the elapsed time (in minutes) of the various steps of deBWT for the 10 human genomes dataset ( a ), the human genome contig dataset ( b ) and the 8 primate genomes dataset ( c ). Bars in the same color correspond to a specific number of threads, i.e. blue, red, green and purple bars are respectively for 8, 16, 24 and 32 threads
Open in new tabDownload slide
在新标签中打开 下载幻灯片

Time consumption of the various steps of deBWT. The bars respectively indicate the elapsed time (in minutes) of the various steps of deBWT for the 10 human genomes dataset ( a ), the human genome contig dataset ( b ) and the 8 primate genomes dataset ( c ). Bars in the same color correspond to a specific number of threads, i.e. blue, red, green and purple bars are respectively for 8, 16, 24 and 32 threads
deBWT 各步骤的耗时。条形图分别表示 10 个人类基因组数据集(a)、人类基因组 contig 数据集(b)和 8 个灵长类动物基因组数据集(c)的 deBWT 各个步骤的耗时(分钟)。相同颜色的条形图对应特定的线程数,即蓝色、红色、绿色和紫色条形图分别代表 8、16、24 和 32 个线程。

Table 4.

Running time with various numbers of threads (in minutes)


表 4.不同线程数量下的运行时间(分钟)
Methods 方法8 threads 8 条线16 threads 16 条线24 threads 24 条线32 threads 32 线程
Human genomes 人类基因组
 deBWT194153142134
 deBWT (no conversion) deBWT (无转换)109685648
 ParaBWT265240240241
Human contigs 人类等位基因
 deBWT183154123129
vdeBWT (no conversion) vdeBWT(无转换)116865656
 ParaBWT294277276262
Primate genomes 灵长类动物基因组
 deBWT423355332330
 deBWT (no conversion) deBWT (无转换)193125105100
 ParaBWT196182181180
Methods8 threads16 threads24 threads32 threads
Human genomes
 deBWT194153142134
 deBWT (no conversion)109685648
 ParaBWT265240240241
Human contigs
 deBWT183154123129
vdeBWT (no conversion)116865656
 ParaBWT294277276262
Primate genomes
 deBWT423355332330
 deBWT (no conversion)193125105100
 ParaBWT196182181180

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。

Table 4.

Running time with various numbers of threads (in minutes)


表 4.不同线程数量下的运行时间(分钟)
Methods 方法8 threads 8 条线16 threads 16 条线24 threads 24 条线32 threads 32 线程
Human genomes 人类基因组
 deBWT194153142134
 deBWT (no conversion) deBWT (无转换)109685648
 ParaBWT265240240241
Human contigs 人类等位基因
 deBWT183154123129
vdeBWT (no conversion) vdeBWT(无转换)116865656
 ParaBWT294277276262
Primate genomes 灵长类动物基因组
 deBWT423355332330
 deBWT (no conversion) deBWT (无转换)193125105100
 ParaBWT196182181180
Methods8 threads16 threads24 threads32 threads
Human genomes
 deBWT194153142134
 deBWT (no conversion)109685648
 ParaBWT265240240241
Human contigs
 deBWT183154123129
vdeBWT (no conversion)116865656
 ParaBWT294277276262
Primate genomes
 deBWT423355332330
 deBWT (no conversion)193125105100
 ParaBWT196182181180

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。

We further run deBWT on the in silico human genome dataset with various configurations on the k parameter to investigate its effect Table 5 . It can be observed from the result that, on a large range of k parameters, i.e. k = 23–31, the total running time is close, but for smaller k parameter, the time consumption is higher. This is likely owing to the fact that the moderate long k-mers (such as 23- to 31-mers) may have similar ability to span short repeats. In this situation, the structure of the dBG does not change much with these k configurations, i.e. there are similar numbers of unipaths as well as their copies in the graph. However, when k is smaller, the unipaths will be shorter and have more copies, which would make the de Bruijn branch encoding longer and more projection suffixes need to be sorted. k > 32 could have better ability to span repeats, which may improve the overall performance; however, it requires much more RAM space, as a k-mer cannot be stored by one 64-bits cell.
我们进一步在硅人类基因组数据集上以不同的 k 参数配置运行 deBWT,以研究其影响(见表 5)。从结果可以看出,在较大的 k 参数范围内(即 k = 23-31),总运行时间接近,但对于较小的 k 参数,耗时较多。这可能是由于中等长度的 k-分子(如 23 至 31-分子)可能具有类似于跨越短重复序列的能力。在这种情况下,dBG 的结构并不会因为这些 k 配置而发生太大变化,即图中的单路径及其副本数量相似。然而,当 k 较小时,单链路会更短,副本会更多,这将使 de Bruijn 分支编码更长,需要排序的投影后缀也更多。

Table 5.

Running time of the in silico human genome dataset with various configurations on the k parameter (in minutes)


表 5.不同 k 参数配置下的硅学人类基因组数据集的运行时间(分钟)
Methods 方法k = 19k = 23k = 27k = 31
deBWT142124131134
deBWT (no conversion) deBWT (无转换)75514748
Methodsk = 19k = 23k = 27k = 31
deBWT142124131134
deBWT (no conversion)75514748

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。

Table 5.

Running time of the in silico human genome dataset with various configurations on the k parameter (in minutes)


表 5.不同 k 参数配置下的硅学人类基因组数据集的运行时间(分钟)
Methods 方法k = 19k = 23k = 27k = 31
deBWT142124131134
deBWT (no conversion) deBWT (无转换)75514748
Methodsk = 19k = 23k = 27k = 31
deBWT142124131134
deBWT (no conversion)75514748

‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。

The memory footprint of deBWT ( Table 6 ) depends on both of the method itself and the used k-mer counting tool. The memory usage of Jellyfish and KMC2 is highly configurable, and we set them to use relatively large memory to accomplish the k-mer counting step as fast as possible. The major RAM costs of the three phases of deBWT are different. In the first phase, the major cost originates from the data structure of k-mer sorting. Briefly, deBWT uses a linear table like PdES to bin all the k-mers; however, each cell of the table costs 16 bytes to record the string of the k-mer as well as its number of copies. The cost of the second phase is more complicated. It needs to simultaneously keep the input sequences, the de Bruijn branch index and the generated de Bruijn branch encoding in memory. Thus, the memory usage mainly depends on several issues, i.e. the size of the input sequence(s), the numbers of multiple-out and -in k-mers and the numbers of the copies of the multiple-out and -in k-mers. The last two items respectively determine the length of de Bruijn branch encoding and the number of unsolved suffixes, which need to record in memory. The numbers of the multiple-out and -in k-mers and their copies highly relate to the repetitiveness of the input genomes. We did statistics on the two human datasets (as they are more repetitive), and observed two issues ( Table 7 ).
deBWT 的内存占用(表 6)取决于方法本身和使用的 k-mer 计数工具。Jellyfish 和 KMC2 的内存使用是高度可配置的,我们将它们设置为使用相对较大的内存,以尽可能快地完成 k-mer计数步骤。deBWT 三个阶段的主要内存成本是不同的。在第一阶段,主要成本来自 k-mer排序的数据结构。简而言之,deBWT 使用 PdES 这样的线性表来对所有 k-聚合物进行分类;但是,该表的每个单元需要花费 16 个字节来记录 k-聚合物的字符串及其副本数。第二阶段的成本更为复杂。它需要同时将输入序列、de Bruijn 分支索引和生成的 de Bruijn 分支编码保存在内存中。因此,内存使用量主要取决于几个问题,即输入序列的大小、多出和-入 k-mers 的数量以及多出和-入 k-mers 的副本数量。后两项分别决定了需要记录在内存中的 de Bruijn 分支编码长度和未解决后缀的数量。多出k-mer和-in k-mer及其副本的数量与输入基因组的重复性有很大关系。我们对两个人类数据集(因为它们的重复性更高)进行了统计,发现了两个问题(表 7)。

Table 6.

Memory footprints with 32 CPU cores (in Gigabytes)


表 6.32 个 CPU 内核的内存占用空间(单位:千兆字节)
Methods 方法Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
deBWT120/78/38120/63/34235/203/58
ParaBWT303029
RopeBWT2302440
Supplement
KMC2119119119
MethodsHuman genomesHuman contigsPrimate genomes
deBWT120/78/38120/63/34235/203/58
ParaBWT303029
RopeBWT2302440
Supplement
KMC2119119119

For the ‘x/y/z’ of deBWT in the memory columns, the x, y and z values respectively indicate the memory footprints of Jellyfish, phase1 of deBWT, and phases2 and phases3 of deBWT.
对于内存列中 deBWT 的 "x/y/z",x、y 和 z 值分别表示 Jellyfish、deBWT 第一阶段、deBWT 第二阶段和第三阶段的内存足迹。

Table 6.

Memory footprints with 32 CPU cores (in Gigabytes)


表 6.32 个 CPU 内核的内存占用空间(单位:千兆字节)
Methods 方法Human genomes 人类基因组Human contigs 人类等位基因Primate genomes 灵长类动物基因组
deBWT120/78/38120/63/34235/203/58
ParaBWT303029
RopeBWT2302440
Supplement
KMC2119119119
MethodsHuman genomesHuman contigsPrimate genomes
deBWT120/78/38120/63/34235/203/58
ParaBWT303029
RopeBWT2302440
Supplement
KMC2119119119

For the ‘x/y/z’ of deBWT in the memory columns, the x, y and z values respectively indicate the memory footprints of Jellyfish, phase1 of deBWT, and phases2 and phases3 of deBWT.
对于内存列中 deBWT 的 "x/y/z",x、y 和 z 值分别表示 Jellyfish、deBWT 第一阶段、deBWT 第二阶段和第三阶段的内存足迹。

Table 7.

Statistics on the in silico human genomes and contigs


表 7.硅学人类基因组和等位基因的统计数据
Statistics 统计资料Human genomes 人类基因组Human contigs 人类等位基因
length of input sequences
输入序列长度
3095543637130200003020
distinct k-mers 不同的 k 元素50737306694025285321
multiple-out k-mers 多层 K1882076317238123
multiple-in k-mers 复式K线1882180517237511
copies of multiple-out k-mers
多出 k-mer 的副本
23640046172301293218
copies of multiple-in k-mers
复本
23644457112300904807
StatisticsHuman genomesHuman contigs
length of input sequences3095543637130200003020
distinct k-mers50737306694025285321
multiple-out k-mers1882076317238123
multiple-in k-mers1882180517237511
copies of multiple-out k-mers23640046172301293218
copies of multiple-in k-mers23644457112300904807
Table 7.

Statistics on the in silico human genomes and contigs


表 7.硅学人类基因组和等位基因的统计数据
Statistics 统计资料Human genomes 人类基因组Human contigs 人类等位基因
length of input sequences
输入序列长度
3095543637130200003020
distinct k-mers 不同的 k 元素50737306694025285321
multiple-out k-mers 多层 K1882076317238123
multiple-in k-mers 复式K线1882180517237511
copies of multiple-out k-mers
多出 k-mer 的副本
23640046172301293218
copies of multiple-in k-mers
复本
23644457112300904807
StatisticsHuman genomesHuman contigs
length of input sequences3095543637130200003020
distinct k-mers50737306694025285321
multiple-out k-mers1882076317238123
multiple-in k-mers1882180517237511
copies of multiple-out k-mers23640046172301293218
copies of multiple-in k-mers23644457112300904807

First, for both the datasets, the numbers of multiple-out and -in k-mers are much less than S , i.e. the number of characters of the input sequences. Thus, the cost of the hash table is not expensive comparing with the entire input sequences. Moreover, it is also worth noting that for highly similar genomes, the increment of the numbers of multiple-out and -in k-mers would be much smaller comparing with the increment of involved genomes, as there are many common sequences and they would not introduce new branches into the dBG.
首先,对于这两个数据集来说,多出和-入 k-mers 的数量都远远小于 S ,即输入序列的字符数。因此,与整个输入序列相比,哈希表的成本并不昂贵。此外,值得注意的是,对于高度相似的基因组,多出和-入 k-mer数目的增量与涉及基因组的增量相比要小得多,因为有许多共同的序列,它们不会给 dBG 带来新的分支。

Second, the numbers of the copies of multiple-out and -in k-mers are also an order lower than S , although human genomes are repetitive. In this situation, the de Bruijn branch encoding can be seen as a DNA sequence an order shorter than S , so that the space cost is not large. The major cost originates from the copies of multiple-in k-mers, as it needs to record the ϕ· value and the BWT character with a few bytes for each copy.
其次,虽然人类基因组具有重复性,但多出和-入 k-mers的拷贝数也比 S 低一个数量级。在这种情况下,德布鲁因分支编码可视为比 S 短一个数量级的 DNA 序列,因此空间成本并不大。主要的成本来自多入 k-mers 的拷贝,因为每个拷贝需要用几个字节记录 ϕ· 值和 BWT 字符。

The RAM cost of the third step is also similar to that of the second phase. To sort the projection suffixes, it needs to keep the de Bruijn branch encoding and the ϕ· values and the BWT characters of the copies of the multiple-in k-mers in RAM.
第三步的 RAM 成本也与第二阶段类似。为了对投影后缀进行排序,需要在 RAM 中保留 de Bruijn 分支编码、 ϕ· 值以及多入 k-mers 副本的 BWT 字符。

4 Discussion 4 讨论

The well organization and indexing of many genomes will be on wide demand in future genomics studies, with the rapid increase of assembled genomes. As an important genome indexing data structure, BWT may have many applications; however, the construction of BWT for a large collection of genomes, especially highly similar re-sequenced genomes (e.g. many human individual genomes), is still a non-trivial task. Moreover, owing to the incremental nature of the state-of-the-art methods, it is hard to construct BWT with scalable parallel computing. This is a bottleneck to fully use the computational resources of modern servers or clusters to handle large amount of data.
随着组装基因组的迅速增加,对许多基因组进行良好的组织和索引将是未来基因组学研究的广泛需求。作为一种重要的基因组索引数据结构,BWT 可能会有很多应用;然而,为大量基因组,尤其是高度相似的重测序基因组(如许多人类个体基因组)构建 BWT 仍然是一项非同小可的任务。此外,由于最先进方法的增量性质,很难通过可扩展的并行计算来构建 BWT。这是充分利用现代服务器或集群的计算资源处理大量数据的瓶颈。

We propose deBWT, a novel parallel BWT construction approach, to break the bottleneck. The main contribution of deBWT is its dBG-based representation and organization of suffixes, which facilitates the comparison of suffixes with long common prefixes and avoid unnecessary comparisons. Moreover, owing to its non-incremental design, deBWT has good scalability to various computational resources. These properties make deBWT well-suited to construct BWT for large collections of highly similar or repetitive genomes with modern servers or clusters. In the experiments, deBWT achieves a substantial improvement on the speed of indexing multiple individual human genomes and contigs. For more diverse genomes, e.g. multiple primate genomes, deBWT also shows faster speed and better parallelization; however, the improvement is smaller, likely owing to that the density of the