
ACO: lossless quality score compression based on adaptive coding order

Yi Niu, Xidian University
Mingming Ma, Xidian University
Fu Li, Xidian University
Xianming Liu, Peng Cheng Laboratory
Guangming Shi, Xidian University

Research Article

Keywords: High-throughput sequencing, quality score compression, lossless compression, adaptive coding order
Posted Date: April 22nd, 2021
DOI: https://doi.org/10.21203/rs.3.rs-418072/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

ACO: lossless quality score compression based on adaptive coding order

Yi Niu, Mingming Ma, Fu Li, Xianming Liu and Guangming Shi

Abstract

Background: With the rapid development of high-throughput sequencing technology, the cost of whole genome sequencing has dropped rapidly, leading to exponential growth of genome data. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging.

Results: In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlative trajectory according to the sequencing process. By cooperating with adaptive arithmetic coding and context modeling, ACO achieves state-of-the-art quality score compression performance with moderate complexity.

Conclusions: This competence enables ACO to serve as a candidate tool for quality score compression. ACO has been adopted by AVS (the Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/code.

Keywords: High-throughput sequencing; quality score compression; lossless compression; adaptive coding order

Background

Sequencing technology has gradually become a basic technology widely used in biological research [1]. Obtaining the genetic information of different organisms can improve our understanding of the organic world. In the past decades, the price of human whole genome sequencing (WGS) has dropped to less than $1,000, declining faster than Moore's law would predict [2]. Consequently, the amount of next-generation sequencing (NGS) data grows exponentially and even exceeds that of astronomical data [3]. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry.
There are two major problems in the compression of DNA data: nucleotide compression and quality score compression. The quality scores take up more than half of the compressed data and have been shown to be more difficult to compress than the nucleotide data. Especially with the development of assembling techniques, nucleotide compression has achieved significant improvements, which makes quality score compression one of the main bottlenecks in current DNA data storage and transfer applications.
The quality score (QS) represents the confidence level of every base character in the sequencing procedure, but with a much larger alphabet (41-46 distinct levels). [7] reveals that there are strong correlations among adjacent quality scores, which can be regarded as the foundation of the current lossless quality score compression pipeline: 1) use a Markov model to estimate the conditional probability of the quality score; 2) traverse every position of the reads in raster scan order; 3) encode the quality score via arithmetic or range coding.
Based on the above pipeline, three distinguished lossless compressors have been proposed: GTZ [8], Quip [9] and FQZcomp [5]. The only differences among these three works are the Markov model orders and the context quantization strategies, so their compression ratios vary within a narrow range depending on the data distribution. A negative view is unavoidably raised that there is not much room for further improvement of the lossless compression ratio.
In this paper, by reinvestigating the sequencing process, we reveal two main drawbacks of the existing raster-scan-based quality score compression strategy. Firstly, the raster scan order is a "depth-first" traversal of the reads. However, as indicated in [7], the quality scores have a descending trend along a single read, which makes the piece-wise stationary assumption of Markov modeling untenable. Secondly, the sequencing process is conducted by multi-spectral imaging, but the FASTQ file simply stores the quality scores as a stack of 1D signals. Raster-scan-based techniques compress every read independently, which fails to explore the potential 2D correlations among spatially adjacent reads (not the adjacent reads in the FASTQ file).
To overcome the above two drawbacks, we propose a novel quality score compression technique based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores along the most correlative directions, which can be regarded as reorganizing the stack of independent 1D quality score vectors into highly related 2D matrices. Another improvement of ACO over existing techniques is its compound context modeling strategy. As detailed in the Methods section, besides the adjacent QS values, the ACO context models consist of two additional aspects: 1) the global average of every read; 2) the variation of the DNA bases. The compound context model not only benefits probability estimation and arithmetic coding; more importantly, it prevents ACO from requiring multiple random accesses to the input FASTQ file: the compression can be accomplished in a single pass, at the cost of some context dilution and side information.
Experimental results show that the proposed ACO technique achieves state-of-the-art performance for lossless quality score compression, with clear gains in compression ratio over FQZcomp [5]. The only drawback of ACO is its memory cost: compared with FQZcomp, ACO requires additional memory to buffer the quality score matrices and to store the compound context models, which should no longer be a big problem for a current PC.

Declaration

In this section, we first analyze the data characteristics of the quality score and illustrate, through a specific example, that the coding order has a notable impact on quality score compression, which motivates us to compress the quality scores along the direction of strongest data correlation. Secondly, by analyzing the sequencing principle and the generation process of the FASTQ file, we explore the extra relevance in the quality score data to build a novel composite context quantization model.
Impact of coding order
The quality score represents an estimate of the probability that the corresponding nucleotide in a read was called in error; it evaluates the reliability of the base character. This information is used both for quality control of the original data and for downstream analysis. Fig. 1 shows the quality score distributions of four reads of ERR2438054: owing to noise, the quality score is a random and unstable signal, yet there is a strong correlation between adjacent quality scores. We can therefore exploit these characteristics and change the coding order to improve the compression ratio. Changing the order does not sound like it should change the entropy, because according to information theory the information content of a source is a function of the symbol probabilities, represented by the source entropy. However, since an adaptive arithmetic encoder is used, the encoder updates the symbol probabilities regularly, so changing the order can reduce the size of the bitstream. A discussion of the arithmetic coder itself is beyond the scope of this paper, so we only give a test experiment showing the influence of coding order on the compression result. Firstly, we create two random signals $X$ and $Y$ with different distributions and let $Z_1 = [X, Y]$. Then we randomly disturb the order of $Z_1$ and record it as $Z_2$, and sort $Z_1$ by magnitude and record it as $Z_3$. Finally, the three signals are encoded by a 0-order adaptive arithmetic encoder, and the resulting bitstream sizes satisfy $|Z_3| < |Z_1| < |Z_2|$. This is because sorting places data with similar distributions and strong correlation together, so the reordered input cooperates better with the probability update mechanism of the adaptive arithmetic encoder.
Fig. 1: Quality score distribution curve of ERR2438054
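The effect is easy to reproduce. The following minimal Python sketch (ours, not the released ACO code) mimics an adaptive order-0 arithmetic coder by summing the ideal code lengths -log2 p(s) under a count-based model; the periodic count halving, which is standard in practical coders, is the assumption that makes the coding order matter at all, since a never-rescaled adaptive model is order-independent:

    # Sketch: estimate the bitstream size an adaptive order-0 arithmetic coder
    # would produce for the same symbols in three different orders.
    import math
    import random

    def adaptive_code_length(symbols, alphabet_size=64, max_total=4096):
        counts = [1] * alphabet_size          # Laplace-smoothed frequency table
        total = alphabet_size
        bits = 0.0
        for s in symbols:
            bits += -math.log2(counts[s] / total)
            counts[s] += 1                    # probability update after each symbol
            total += 1
            if total >= max_total:            # periodic count halving, as real
                counts = [(c + 1) // 2 for c in counts]  # coders do; this is what
                total = sum(counts)           # makes the coding order matter
        return bits

    random.seed(0)
    x = [random.randint(30, 40) for _ in range(50000)]
    y = [random.randint(5, 15) for _ in range(50000)]
    z1 = x + y                                # two signals with different distributions
    z2 = random.sample(z1, len(z1))           # randomly disturbed order
    z3 = sorted(z1)                           # sorted by magnitude

    for name, z in [("Z1", z1), ("Z2 (shuffled)", z2), ("Z3 (sorted)", z3)]:
        print(f"{name}: {adaptive_code_length(z) / 8 / 1024:.1f} KiB")
    # Typically |Z3| < |Z1| < |Z2|: grouping similar symbols cooperates with the
    # probability update mechanism, although the order-0 entropy of the multiset
    # of symbols is unchanged.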

Mining more relevant information

Take the widely used HiSeq sequencing platform as an example: the sequencing process consists of three steps: 1) construction of the DNA library, 2) generation of DNA clusters by bridge PCR amplification, and 3) sequencing. In this paper we reinvestigate the sequencing step to mine more inherent correlations among the quality scores to aid the compression task. The basic principle of sequencing is multi-spectral imaging of the flowcell. The flowcell is a carrier for sequencing; each flowcell has eight lanes with chemically modified inner surfaces, each lane contains 96 tiles, and each tile has a unique cluster. Every cluster corresponds to one DNA patch, so one flowcell can generate 768 DNA reads simultaneously.
As shown in Fig. 2, the sequencing process consists of five steps. In step 1, the polymerase and one type of dNTP are added into the flowcell to activate the fluorescence of the specific clusters. In step 2, the multispectral camera takes one shot of the flowcell at the wavelength corresponding to the added dNTP. In step 3, chemical reagents wash out the flowcell to prepare for the next imaging. These three steps are repeated four times with different dNTPs and different imaging wavelengths to obtain a four-channel multi-spectral image. In step 4, based on the captured four-channel image, the sequencing machine not only estimates the most likely base type of every cluster but also evaluates the confidence level of that estimate; these are stored as the bases and the quality scores, respectively. Steps 1-4 constitute one sequencing cycle, which sequences one position (depth) of all 768 reads in the flowcell. In step 5, the sequencing cycle is repeated several times, and the number of cycles corresponds to the length of the reads.
As we discuss in detail below, three factors correlate with the quality score values: 1) the number of cycles, 2) base changes, and 3) the local position on the chip.
The number of cycles affects the distribution of quality scores. DNA polymerases are used during synthesis and sequencing; at the beginning of sequencing the synthesis reaction is not very stable, but the quality of the enzyme is very good, so the scores fluctuate in the high-quality region. As sequencing progresses, the reaction stabilizes, but the enzyme activity and specificity gradually decrease, so the cumulative error is gradually amplified. As a result, the probability of error increases and the overall quality score shows a downward trend. As shown in Fig. 3, as sequencing progresses, the mean of the quality scores decreases gradually while the variance increases. Therefore, it is improper to treat every read as a stationary random signal along the traditional raster scan order.
Base changes also affect the distribution of quality scores. As discussed before, the recognition of base types in a flowcell is conducted in a four-step loop over the dNTPs and wavelengths. For example, assume the loop order is 'A-C-G-T'. If the bases of a read are '...AA...', after the imaging of the first 'A' the flowcell is washed four times until the imaging of the second 'A'; but if the bases are '...TA...', the machine washes the flowcell only once before the imaging of 'A'. Consequently, if the cluster contains some residuals, the former 'A' base will affect the imaging of the latter 'A' base, which may cause ambiguity and make the quality score of that 'A' drop significantly. Although some machines adopt compound dNTPs to replace the four-step loop, residuals still affect the quality score. Therefore, for quality score compression, base changes should be considered as side information when modeling the marginal probability of every quality score.
The local position on the chip affects the distribution of quality scores. The flowcell can be regarded as a 2D array in which every cluster corresponds to an entry. The fluorescence of a high-amplitude entry may diffuse into the adjacent entries of the array, the well-known "cross-talk" phenomenon [11]. In other words, there are spatial correlations among adjacent quality scores. However, the stored FASTQ file is a 1D stack of all the reads, which ignores this correlation. Therefore, quality score compression should mine the potential 2D spatial correlations among the reads.
Fig. 3: Distribution of quality scores in a FASTQ file

Methods

In this section, we discuss the proposed adaptive coding order (ACO) based quality score compression technique. The two contributions of ACO are: 1) an adaptive scan order that replaces the traditional raster scan order and forms a more stationary signal; 2) a compound context model that considers the influence of base changes while exploring the potential 2D correlations among quality scores.

Traversing the quality score along the most correlative directions

As can be seen from Fig. 3, with increasing position along the reads the column mean decreases but the variance grows, which indicates a strong correlation between columns. The reduction of the column mean is also consistent with the actual sequencing process, in which the quality scores have a descending trend along a single read. It has been verified that changing the scan order can improve the performance of the adaptive arithmetic encoder, and coding along the more stable signal yields a better coding effect.
Compression methods based on an arithmetic encoder traverse the data with the scan order of Fig. 4(a). Under this scan, quality scores are encoded line by line; after scanning a line, encoding resumes at the beginning of the next line. Obviously, connecting the last character of one line to the first character of the next causes a large jump, and this jump makes the transition between symbols unstable. We therefore use an adaptive scan order in place of the traditional raster scan, realizing a stable traversal of the signal, as shown in Fig. 4(b). The scan starts at the first element and traverses down the column to its end, then traverses upward from the end of the next column. Unlike the traditional scan, ACO adopts a snake-shaped scan. The reason for snake traversal is to make the transition between columns smoother: the end symbols of one column are more relevant to the end symbols of the next column, and the correlation between the red and green symbols in Fig. 4 is clearly stronger than that between the red and blue symbols. Therefore, after the red symbol is encoded, it is more appropriate to continue the second column from the green symbol rather than the blue one. By changing the scan order, encoding proceeds along a more stationary direction, and the probability update mechanism of the adaptive arithmetic encoder is fully exploited without introducing other factors.
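For concreteness, here is a sketch of the snake order of Fig. 4(b) as a generator over an N-by-K quality score matrix (our illustration; the index conventions are assumptions, not the released implementation):

    # Sketch of the snake (boustrophedon) scan: quality scores are buffered as
    # an N-by-K matrix (N reads of length K) and traversed column by column,
    # reversing direction on every other column so that consecutive symbols
    # stay spatially adjacent.
    from typing import Iterator, Tuple

    def snake_scan(n_reads: int, read_len: int) -> Iterator[Tuple[int, int]]:
        """Yield (row, col) indices in adaptive (snake) coding order."""
        for k in range(read_len):             # columns = sequencing cycles
            rows = range(n_reads) if k % 2 == 0 else range(n_reads - 1, -1, -1)
            for n in rows:
                yield n, k

    # Example on a 3x4 matrix: the jump between columns is at most one row,
    # instead of the length-K jump of the raster scan.
    print(list(snake_scan(3, 4)))
    # [(0,0), (1,0), (2,0), (2,1), (1,1), (0,1), (0,2), ...]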
Fig. 4: Comparison of traditional scanning and ACO scanning: (a) traditional traversal method; (b) adaptive scan order

Compound context modeling
As the Declaration section explains, quality score compression should mine the potential 2D spatial correlations among the reads, so we use a compound context model to express the extra relevance in the quality score data. The ACO context model contains two additional aspects, the first of which is the global average of each read. From the example in the Declaration, adjusting the data order so that similar symbols cluster together yields good compression results. As shown in Fig. 1, the distribution curves of the four reads are very similar, and only a few singular points differ. Clustering and coding rows with similar distributions is therefore an attractive strategy, but computing the distribution of every row takes many steps, and clustering similar rows also costs time and space. Instead, we compute the mean of each row to reflect its distribution and group rows with the same mean value. For row information, the mean is a measure of stationarity: rows with the same mean can be regarded as having essentially the same distribution over the whole row, even though singular points may keep the distribution curves from coinciding exactly. Compared with computing the Kullback-Leibler divergence between rows, using the row mean saves considerable computation and time without discarding the correlation between rows. A row clustering method would need to transmit extra information to the decoder to record the change of row order; using the mean faces the same problem, since the mean of each line must also be transmitted to the decoder. In practice, we compare this extra coding cost against the actual gain brought by the row mean; when the gain is larger, we add the row mean information to the context.
Specifically, building the context model raises the problem of context dilution, so we need to design a suitable quantization method for the mean value to resolve it. This is a dynamic programming problem, and the optimization objective of quantizing a discretely distributed random variable is to minimize the distortion. The expression for the objective is:

$$\min_{q}\; \sum_{i=1}^{N} p(x_i)\, d\big(x_i, q(x_i)\big) \qquad (1)$$

where $x_i$ is a value that has nonzero probability, $q(x_i)$ is the quantization value of $x_i$, and $d(\cdot,\cdot)$ is a specific distortion measure. We can define a condition set $C=\{x_1,\dots,x_N\}$ to indicate that each specific $x_i$ corresponds to a specific quantization value. Define a quantized set $\mathcal{Q}=\{Q_1,\dots,Q_M\}$, where $Q_m$ represents the quantized variable, so $q(x_i)=Q_m$ if $x_i\in S_m$. Therefore, each subset $S_m$ corresponds to a partition of $C$, with $\bigcup_{m=1}^{M}S_m=C$. The expressions for $S_m$ and $Q_m$ are as follows:

$$S_m=\{x_i \mid q(x_i)=Q_m\} \qquad (2)$$

$$Q_m=\arg\min_{y}\sum_{x_i\in S_m} p(x_i)\, d(x_i,y) \qquad (3)$$

It can be seen that the subsets $S_1$ to $S_M$ together contain all the $x_i$, so $Q_m$ can be regarded as the quantized value of the whole subset $S_m$. Following Equ. (1), we obtain the context quantization objective:

$$J(S_1,\dots,S_M)=\sum_{m=1}^{M}\sum_{x_i\in S_m} p(x_i)\, d(x_i,Q_m) \qquad (4)$$

where $J$ is the quantization objective function, and minimizing $J$ means obtaining an at least locally optimal quantization result. The optimization objective of context quantization becomes: for a given number of quantization levels $M$, find an optimum partition scheme $\{S_m\}$ for $C$, then calculate the optimum quantization values $Q_m$ for all $M$ partitions so that Equ. (4) is minimized.
By solving this dynamic programming problem, we obtain a result that suits the test data. To improve computational efficiency in the actual compression process, we use this precomputed result as the quantization method. If necessary, to improve the compression ratio for a specific file, users can solve the optimization problem separately and obtain the best quantization for it. Our method of quantizing the row characteristics is shown in Fig. 5.
Fig. 5: The way of row mean quantization
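A minimal sketch of such a quantizer follows, assuming contiguous partitions of the sorted mean values and a squared-error distortion measure (both are our assumptions; as noted above, the paper precomputes the result offline for efficiency):

    # Dynamic programming over contiguous partitions of the (sorted) row-mean
    # values, minimizing the total weighted squared-error distortion of Equ. (4).
    from functools import lru_cache

    def optimal_quantizer(values, probs, levels):
        """Partition sorted `values` (with probabilities `probs`) into
        `levels` contiguous bins; return (min distortion, bin boundaries)."""
        n = len(values)

        def bin_cost(i, j):               # distortion of one bin values[i:j]
            p = sum(probs[i:j])
            if p == 0:
                return 0.0
            q = sum(v * w for v, w in zip(values[i:j], probs[i:j])) / p  # centroid
            return sum(w * (v - q) ** 2 for v, w in zip(values[i:j], probs[i:j]))

        @lru_cache(maxsize=None)
        def dp(i, m):                     # best split of values[i:] into m bins
            if m == 1:
                return bin_cost(i, n), (n,)
            best = None
            for j in range(i + 1, n - m + 2):  # leave room for remaining bins
                cost, cuts = dp(j, m - 1)
                cand = (bin_cost(i, j) + cost, (j,) + cuts)
                if best is None or cand[0] < best[0]:
                    best = cand
            return best

        return dp(0, levels)

    # Toy example: quantize row means 0..40 with a skewed distribution into 4 bins.
    vals = list(range(41))
    ps = [1.0 / (1 + abs(v - 35)) for v in vals]   # most rows have high means
    total = sum(ps); ps = [p / total for p in ps]
    print(optimal_quantizer(vals, ps, 4))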
It can be seen that the quantization here is very similar to lossy compression with thresholds. The difference is that we quantize the row characteristics without affecting lossless decoding: we only extract the correlation features between rows and use dynamic programming to obtain a better result. The final quantization rule also takes the current value into account.
On the other hand, the distribution of quality scores is random, and the waveform does not transition smoothly under the influence of the quality score values. So rather than modeling the waveform, we can start from the sequencing principle for these singular points. [12] reveals that the quality scores of adjacent bases are usually similar and that the probability distribution of the quality score is affected by the base distribution. Considering the natural similarities in the process of obtaining nucleotide sequences, the base distribution is regarded as a criterion for whether there is a singular point in the quality score, which is used to model the stationarity between two symbols. Increasing the base order makes the model grow exponentially; balancing model size against the rate of improvement, we choose the second order to describe the correlation between bases and quality scores. In a FASTQ file, each quality score corresponds to a base, and the conditional entropy is

$$H(Q\mid B) = -\sum_{b}\sum_{q} p(q,b)\log_2 p(q\mid b)$$

where $B$ is the base value corresponding to the current quality score $Q$; this formula shows the influence of the base on the entropy. After synthesizing all the context models, we provide the final composite context modeling strategy in Fig. 6 and the ACO algorithm in Algorithm 1.
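As a quick illustration of why the base context helps (our sketch; the toy values echo the example in Fig. 6), the conditional entropy H(Q | B) can be estimated from data; it is never larger than H(Q), and the gap is the coding gain the base context can contribute:

    # Estimate H(Q | B) of quality scores given the co-located base,
    # using H(Q|B) = H(Q,B) - H(B).
    import math
    from collections import Counter

    def entropy(counter: Counter) -> float:
        n = sum(counter.values())
        return -sum(c / n * math.log2(c / n) for c in counter.values())

    def conditional_entropy(quals, bases):
        joint = Counter(zip(bases, quals))    # empirical p(q, b)
        marg = Counter(bases)                 # empirical p(b)
        return entropy(joint) - entropy(marg)

    quals = [33, 35, 38, 34, 35, 36, 35]
    bases = ['A', 'C', 'T', 'G', 'C', 'A', 'T']
    print(conditional_entropy(quals, bases))  # <= H(Q); equal iff independent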
Fig. 6: Composite context modeling strategy

Results and discussion

In this section, we compare the performance of our algorithm ACO with other state-of-the-art algorithms and report the results. We compare against general purpose compressors such as gzip and 7-Zip, as well as a set of domain-specific algorithms, namely GTZ, fqzcomp and Quip. We restrict our focus to lossless compression and have evaluated neither the promising lossy methods nor methods only capable of compressing nucleotide sequences. For fqzcomp, we compare the results of the q1 and q2 compression modes; GTZ does not report the quality score compression results separately, so we compare its normal mode. It is important to note that ACO has more advantages in compressing aligned quality scores and does not accept any input other than a raw FASTQ file. Likewise, we do not compare against algorithms that accept a reference genome. The datasets used in our experiments were downloaded in FASTQ format from the National Center for Biotechnology Information - Sequence Read Archive (NCBI-SRA) database [13] and are presented in Table 1.
All the experiments were run on a server with an Intel Core i9-9900K CPU, RAM, 2TB of disk space and Ubuntu 16.04. All algorithms are compared in terms of compression ratio (CR) and bits per quality value (BPQ).
Algorithm 1: ACO algorithm framework
Input: A FASTQ file.
Output: The compressed quality score file.
    STEP 1: Data preprocessing
    1. Use an N x K matrix Q to store the quality scores of the FASTQ file.
    2. Use an N x K matrix P to store the base values of the FASTQ file.
    3. Scan Q for the maximum symbol symbol_max and the minimum symbol
       symbol_min; symbol_number = symbol_max - symbol_min + 1.
    STEP 2: Composite context model
    1. Calculate model_num by model_num = symbol_number^3.
    2. Use an N x 1 vector M to store the mean quality value of each line.
    for all 0 <= n < N do
        for all 0 <= k < K do
            A = M(n, 1)
            B = max[Q(n, k-1), Q(n, k-2)]
            C = max[Q(n, k-3), Q(n, k-4)]
            if Q(n, k-3) == Q(n, k-4) then D = 0 else D = 1
            E = [P(n, k), P(n, k-1)]
            model_idx = [A, B, C, D, E]
        end for
    end for
    STEP 3: Snake traversal coding
    for all 0 <= k < K do
        for all 0 <= n < N do
            if k % 2 == 0 then i = n else i = N - n - 1
            use the arithmetic encoder to compress Q(i, k) with model_idx
        end for
    end for
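A literal Python transcription of STEP 2 may make the compound context concrete (a sketch; the clamping of out-of-range indices to 0 and the placeholder base 'N' are our assumptions, not the released implementation):

    # Compound context of Q(n, k): quantized row mean A, two pairwise maxima
    # B and C of the four previous scores, a flatness flag D, and the
    # current/previous base pair E.
    def model_index(Q, P, M, n, k):
        def q(j):                          # previous quality scores, clamped
            return Q[n][j] if j >= 0 else 0
        A = M[n]                           # (quantized) mean of row n
        B = max(q(k - 1), q(k - 2))
        C = max(q(k - 3), q(k - 4))
        D = 0 if q(k - 3) == q(k - 4) else 1
        E = (P[n][k], P[n][k - 1] if k >= 1 else 'N')
        return (A, B, C, D, E)             # selects the probability model

    Q = [[33, 35, 38, 34, 35, 36, 35]]
    P = [['A', 'C', 'T', 'G', 'C', 'A', 'T']]
    M = [35]                               # row mean, already quantized
    print(model_index(Q, P, M, 0, 4))      # -> (35, 38, 35, 1, ('C', 'G'))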
The CR and BPQ are defined as follows:

$$\mathrm{CR}=\frac{S_c}{S_o}\times 100\%, \qquad \mathrm{BPQ}=\frac{8\,S_c}{S_o}$$

where $S_c$ indicates the compressed file size and $S_o$ indicates the size of the file before compression; since each quality value occupies one byte in ASCII, BPQ is simply 8 times the CR. The compression results of all algorithms on the NGS datasets are summarized in Table 2.
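As a sanity check of the two metrics, the NA12878_2 row of Tables 1 and 2 can be reproduced as follows (the compressed size is back-computed from the reported CR):

    # With S_o = 56,983,386,200 bytes of quality scores, a CR of 38.59%
    # implies BPQ = 8 * CR = 3.09 bits per quality value, matching Table 2.
    def metrics(compressed_bytes: int, original_bytes: int):
        cr = compressed_bytes / original_bytes * 100      # percent
        bpq = 8 * compressed_bytes / original_bytes       # bits per value
        return cr, bpq

    s_o = 56983386200
    s_c = round(s_o * 0.3859)     # compressed size implied by the reported CR
    print("CR=%.2f%%  BPQ=%.2f" % metrics(s_c, s_o))      # CR=38.59%  BPQ=3.09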
Table 2 reports the improvement of ACO relative to each comparison algorithm and reflects the advantage of our method in terms of compression ratio. Compared with Gzip, and with 7-Zip under its optimal setting, the average file size is reduced substantially. The results show that the proposed ACO algorithm achieves better results on the six representative datasets. In particular, ACO obtains an average compression ratio of about 30% (Table 2), i.e., an over 70% size reduction of the quality score data.
Table 1: Descriptions of 6 FASTQ datasets used for evaluation

    Run ID        Sequencing Platform    FASTQ Size (bytes)  Read Length  Quality Size (bytes)
    NA12878_2     BGISEQ-500             134363357648        -            56983386200
    ERR2438054_1  BGISEQ-500             133406591610        -            47097570000
    ERR174324_1   Illumina HiSeq 2000    57800970448         -            22580690796
    ERR174331_1   Illumina HiSeq 2000    57210954538         -            22350322320
    ERR174327_1   Illumina HiSeq 2000    54724344869         -            21379957043
    ERR174324_2   Illumina HiSeq 2000    57800970448         -            22580690796
Table 2: Compression results of all algorithms on the NGS data sets

    Run ID        Metric   gzip   gtz    quip   fqzcomp(q1)  fqzcomp(q2)  ACO
    NA12878_2     CR(%)    48.55  49.66  38.47  38.48        39.08        38.59
                  BPQ      3.88   3.97   3.08   3.08         3.13         3.09
    ERR2438054_1  CR(%)    46.23  47.11  37.09  36.52        37.08        36.71
                  BPQ      3.70   3.77   2.97   2.92         2.97         2.94
    ERR174324_1   CR(%)    36.58  36.94  25.47  26.14        27.30        25.81
                  BPQ      2.93   2.96   2.04   2.09         2.18         2.06
    ERR174331_1   CR(%)    36.55  36.91  25.45  26.11        27.27        25.77
                  BPQ      2.92   2.95   2.04   2.09         2.18         1.89
    ERR174327_1   CR(%)    35.53  35.88  24.56  25.31        26.39        24.90
                  BPQ      2.84   2.87   1.96   2.02         2.11         1.99
    ERR174324_2   CR(%)    38.47  38.81  27.07  27.52        28.89        27.37
                  BPQ      3.08   3.10   2.17   2.20         2.31         2.19
At the same time, the average BPQ is much smaller than the 8 bits per value of the original ASCII representation. Both evaluation criteria indicate that ACO achieves the best compression results among the compared methods on the same files. Moreover, according to the differences among sequencing platforms, the proposed ACO algorithm selects different modes and processing strategies, which makes the compression more efficient.

Conclusion and future works

This paper introduces ACO, a lossless quality score compression algorithm based on adaptive coding order. ACO traverses the quality scores along the most correlative directions and uses a compound context modeling strategy to achieve state-of-the-art lossless compression performance. However, the current ACO version, especially the proposed compound context modeling strategy, is designed for second-generation sequencing machines. For third-generation sequencing data, the compound context models may be modified to involve more genomes and adjacent quality scores, but the context dilution problem may appear as the number of context models increases. An alternative solution may be to use deep learning techniques to estimate the marginal probability of every quality score, replacing the current context modeling. In future work, we will concentrate on both of the above strategies and extend ACO to third-generation sequencing data.

Abbreviations

NCBI: National Center for Biotechnology Information;
NGS: next-generation sequencing;
WGS: whole-genome sequencing;
QS: quality score;
CR: compression ratio;
BPQ: bits per quality value.

Acknowledgements

We would like to thank the Editor and the reviewers for their precious comments on this work, which helped improve the quality of this paper.

Funding

This work was supported in part by the NSFC (Nos. 61875157, 61672404, 61632019, 61751310 and 61836008), the National Defense Basic Scientific Research Program of China (JCKY2017204B102), the Science and Technology Plan of Xi'an (20191122015KYPT011JC013), the Fundamental Research Funds for the Central Universities of China (Nos. RW200141, JC1904 and JX18001), and the National Key Research and Development Project of China (2018YFB2202400).

Availability of data and materials

Project name: ACO
Project website: https://github.com/Yoniming/ACO
Operating systems: Linux or Windows
Programming language:
Other requirements: GCC compiler and the archiving tool 'tar'
License: The MIT License
Any restrictions to use by non-academics: for commercial use, please contact the authors.
All datasets were downloaded from the SRA of NCBI. All data supporting the conclusions of this article are included within the article and its additional files.
Ethics approval and consent to participate
Not applicable.

Competing interests

The authors declare that they have no competing interests.

Consent for publication
Not applicable.
Authors' contributions
YN and MM conceived the algorithm, developed the program, and wrote the manuscript. FL and GS helped with manuscript editing, and designed and performed experiments. XL prepared the data sets, carried out analyses and helped with program design. All authors read and approved the final manuscript.
Author details
School of Artificial Intelligence, Xidian University, Xi'an, 710071, China.
Peng Cheng Laboratory, Shenzhen, 518055, China.
References

1. You, Z.-H., Yin, Z., Han, K., Huang, D.-S., Zhou, X.: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 11(1), 343 (2010)
2. Wetterstrand, K.A.: DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP). www.genome.gov/sequencingcostsdata (2016)
3. Stephens, Z.D.: Big data: astronomical or genomical? PLoS Biology 13(7), e1002195 (2015)
4. Ochoa, I., Hernaez, M., Goldfeder, R., Weissman, T., Ashley, E.: Effect of lossy compression of quality scores on variant calling. Briefings in Bioinformatics 18(2), 183-194 (2016)
5. Bonfield, J.K., Mahoney, M.V.: Compression of FASTQ and SAM format sequencing data. PLoS ONE 8(3), e59190 (2013)
6. Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479-486 (2011)
7. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V., Varghese, G.: Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology 18(3), 401-413 (2011)
8. Xing, Y., Li, G., Wang, Z., Feng, B., Song, Z., Wu, C.: GTZ: a fast compression and cloud transmission tool optimized for FASTQ files. BMC Bioinformatics 18(16), 549 (2017)
9. Jones, D.C., Ruzzo, W.L., Peng, X., Katze, M.G.: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40(22), e171 (2012)
10. Sanger, F., Nicklen, S., Coulson, A.R.: DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74(12), 5463-5467 (1977)
11. Geiger, B., Bershadsky, A., Pankov, R., Yamada, K.M.: Transmembrane crosstalk between the extracellular matrix and the cytoskeleton. Nature Reviews Molecular Cell Biology 2(11), 793-805 (2001)
12. Das, S., Vikalo, H.: Base-calling for Illumina's next-generation DNA sequencing systems via Viterbi algorithm. In: 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1733-1736. IEEE (2011)
13. Leinonen, R., Sugawara, H.: The International Nucleotide Sequence Database (2010)
Figures

Figure 1: Quality scores distribution curve of ERR2438054
Figure 2: Schematic diagram of sequencing principle
Figure 3: Distribution of quality scores made by FASTQ
Figure 4: Comparison of traditional scanning and ACO scanning
Figure 5: The way of row mean quantization
Figure 6: Composite context modeling strategy (an example base sequence A C T G C A T with quality scores 33 35 38 34 35 36 35; the quality score being encoded is conditioned on the previous quality score context, the base sequence context, and the average quality value of its line)

Correspondence: niuyi@mail.xidian.edu.cn
School of Artificial Intelligence, Xidian University, Xi'an, 710071, China
Full list of author information is available at the end of the article