
ACO: lossless quality score compression based on adaptive coding order

Yi Niu
Xidian University

Mingming Ma
Xidian University

Fu Li
Xidian University

Xianming Liu
Peng Cheng Laboratory

Guangming Shi
Xidian University
Research Article

Keywords: High-throughput sequencing, quality score compression, lossless compression, adaptive coding order
Posted Date: April 22nd, 2021
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

ACO: lossless quality score compression based on adaptive coding order

Yi Niu, Mingming Ma, Fu Li, Xianming Liu and Guangming Shi

Abstract

Background: With the rapid development of high-throughput sequencing technology, the cost of whole genome sequencing is dropping rapidly, which leads to exponential growth of genome data. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging.

Results: In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlative trajectory according to the sequencing process. In cooperation with adaptive arithmetic coding and context modeling, ACO achieves state-of-the-art quality score compression performance with moderate complexity.

Conclusions: This competence enables ACO to serve as a candidate tool for quality score compression. ACO has been adopted by AVS (Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/code.

Keywords: High-throughput sequencing; quality score compression; lossless compression; adaptive coding order

Background

Sequencing technology has gradually become a basic tool widely used in biological research [1]. Obtaining the genetic information of different organisms helps us improve our understanding of the organic world. In the past decades, the price of human whole genome sequencing (WGS) has dropped rapidly, declining even faster than Moore's Law would predict [2]. As a result, the amount of next-generation sequencing (NGS) data grows exponentially and even exceeds that of astronomical data [3]. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry.
There are two major problems in the compression of DNA data: nucleotide compression and quality score compression. The quality scores take up more than half of the compressed data and have been shown to be more difficult to compress than the nucleotide data. Especially with the development of assembly techniques, nucleotide compression has achieved significant improvement, which makes quality score compression one of the main bottlenecks in current DNA data storage and transfer applications.
The quality score (QS) represents the confidence level of each base character in the sequencing procedure, but is drawn from a much larger alphabet (41-46 distinct levels). [7] reveals that there are strong correlations among adjacent quality scores, which can be regarded as the foundation of the current lossless quality score compression pipeline: 1) use a Markov model to estimate the conditional probability of each quality score; 2) traverse every position of the reads in raster scan order; 3) encode the quality scores via arithmetic or range coding.
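To make this baseline concrete, the sketch below (our own minimal illustration, not the implementation of any of the tools discussed here) estimates the ideal arithmetic-code length of a raster-order traversal under an adaptive order-1 Markov model with Laplace smoothing; the toy reads and all names are assumptions:

```python
import math
from collections import defaultdict

def raster_code_length(reads, levels=46):
    """Ideal arithmetic-code length (in bits) of quality scores traversed
    in raster order under an adaptive order-1 Markov model."""
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)
    bits = 0.0
    for read in reads:            # raster scan: one read after another
        ctx = None                # previous quality score in the read
        for q in read:
            p = (counts[ctx][q] + 1) / (totals[ctx] + levels)  # Laplace smoothing
            bits -= math.log2(p)
            counts[ctx][q] += 1
            totals[ctx] += 1
            ctx = q
    return bits

reads = [[35, 36, 36, 34], [35, 35, 33, 30]]  # toy quality scores
print(round(raster_code_length(reads), 1))
```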
Based on the above pipeline, three representative lossless compressors have been proposed: GTZ [8], Quip [9] and FQZcomp [5]. The only differences among these three works are the Markov model orders and the context quantization strategies, so their compression ratios are close to one another, varying with the data distribution. This unavoidably raises the pessimistic view that there is not much room for further improvement of the lossless compression ratio.
In this paper, by reinvestigating the sequencing process, we reveal two main drawbacks of the existing raster-scan-based quality score compression strategy. Firstly, the raster scan order is a "depth-first" traversal of the reads. However, as indicated in [7], the quality scores show a descending trend along a single read. This makes the piece-wise stationary assumption of Markov modeling untenable. Secondly, although the sequencing process is conducted by multi-spectral imaging, the FASTQ file simply stores the quality scores as a stack of 1D signals. Raster-scan-based techniques compress every read independently, which fails to exploit the potential 2D correlations between spatially adjacent reads (as opposed to adjacent reads in the FASTQ file).
To overcome the above two drawbacks, we propose a novel quality score compression technique based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores along the most correlative directions, which can be regarded as reorganizing the stack of independent 1D quality score vectors into highly related 2D matrices. Another improvement of the proposed ACO technique over existing techniques is the compound context modeling strategy. As detailed in the Methods section, besides the adjacent QS values, the ACO context models consist of two additional aspects: 1) the global average of every read; 2) the variation of the DNA bases. The compound context model not only benefits probability estimation and arithmetic coding; more importantly, in the implementation, it prevents ACO from multiple random accesses of the input FASTQ file: the compressing process can be accomplished in a single pass, at the cost of some context dilution and side information.
Experimental results show that the proposed ACO technique achieves state-of-the-art performance for lossless quality score compression, with clear gains in compression ratio over FQZcomp [5]. The only drawback of ACO is the memory cost: compared with FQZcomp, ACO requires additional memory to buffer the quality score matrices and to store the compound context models, which should no longer be a big problem for a current PC.

Declaration

In this section, we first analyze the data characteristics of the quality scores and illustrate, through a concrete example, that the coding order has a measurable impact on quality score compression, which motivates us to compress the quality scores along the direction with the strongest data correlation. Secondly, by analyzing the sequencing principle and the generation process of the FASTQ file, we explore the extra relevance in the quality score data to build a novel composite context quantization model.
Impact of coding order
The quality score represents an estimate of the probability that the corresponding nucleotide in the read was called erroneously; it is an evaluation of the reliability of the base character. This information is used both for quality control of the original data and for downstream analysis. Fig. 1 shows the quality score distributions of four reads of ERR2438054. It can be seen that, due to the influence of noise, the quality score is a random and unstable signal, while adjacent quality scores are strongly correlated. Therefore, we can exploit these characteristics to change the coding order and improve the compression ratio. Changing the order does not sound like it should change the entropy, because according to information theory the information content of a source is a function of the symbol probabilities, represented by the entropy of the source. However, since an adaptive arithmetic coder is used and the coder updates its symbol probabilities regularly, changing the order can reduce the size of the bitstream. A discussion of the principle of arithmetic coding is not the main content of this paper, so we only give a test experiment to show the influence of the coding order on the compression result. Firstly, we create two random signals X and Y, and let Z1 = [X, Y] be their concatenation. Then, we randomly permute Z1 and record the result as Z2, and sort Z1 by value and record the result as Z3. Finally, the three signals are encoded by an adaptive order-0 arithmetic coder, and the resulting bitstream sizes satisfy |Z3| < |Z1| < |Z2|. This is because the sorting process is equivalent to placing data with similar distribution and strong correlation together. Coding after changing the order cooperates better with the probability update mechanism of the adaptive arithmetic coder.
Fig. 1: Quality score distribution curves of four reads of ERR2438054
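The effect of the coding order is easy to reproduce. The following sketch (again our own illustration; the two Gaussian signals, the alphabet size and the rescaling limit are assumptions) measures the ideal code length under an adaptive order-0 model with periodic count rescaling, the forgetting mechanism that makes practical adaptive arithmetic coders sensitive to symbol order:

```python
import math, random
from collections import defaultdict

def adaptive_order0_bits(seq, alphabet_size=64, rescale_limit=256):
    """Ideal code length (bits) under an adaptive order-0 model with
    periodic count rescaling, as used in practical arithmetic coders."""
    counts, total, bits = defaultdict(int), 0, 0.0
    for s in seq:
        p = (counts[s] + 1) / (total + alphabet_size)  # Laplace smoothing
        bits -= math.log2(p)
        counts[s] += 1
        total += 1
        if total >= rescale_limit:          # forget old statistics
            for k in list(counts):
                counts[k] //= 2
            total = sum(counts.values())
    return bits

random.seed(0)
x = [max(0, min(63, round(random.gauss(30, 2)))) for _ in range(10000)]
y = [max(0, min(63, round(random.gauss(38, 2)))) for _ in range(10000)]
z1 = x + y                       # Z1: concatenated signals
z2 = random.sample(z1, len(z1))  # Z2: randomly shuffled
z3 = sorted(z1)                  # Z3: sorted by value
for name, z in (("Z1", z1), ("Z2", z2), ("Z3", z3)):
    print(name, round(adaptive_order0_bits(z)))
```

Because rescaling weights recent symbols more heavily, the sorted Z3 codes far below the shuffled Z2, reproducing the ordering reported above.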

Mining more relevant information

Take the currently widely used HiSeq sequencing platform as an example; the sequencing process consists of three steps: 1) construction of the DNA library, 2) generation of DNA clusters by bridge PCR amplification, and 3) sequencing. In this paper we reinvestigate the sequencing step to mine more inherent correlations among the quality scores to aid the compression task. The basic principle of sequencing is multi-spectral imaging of the flowcell. The flowcell is a carrier for sequencing, and each flowcell has eight lanes with chemically modified inner surfaces. Each lane contains 96 tiles and each tile has a unique cluster. Every cluster corresponds to one DNA fragment, so one flowcell can generate 768 DNA reads simultaneously.
As shown in Fig. 2, the sequencing process consists of five steps. In step 1, the polymerase and one type of dNTP are added into the flowcell to activate the fluorescence of the specific clusters. In step 2, the multi-spectral camera takes one shot of the flowcell at the wavelength corresponding to the added dNTP. In step 3, chemical reagents are used to wash the flowcell to prepare for the next imaging. The above three steps are repeated four times with different dNTPs and different imaging wavelengths to obtain a four-channel multi-spectral image. In step 4, based on the captured four-channel image, the sequencing machine not only estimates the most likely type of every cluster but also evaluates the confidence level of the estimate; these are stored as the bases and the quality scores respectively. Steps 1-4 constitute one sequencing cycle, which sequences one position (depth) of all 768 reads in the flowcell. Thus, in step 5, the sequencing cycle is repeated several times, and the number of cycles corresponds to the length of the reads.
As discussed in detail below, three factors correlate with the quality score values: 1) the number of cycles, 2) base changes, and 3) the local position on the chip.
The number of cycles affects the distribution of the quality scores. DNA polymerases are used in the synthesis and sequencing process; at the beginning of sequencing, the synthesis reaction is not very stable, but the quality of the enzyme is very good, so the scores fluctuate in the high-quality region. As sequencing progresses, the reaction becomes stable, but the enzyme activity and specificity gradually decrease, so the cumulative error is gradually amplified. As a result, the probability of error increases and the overall quality score shows a downward trend. As shown in Fig. 3, as sequencing progresses, the mean of the quality scores decreases gradually while the variance increases. Therefore, it is improper to treat each read as a stationary random signal along the traditional raster scan order.
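Such per-cycle statistics are simple to compute; the helper below is our own sketch, assuming the quality scores are held as a list of equal-length reads:

```python
def column_stats(reads):
    """Per-position (per-cycle) mean and variance of quality scores
    across a set of equal-length reads."""
    K = len(reads[0])
    means, variances = [], []
    for k in range(K):
        col = [r[k] for r in reads]
        m = sum(col) / len(col)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in col) / len(col))
    return means, variances

reads = [[38, 37, 35, 30], [39, 36, 34, 28], [38, 37, 33, 25]]
print(column_stats(reads))  # mean falls and variance grows with position
```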
Base changes also affect the distribution of the quality scores. As discussed before, the recognition of base types in a flowcell is conducted in a four-step loop over the dNTPs and wavelengths. For example, assume the loop order is 'A-C-G-T'. If the bases of a read are '...AA...', after the imaging of the first 'A', the flowcell is washed four times until the imaging of the second 'A'. But if the bases are '...TA...', the machine washes the flowcell only once before the imaging of 'A'. In this way, if the cluster contains some residuals, the former base will affect the imaging of the latter 'A', which may cause ambiguity, and the quality score of 'A' drops significantly. Although some machines adopt compound dNTPs to replace the four-step loop, residuals still affect the quality scores. Therefore, for quality score compression, base changes should be used as side information to model the marginal probability of every quality score.
The local position on the chip affects the distribution of the quality scores. The flowcell can be regarded as a 2D array in which every cluster corresponds to an entry. The fluorescence of a high-amplitude entry may diffuse to the adjacent entries of the array, the well-known "cross-talk" phenomenon [11]. In other words, there are spatial correlations among adjacent quality scores. However, the stored FASTQ file is a 1D stack of all the reads, which ignores this correlation. Therefore, the compression of quality scores should mine the potential 2D spatial correlations among the reads.
Fig. 3: Distribution of quality scores in the FASTQ file

Methods

In this section, we discuss the proposed adaptive coding order (ACO) based quality score compression technique. The two contributions of ACO are: 1) an adaptive scan order that replaces the traditional raster scan order and yields a more stationary signal; 2) a compound context model that accounts for the influence of base changes while exploring the potential correlations among quality scores.

Traversing the quality scores along the most correlative directions

As can be seen from Fig. 3, as the read length increases, the column mean decreases but the variance becomes larger, which indicates that there is a strong correlation between columns. At the same time, the reduction of the column mean is consistent with the actual sequencing process, in which the quality scores show a descending trend along a single read. It has been verified that changing the scan order can improve the performance of the adaptive arithmetic coder, and coding along the more stable signal gives a better coding result.
All compression methods based on arithmetic coding use the scan method in Fig. 4(a) when traversing the data for encoding. Under this scan method, the quality scores are encoded line by line; after scanning one line, the next step starts from the beginning of the following line. Obviously, after encoding the last character of one line, jumping to the first character of the next line causes a large discontinuity, and this jump makes the transition between signals unstable. We therefore use an adaptive scan order, shown in Fig. 4(b), to replace the traditional raster scan order and realize a stable traversal of the signal. The scan starts at the first element and traverses down the column until the column ends, then traverses back up from the end of the next column. Unlike the traditional scanning method, ACO adopts a snake-shaped scan. The reason for the snake traversal is to make the transition between columns smoother: the symbols at the end of one column are more relevant to the symbols at the end of the next column, and the correlation between the red and green symbols in Fig. 4(b) is obviously stronger than that between the red and blue symbols. Therefore, after the red symbol is encoded, it is more appropriate to continue the second column from the green symbol than from the blue one. By changing the scanning order, the encoding proceeds in a more stable direction, and the probability update mechanism of the adaptive arithmetic coder is fully utilized without introducing other factors.
Fig. 4: Comparison of traditional scanning and ACO scanning: (a) traditional raster traversal; (b) adaptive snake scan order
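A minimal sketch of the snake traversal described above (our illustration; the layout with reads as rows and read positions as columns follows Algorithm 1, the toy matrix is assumed):

```python
def snake_order(N, K):
    """Yield (row, col) indices of an N x K quality score matrix in ACO's
    snake scan: down one column, back up the next, so consecutive symbols
    are always spatial neighbors."""
    for k in range(K):                                   # columns = read positions
        rows = range(N) if k % 2 == 0 else range(N - 1, -1, -1)
        for n in rows:
            yield n, k

# A toy 3-read x 4-position matrix of quality scores.
Q = [[33, 35, 38, 34],
     [32, 35, 37, 35],
     [34, 36, 38, 33]]
print([Q[n][k] for n, k in snake_order(3, 4)])
# -> [33, 32, 34, 36, 35, 35, 38, 37, 38, 33, 35, 34]
```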
Compound context modeling

As the Declaration section explains, the compression of quality scores should mine the potential 2D spatial correlations among the reads, so we use a compound context model to express the extra relevance in the quality score data. The ACO context model contains two additional aspects; the first is the global average of each read. From the example in the Declaration, it can be seen that adjusting the data order so that similar symbols cluster together yields good compression results. As shown in Fig. 1, the distribution curves of the four reads are very similar and differ only at a few singular points. Clustering and coding the rows with similar distributions is therefore a promising strategy, but computing the distribution of each row takes many steps, and clustering similar rows also costs time and space. Instead, we compute the mean of each row to reflect its distribution and group rows with the same mean value. For row information, the mean is a measure of stationarity: rows with the same mean can be regarded as having approximately the same distribution over the whole row, even though a few singular points may keep the distribution curves from coinciding completely. Compared with computing the Kullback-Leibler divergence between rows, using the row mean saves a great deal of computation and time without wasting the correlation between rows. The row clustering method would need to transmit extra information to the decoder to record the change of row order; the mean-based method faces the same problem, as the mean of each row must also be transmitted to the decoder. In practice, we compare the extra coding cost against the actual gain brought by the row mean; only when the gain exceeds the cost do we add the row mean information to the context.
Specifically, in the process of building the context model, the problem of context dilution arises, so we need to design a suitable quantization method for the mean value to solve it. This is a dynamic programming problem, and the optimization objective of quantizing a discretely distributed random variable is to minimize the distortion. The objective is

$$D=\sum_{i=1}^{N} p\left(x_{i}\right)\, d\left(x_{i}, \hat{x}_{i}\right), \tag{1}$$

where $x_{i}$ is a value of $X$ that has nonzero probability, $\hat{x}_{i}$ is the quantization value of $x_{i}$, and $d(\cdot, \cdot)$ is a specific distortion measure. We can define a condition set $\mathcal{X}=\{x_{1}, \ldots, x_{N}\}$ to indicate that each specific $x_{i}$ corresponds to a specific quantization value. Define a quantized set $\mathcal{Q}=\{\hat{x}_{1}, \ldots, \hat{x}_{L}\}$ with $L<N$, where $\hat{x}_{l}$ represents the quantized variable, so $Q(x_{i})=\hat{x}_{l}$ if $x_{i} \in \mathcal{C}_{l}$. Therefore, each subset $\mathcal{C}_{l}$ corresponds to a partition of $\mathcal{X}$, with $\bigcup_{l=1}^{L} \mathcal{C}_{l}=\mathcal{X}$. The expressions for $\mathcal{C}_{l}$ and $\hat{x}_{l}$ are as follows:

$$\mathcal{C}_{l}=\left\{x_{i}: Q\left(x_{i}\right)=\hat{x}_{l}\right\}, \tag{2}$$

$$\hat{x}_{l}=\frac{\sum_{x_{i} \in \mathcal{C}_{l}} p\left(x_{i}\right)\, x_{i}}{\sum_{x_{i} \in \mathcal{C}_{l}} p\left(x_{i}\right)}. \tag{3}$$

It can be seen that the sets $\mathcal{C}_{1}$ to $\mathcal{C}_{L}$ together contain all the $x_{i}$, so $\hat{x}_{l}$ can be regarded as the quantized value of every $x_{i} \in \mathcal{C}_{l}$. Following Equ. (1), we obtain the context quantization objective

$$J=\min _{\left\{\mathcal{C}_{l}\right\}} \sum_{l=1}^{L} \sum_{x_{i} \in \mathcal{C}_{l}} p\left(x_{i}\right)\, d\left(x_{i}, \hat{x}_{l}\right), \tag{4}$$

where $J$ is the quantization objective function, and minimizing $J$ means obtaining at least locally optimal quantization results. The optimization objective of the context quantization becomes: for a given number of quantization levels $L$, find an optimum partition scheme $\{\mathcal{C}_{l}\}$ for $\mathcal{X}$, then calculate the optimum quantization values $\hat{x}_{l}$ for all partitions so that Equ. (4) is minimized.
By solving this dynamic programming problem, we obtain a result that suits the test data. To improve computational efficiency in the actual compression process, we use the precomputed result as the quantization method. If necessary, to improve the compression ratio for a specific file, users can solve the optimization problem separately and obtain the best quantization for it. Our way of quantizing the row characteristics is shown in Fig. 5.
Fig. 5: The way of row mean quantization
It can be seen that the quantization in this case is very similar to lossy compression with thresholds. The difference is that we quantize the row characteristics without affecting lossless decoding: we merely extract the correlation features between rows and use dynamic programming to obtain a better result. The final quantization method also takes the current value into account.
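For concreteness, one way to solve the partition problem of Equ. (4) by dynamic programming is sketched below, under our own assumptions of a squared-error distortion and contiguous intervals over the sorted values; the released tool may quantize differently:

```python
def optimal_quantizer(values, probs, L):
    """Minimize Equ. (4) with squared-error distortion: DP over contiguous
    partitions of the sorted values into L intervals, each represented by
    its probability-weighted centroid (Equ. (3))."""
    N = len(values)

    def interval_cost(i, j):  # distortion of one interval values[i..j]
        p = sum(probs[i:j + 1])
        centroid = sum(v * q for v, q in zip(values[i:j + 1], probs[i:j + 1])) / p
        return sum(q * (v - centroid) ** 2
                   for v, q in zip(values[i:j + 1], probs[i:j + 1]))

    INF = float("inf")
    # D[j][l]: minimal distortion of the first j values split into l intervals.
    D = [[INF] * (L + 1) for _ in range(N + 1)]
    back = [[0] * (L + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for j in range(1, N + 1):
        for l in range(1, min(L, j) + 1):
            for i in range(l - 1, j):  # last interval covers values[i..j-1]
                c = D[i][l - 1] + interval_cost(i, j - 1)
                if c < D[j][l]:
                    D[j][l], back[j][l] = c, i
    # Recover the interval boundaries.
    bounds, j, l = [], N, L
    while l > 0:
        i = back[j][l]
        bounds.append((values[i], values[j - 1]))
        j, l = i, l - 1
    return D[N][L], bounds[::-1]

# Quantize row means 30..40 (uniform toy probabilities) to 4 levels.
vals = list(range(30, 41))
print(optimal_quantizer(vals, [1 / 11] * 11, 4))
```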
On the other hand, the distribution of quality score is random and the waveform will be not smooth transition influenced by quality score values. So different from modeling by the waveform, we can start with sequencing principle for these singular points. [12] reveals that quality score between adjacent bases is usually similar and the probability distribution of the quality score is affected by the base distribution. Considering that there are some natural similarities in the process of obtaining nucleotide sequences, the distribution of bases is regarded as a criterion to measure whether there is a singular point in the quality score, which is used to simulate the stationarity between two symbols. The increase in the base order will cause the model to grow exponentially, and balancing the model size and effect improvement rate, we choose the second-order to describe the correlation between base
另一方面,质量分数的分布是随机的,波形也不会受质量分数值的影响而平滑过渡。因此,与波形建模不同,我们可以从这些奇异点的测序原理入手。文献[12]揭示了相邻碱基之间的质量得分通常是相似的,质量得分的概率分布受碱基分布的影响。考虑到核苷酸序列在获取过程中存在一些天然的相似性,碱基的分布被视为衡量质量得分是否存在奇异点的标准,用来模拟两个符号之间的静止性。碱基阶数的增加会使模型呈指数增长,在平衡模型大小和效果改进率的基础上,我们选择二阶来描述碱基之间的相关性

and quality score. In FASTQ file, a quality score corresponds to a base and the conditional entropy is:
和质量得分。在 FASTQ 文件中,质量得分对应一个碱基,条件熵为:
where is the base value of the quality score of the current code and this formula shows that the influence of base on entropy. After synthesizing all the context models, we provide the final composite context modeling strategy in Fig. 6 and the ACO algorithm in Algorithm 1.
其中 是当前代码质量得分 的基础值,该公式显示了基础值对熵的影响。综合所有上下文模型后,我们在图 6 中提供了最终的复合上下文建模策略,并在算法 1 中提供了 ACO 算法。
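The conditional entropy above can be estimated directly from paired (base, quality score) counts. The helper below is our own sketch, not part of the released tool:

```python
import math
from collections import defaultdict

def conditional_entropy(bases, quals):
    """Empirical H(Q | B) in bits per symbol from paired observations."""
    joint, marg = defaultdict(int), defaultdict(int)
    for b, q in zip(bases, quals):
        joint[(b, q)] += 1
        marg[b] += 1
    n = len(bases)
    return -sum(c / n * math.log2(c / marg[b]) for (b, _), c in joint.items())

# Quality mostly determined by the base -> only 0.25 bits remain.
print(conditional_entropy("ACGTACGT", [35, 30, 35, 30, 35, 31, 35, 30]))
```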
Fig. 6: Composite context modeling strategy
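Putting the pieces together, a toy version of the composite context of Fig. 6 and Algorithm 1 might look as follows (our reading of the pseudocode; the zero padding of out-of-range neighbors and the tuple representation are assumptions):

```python
def context_index(Q, P, M, n, k):
    """Composite context for quality score Q[n][k]: quantized row mean (A),
    two causal quality maxima (B, C), a local stationarity flag (D) and the
    current/previous bases (E)."""
    q = lambda j: Q[n][j] if j >= 0 else 0        # zero-pad causal neighbors
    A = M[n]                                      # quantized mean of read n
    B = max(q(k - 1), q(k - 2))
    C = max(q(k - 3), q(k - 4))
    D = 0 if q(k - 3) == q(k - 4) else 1
    E = (P[n][k], P[n][k - 1] if k >= 1 else 'N')
    return (A, B, C, D, E)

Q = [[33, 35, 38, 34, 35, 36, 35]]
P = [list("ACTGCAT")]
M = [35]                                          # quantized row mean
print(context_index(Q, P, M, 0, 4))               # -> (35, 38, 35, 1, ('C', 'G'))
```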

Results and discussion

In this section, we compare the performance of our algorithm ACO with other state-of-the-art algorithms and report the results. We compare our algorithm with general-purpose compression algorithms such as gzip and 7-Zip, as well as a set of domain-specific algorithms, namely GTZ, fqzcomp and quip. We restrict our focus to lossless compression and do not evaluate the many promising lossy methods, nor methods only capable of compressing nucleotide sequences. For the fqzcomp algorithm, we compare the results of the q1 and q2 compression modes; GTZ does not report the quality score compression results separately, so we compare its normal mode. It is important to note that ACO has more advantages in compressing aligned quality scores and does not accept any input other than a raw FASTQ file. We also exclude from the comparison algorithms that require a reference genome. The datasets used in our experiments were downloaded in FASTQ format from the National Center for Biotechnology Information Sequence Read Archive (NCBI-SRA) database [13] and are presented in Table 1.
All the experiments were run on a server with an Intel Core i9-9900K CPU, 2TB of disk space and Ubuntu 16.04. All algorithms are compared in terms of compression ratio (CR) and bits per quality value (BPQ), defined as

$$\mathrm{CR}=\frac{S_{c}}{S_{o}} \times 100\%, \qquad \mathrm{BPQ}=\frac{8\, S_{c}}{S_{o}},$$

where $S_{c}$ denotes the compressed file size and $S_{o}$ the size of the quality score stream before compression (one byte per quality value in ASCII).
Algorithm 1: ACO algorithm framework
Input: A FASTQ file.
Output: The compressed quality score file.
    STEP 1: Data preprocessing
    1. Use an N x K matrix Q to store the quality scores of the FASTQ file.
    2. Use an N x K matrix P to store the base values of the FASTQ file.
    3. Scan Q once to find symbol_max and symbol_min, and set
       symbol_number = symbol_max - symbol_min + 1.
    STEP 2: Composite context model
    1. Calculate model_num by model_num = symbol_number^3.
    2. Use an N x 1 vector M to store the mean quality value of each read.
    for all 0 <= n < N do
        for all 0 <= k < K do
            A = M(n)
            B = max[Q(n, k-1), Q(n, k-2)]
            C = max[Q(n, k-3), Q(n, k-4)]
            if Q(n, k-3) == Q(n, k-4) then D = 0
            else D = 1
            E = [P(n, k), P(n, k-1)]
            model_idx = [A, B, C, D, E]
        end for
    end for
    STEP 3: Snake traversal coding
    for all 0 <= k < K do
        for all 0 <= n < N do
            if k % 2 == 0 then i = n
            else i = N - n - 1
            use the arithmetic encoder to compress Q(i, k) with model_idx
        end for
    end for
The compression results of all algorithms on the NGS datasets are summarized in Table 2.
Table 2 reports the improvement of ACO relative to each comparison algorithm and further reflects the advantage of our method in terms of compression ratio. Compared with Gzip, and with 7-Zip under its optimal setting, ACO reduces the compressed file size substantially on average. The results show that the proposed ACO algorithm achieves better results on the six representative datasets. In particular, ACO obtains the best average compression ratio, resulting in a considerable size
Table 1: Descriptions of the 6 FASTQ datasets used for evaluation

Run ID        Sequencing Platform    FASTQ Size (bytes)  Read Length  Quality Size (bytes)
NA12878_2     BGISEQ-500             134363357648        -            56983386200
ERR2438054_1  BGISEQ-500             133406591610        -            47097570000
ERR174324_1   Illumina HiSeq 2000    57800970448         -            22580690796
ERR174331_1   Illumina HiSeq 2000    57210954538         -            22350322320
ERR174327_1   Illumina HiSeq 2000    54724344869         -            21379957043
ERR174324_2   Illumina HiSeq 2000    57800970448         -            22580690796
Table 2: Compression results of all algorithms on the NGS datasets

Run ID        Metric  gzip   gtz    quip   fqzcomp(q1)  fqzcomp(q2)  ACO
NA12878_2     CR(%)   48.55  49.66  38.47  38.48        39.08        38.59
              BPQ     3.88   3.97   3.08   3.08         3.13         3.09
ERR2438054_1  CR(%)   46.23  47.11  37.09  36.52        37.08        36.71
              BPQ     3.70   3.77   2.97   2.92         2.97         2.94
ERR174324_1   CR(%)   36.58  36.94  25.47  26.14        27.30        25.81
              BPQ     2.93   2.96   2.04   2.09         2.18         2.06
ERR174331_1   CR(%)   36.55  36.91  25.45  26.11        27.27        25.77
              BPQ     2.92   2.95   2.04   2.09         2.18         1.89
ERR174327_1   CR(%)   35.53  35.88  24.56  25.31        26.39        24.90
              BPQ     2.84   2.87   1.96   2.02         2.11         1.99
ERR174324_2   CR(%)   38.47  38.81  27.07  27.52        28.89        27.37
              BPQ     3.08   3.10   2.17   2.20         2.31         2.19
reduction in the quality score data. At the same time, the average compressed size is much smaller than the original ASCII representation. Both evaluation criteria indicate that ACO achieves the best compression results among the compared methods on the same files. Moreover, the proposed ACO algorithm sets different modes and processing strategies according to the sequencing platform, which further raises the compression efficiency.

Conclusion and future works

This paper introduces ACO, a lossless quality score compression algorithm based on adaptive coding order. ACO traverses the quality scores along the most correlative directions and uses a compound context modeling strategy to achieve state-of-the-art lossless compression performance. However, the current version of ACO, especially the proposed compound context modeling strategy, is designed for second-generation sequencing machines. For third-generation sequencing data, the compound context models may be modified to involve more genomic symbols and adjacent quality scores, but the context dilution problem may appear as the number of context models increases. An alternative solution may be to use deep learning techniques to estimate the marginal probability of every quality score in place of the current context modeling. In future work, we will concentrate on both of the above strategies and extend ACO to third-generation sequencing data.

Abbreviations

NCBI: National Center for Biotechnology Information;
NGS: next-generation sequencing;
WGS: whole-genome sequencing;
QS: quality score;
CR: compression ratio;
BPQ: bits per quality value.

Acknowledgements

We would like to thank the Editor and the reviewers for their valuable comments, which helped improve the quality of this paper.

Funding

This work was supported in part by the NSFC (Nos. 61875157, 61672404, 61632019, 61751310 and 61836008), the National Defense Basic Scientific Research Program of China (JCKY2017204B102), the Science and Technology Plan of Xi'an (20191122015KYPT011JC013), the Fundamental Research Funds for the Central Universities of China (Nos. RW200141, JC1904 and JX18001), and the National Key Research and Development Project of China (2018YFB2202400).

Availability of data and materials

Project name: ACO
Project website: https://github.com/Yoniming/ACO
Operating systems: Linux or Windows
Programming language:
Other requirements: GCC compiler and the archiving tool 'tar'
License: The MIT License
Any restrictions to use by non-academics: For commercial use, please contact the authors. All datasets were downloaded from the SRA of NCBI. All data supporting the conclusions of this article are included within the article and its additional files.
Ethics approval and consent to participate
Not applicable.

Competing interests

The authors declare that they have no competing interests.

Consent for publication
Not applicable.
Authors' contributions
YN and MM conceived the algorithm, developed the program, and wrote the manuscript. FL and GS helped with manuscript editing, designed and performed experiments. XL prepared the data sets, carried out analyses and helped with program design. All authors read and approved the final manuscript.
Author details
School of Artificial Intelligence, Xidian University, Xi'an, 710071, China.
The Pengcheng Lab, Shenzhen, 518055, China.
References

1. You, Z.-H., Yin, Z., Han, K., Huang, D.-S., Zhou, X.: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 11(1), 343 (2010)
2. Wetterstrand, K.A.: DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP). www.genome.gov/sequencingcostsdata (2016)
3. Stephens, Z.D.: Big data: astronomical or genomical? PLoS Biology 13(7), e1002195 (2015)
4. Ochoa, I., Hernaez, M., Goldfeder, R., Weissman, T., Ashley, E.: Effect of lossy compression of quality scores on variant calling. Briefings in Bioinformatics 18(2), 183-194 (2016)
5. Bonfield, J.K., Mahoney, M.V.: Compression of FASTQ and SAM format sequencing data. PLoS ONE 8(3), e59190 (2013)
6. Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479-486 (2011)
7. Kozanitis, C., Saunders, C., Kruglyak, S., Bafna, V., Varghese, G.: Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology 18(3), 401-413 (2011)
8. Xing, Y., Li, G., Wang, Z., Feng, B., Song, Z., Wu, C.: GTZ: a fast compression and cloud transmission tool optimized for FASTQ files. BMC Bioinformatics 18(16), 549 (2017)
9. Jones, D.C., Ruzzo, W.L., Peng, X., Katze, M.G.: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40(22), e171 (2012)
10. Sanger, F., Nicklen, S., Coulson, A.R.: DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74(12), 5463-5467 (1977)
11. Geiger, B., Bershadsky, A., Pankov, R., Yamada, K.M.: Transmembrane crosstalk between the extracellular matrix and the cytoskeleton. Nature Reviews Molecular Cell Biology 2(11), 793-805 (2001)
12. Das, S., Vikalo, H.: Base-calling for Illumina's next-generation DNA sequencing systems via Viterbi algorithm. In: 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1733-1736 (2011). IEEE
13. Leinonen, R., Sugawara, H.: The International Nucleotide Sequence Database (2010)
Figures

Figure 1: Quality score distribution curves of ERR2438054
Figure 2: Schematic diagram of the sequencing principle
Figure 3: Distribution of quality scores in the FASTQ file
Figure 4: Comparison of traditional scanning and ACO scanning
Figure 5: The way of row mean quantization
Figure 6: Composite context modeling strategy

Correspondence: niuyi@mail.xidian.edu.cn
School of Artificial Intelligence, Xidian University, Xi'an, 710071, China
Full list of author information is available at the end of the article