
ACO: lossless quality score compression based on adaptive coding order

Niu Yi
Xidian University

Ma Mingming
Xidian University

Li Fu
Xidian University

Liu Xianming
Peng Cheng Laboratory

Shi Guangming
Xidian University

Research Article

Keywords: High-throughput sequencing, quality score compression, lossless compression, adaptive coding order
Posted Date: April 22nd, 2021
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

ACO: lossless quality score compression based on adaptive coding order

Yi Niu, Mingming Ma, Xianming Liu and Guangming Shi

Abstract

Background: With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing is dropping rapidly, which leads to exponential growth of genome data. Although the compression of DNA bases has achieved significant improvement in recent years, the compression of quality scores is still challenging.

Results: In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on an adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlative trajectory according to the sequencing process. In cooperation with adaptive arithmetic coding and context modeling, ACO achieves state-of-the-art quality score compression performance with moderate complexity.

Conclusions: This competence enables ACO to serve as a candidate tool for quality score compression. ACO has been adopted by AVS (Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/code.

Keywords: High-throughput sequencing; quality score compression; lossless compression; adaptive coding order

Background
Sequencing technology has gradually become a basic technology widely used in biological research [1]. Obtaining the genetic information of different organisms can help us improve our understanding of the organic world. In the past decades, the price of human whole-genome sequencing (WGS) has dropped rapidly, declining faster than Moore's law would predict [2]. In this case, the amount of next-generation sequencing (NGS) data grows exponentially, even exceeding that of astronomical data [3]. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry.
There are two major problems in the compression of DNA data: nucleotide compression and quality score compression. The quality values take up more than half of the compressed data and have been shown to be more difficult to compress than the nucleotide data. Especially with the development of assembling techniques, nucleotide compression has achieved significant improvement, which makes the quality score compression problem one of the main bottlenecks in current DNA data storage and transfer applications.
The quality score (QS) represents the confidence level of every base character in the sequencing procedure, but with a much larger alphabet (41-46 distinct levels) than the four nucleotide symbols. [7] reveals that there are strong correlations among adjacent quality scores, which can be regarded as the foundation of the current lossless quality score compression pipeline: 1) using a Markov model to estimate the conditional probability of each quality score; 2) traversing every position of the reads in raster scan order; 3) encoding the quality scores via arithmetic or range coding.
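As an illustrative sketch of this pipeline (a toy under assumed parameters such as the alphabet size and Laplace-smoothed counts, not the code of any compressor discussed here), the following estimates the bitstream size of an adaptive coder on a correlated toy read, with and without an order-1 Markov context:

```python
import math

def adaptive_code_length(symbols, alphabet=41, use_context=False):
    """Estimate the bitstream size (bits) of an adaptive arithmetic coder
    as the sum of -log2 p(symbol); with use_context=True the previous
    quality score selects a separate count table (order-1 Markov model)."""
    tables = {}  # context -> Laplace-smoothed symbol counts
    bits, prev = 0.0, 0
    for s in symbols:
        ctx = prev if use_context else 0
        counts = tables.setdefault(ctx, [1] * alphabet)
        total = sum(counts)
        bits += -math.log2(counts[s] / total)
        counts[s] += 1  # adaptive probability update
        prev = s
    return bits

# toy "read" whose scores move in small steps (strong adjacent correlation)
read = [30, 31, 31, 32, 31, 30, 30, 29, 30, 31] * 200
print(adaptive_code_length(read, use_context=True)
      < adaptive_code_length(read, use_context=False))
```

On such correlated data the conditional (order-1) model spends fewer bits than the order-0 model, which is the premise the pipeline builds on.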
Based on the above pipeline, three distinguished lossless compressors have been proposed: GTZ [8], Quip [9] and FQZcomp [5]. The only differences among these three works are the Markov model orders and context quantization strategies; thus the compression ratio varies only slightly, depending on the data distribution. This unavoidably raises the negative view that there is not much room for further improvement of the lossless compression ratio.
In this paper, by reinvestigating the sequencing process, we reveal two main drawbacks of the existing raster-scan-based quality score compression strategy. First, the raster scan order is a "depth-first" traversal of the reads. However, as indicated in [7], the quality scores have a descending trend along a single read. This makes the piece-wise stationary assumption of Markov modeling untenable. Second, the sequencing process is conducted by multi-spectral imaging, but the FASTQ file simply stores the quality scores as a stack of 1D signals. The raster-scan-based techniques compress every read independently, which fails to exploit the potential 2D correlations among spatially adjacent reads (not the adjacent reads in FASTQ file order).
To overcome the above two drawbacks, we propose a novel quality score compression technique based on an adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores along the most correlative directions, which can be regarded as a reorganization of the stack of independent 1D quality score vectors into highly related 2D matrices. Another improvement of the proposed ACO technique over the existing techniques is the compound context modeling strategy. As detailed in the Method section, besides the adjacent QS values, the ACO context models consist of two additional aspects: 1) the global average of every read; 2) the variation of DNA bases. The compound context model not only benefits probability estimation and arithmetic coding; more importantly, in the implementation it prevents ACO from multiple random accesses of the input FASTQ file: the compression can be accomplished in a single pass, at the cost of some context dilution and side information.
Experimental results show that the proposed ACO technique achieves state-of-the-art performance for lossless quality score compression, with gains in compression ratio over FQZcomp [5]. The only drawback of ACO is the memory cost: compared with FQZcomp, ACO requires additional memory to buffer the quality score matrices and to store the compound context models, which should no longer be a big problem for a current PC.

Declaration

In this section, we first analyze the data characteristics of the quality score and illustrate, through concrete examples, that the coding order has a certain impact on quality score compression, which motivates us to compress the quality scores along the direction with the strongest data correlation. Second, by analyzing the sequencing principle and the generation process of the FASTQ file, we explore the extra relevance in the quality score data to build a novel compound context quantization model.
Impact of coding order
The quality score represents the estimated probability that the corresponding nucleotide in the read is wrong; it is an evaluation of the reliability of the base character. This information is used both for quality control of the original data and for downstream analysis. Fig. 1 shows the quality score distributions of four reads of ERR2438054. It can be seen that, due to the influence of noise, the quality score is a random and unstable signal, while there is a strong correlation between adjacent quality scores. We can therefore exploit these characteristics to change the coding order and improve the compression ratio. Changing the order does not change the entropy of the source, because according to information theory the information content of a source is a function of the symbol probabilities alone, represented by the source entropy. However, since an adaptive arithmetic encoder is used and the encoder updates the symbol probabilities regularly, changing the order can still reduce the size of the bitstream. A discussion of the arithmetic coding principle is not the main content of this paper, so we only give a test experiment to show the influence of the coding order on the compression results. First, we create a random signal and record it as Z1. Then we randomly shuffle Z1 and record the result as Z2, and sort Z1 by value and record the result as Z3. Finally, the three signals are encoded by a 0-order adaptive arithmetic encoder; the sorted signal Z3 yields the smallest bitstream. This is because the sorting process is equivalent to placing data with similar distribution and strong correlation together. Coding after changing the order cooperates better with the probability update mechanism of the adaptive arithmetic encoder.
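The test experiment above can be reproduced with a short sketch (an illustration under assumed parameters, not the ACO codebase): an order-0 adaptive model with periodic count rescaling, whose recency bias is what makes the symbol order matter:

```python
import math
import random

def adaptive_code_length(symbols, alphabet_size=41, increment=16, max_total=1024):
    """Estimate the bitstream size (bits) of an order-0 adaptive arithmetic
    coder as the sum of -log2 p(symbol). Counts are rescaled periodically,
    so the model favors recently seen symbols."""
    counts = [1] * alphabet_size
    total = alphabet_size
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += increment
        total += increment
        if total > max_total:  # periodic probability update (recency bias)
            counts = [max(1, c // 2) for c in counts]
            total = sum(counts)
    return bits

random.seed(7)
# Z1: a correlated, quality-score-like random walk over 41 levels
z1, q = [], 30
for _ in range(5000):
    q = min(40, max(0, q + random.choice([-1, 0, 1])))
    z1.append(q)
z2 = random.sample(z1, len(z1))  # shuffled: correlation destroyed
z3 = sorted(z1)                  # sorted: similar symbols clustered

b1, b2, b3 = map(adaptive_code_length, (z1, z2, z3))
print(b3 < b1 < b2)  # sorted < original < shuffled
```

Sorting clusters identical symbols into long runs, so the rescaled counts track the local distribution closely; shuffling destroys the local correlation, and the estimated bitstream grows accordingly.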
Fig. 1: Quality score distribution curves of four reads of ERR2438054

Mining more relevant information

Take the currently widely used HiSeq sequencing platform as an example; the sequencing process consists of three steps: 1) construction of the DNA library, 2) generation of DNA clusters by bridge PCR amplification and 3) sequencing. In this paper we reinvestigate the sequencing step to mine more inherent correlations among the quality scores to aid the compression task. The basic principle of sequencing is multi-spectral imaging of the flowcell. The flowcell is a carrier for sequencing, and each flowcell has eight lanes with chemically modified inner surfaces. Each lane contains 96 tiles and each tile has a unique cluster. Each cluster corresponds to one DNA patch; thus one flowcell can generate 768 DNA reads simultaneously.
As shown in Fig. 2, the sequencing process consists of five steps. In step 1, the polymerase and one type of dNTP are added into the flowcell to activate the fluorescence of the specific clusters. In step 2, the multi-spectral camera takes one shot of the flowcell at the specific wavelength corresponding to the added dNTP. In step 3, chemical reagents are used to wash the flowcell to prepare for the next imaging. These three steps are repeated four times with different dNTPs and different imaging wavelengths to obtain a four-channel multi-spectral image. In step 4, based on the captured four-channel image, the sequencing machine not only estimates the most likely base type of every cluster but also evaluates the confidence level of the estimation; these are stored as the bases and the quality scores respectively. Steps 1-4 constitute one sequencing cycle, which sequences one position (depth) of all 768 reads in the flowcell. In step 5, the sequencing cycle is repeated several times, and the number of cycles corresponds to the length of the reads.
As discussed in detail below, three aspects correlate with the quality score values: 1) the number of cycles, 2) base changes and 3) the local position on the chip.
The number of cycles affects the distribution of the quality scores. DNA polymerases are used in the synthesis and sequencing process. At the beginning of sequencing, the synthesis reaction is not very stable, but the quality of the enzyme is very good, so the quality scores fluctuate in the high-quality region. As sequencing progresses, the reaction becomes stable, but the enzyme activity and specificity gradually decrease, so the cumulative error is gradually amplified. As a result, the probability of error increases and the overall quality score shows a downward trend. As shown in Fig. 3, as sequencing progresses, the mean of the quality scores decreases gradually while the variance increases. Therefore, it is improper to treat every read as a stationary random signal along the traditional raster scan order.
The base change also affects the distribution of the quality scores. As discussed before, the recognition of base types in a flowcell is conducted in a four-step loop according to the order of dNTPs and wavelengths. For example, assume the loop order is 'A-C-G-T'. If the bases of a read are '...AA...', after the imaging of the first 'A' the flowcell is washed four times until the imaging of the second 'A'. But if the bases are '...TA...', the machine washes the flowcell only once before the imaging of 'A'. Consequently, if the cluster contains some residuals, the former 'A' base will affect the imaging of the latter 'A' base, which may cause ambiguity, so the quality score of the latter 'A' drops significantly. Although some machines adopt compound dNTPs to replace the four-step loop, residuals still affect the quality scores. Therefore, for quality score compression, the base change should be considered as side information to model the marginal probability of every quality score.
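The wash-count intuition in the 'AA' versus 'TA' example can be written down directly (a sketch that assumes a fixed 'A-C-G-T' imaging loop, which real machines may vary):

```python
LOOP = "ACGT"  # assumed imaging order of the four dNTPs

def washes_between(prev_base, next_base):
    """Number of wash/imaging steps from prev_base's slot to next_base's
    slot in the A-C-G-T loop: 'A' followed by 'A' needs a full loop of 4,
    while 'T' followed by 'A' needs only 1."""
    d = (LOOP.index(next_base) - LOOP.index(prev_base)) % len(LOOP)
    return d if d != 0 else len(LOOP)

print(washes_between("A", "A"))  # 4: flowcell washed four times for '...AA...'
print(washes_between("T", "A"))  # 1: washed only once for '...TA...'
```

A base-change feature like this could serve as one dimension of the side information mentioned above, since fewer washes leave more residual signal from the previous base.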
The local position on the chip affects the distribution of the quality scores. The flowcell can be regarded as a 2D array in which every cluster corresponds to an entry. If the fluorescence of one entry has a high amplitude, it may diffuse to the adjacent entries of the array, the well-known "cross-talk" phenomenon [11]. In other words, there are spatial correlations among adjacent quality scores. However, the stored FASTQ file is a 1D stack of all the reads, which ignores this correlation. Therefore, the compression of quality scores should mine the potential 2D spatial correlations among the reads.
Fig. 3: Distribution of quality scores of a FASTQ file

Method
In this section, we discuss the proposed adaptive coding order (ACO) based quality score compression technique. The two contributions of ACO are: 1) an adaptive scan order that replaces the traditional raster scan order and forms a more stationary signal; 2) a compound context model that considers the influence of base changes while exploring the potential correlations among quality scores.

Traverse the quality score along the most correlative directions

As can be seen from Fig. 3, as the read length increases, the column mean decreases while the variance becomes larger, which indicates a strong correlation between columns. At the same time, the reduction of the column mean is consistent with the actual sequencing process, in which the quality scores have a descending trend along a single read. It has been verified that changing the scan order can improve the performance of the adaptive arithmetic encoder, and coding along the more stationary direction of the signal yields a better coding result.
All compression methods based on arithmetic coding traverse the data with the scan order in Fig. 4(a). Under this scan order, the quality scores are encoded line by line; after scanning one line, the encoder restarts from the beginning of the next line. Obviously, after encoding the last character of one line, jumping to the first character of the next line causes a large discontinuity, which makes the transition between signals unstable. We therefore use an adaptive scan order to replace the traditional raster scan order so as to traverse the signal smoothly, as shown in Fig. 4(b). The traversal starts from the first element and moves down the column until the end of the column, then traverses upward from the end of the next column. Different from the traditional scanning method, ACO adopts a snake-shaped scan. The reason for the snake traversal is to make the transition between columns smoother: the symbols at the end of one column are more relevant to the symbols at the end of the next column, and the correlation between the red and green symbols is obviously stronger than that between the red and blue symbols. Therefore, after the red symbol is encoded, it is more appropriate to continue the second column from the green symbol than from the blue symbol. By changing the scan order, the encoding proceeds in a more stationary direction, and the probability update mechanism of the adaptive arithmetic encoder is fully utilized without introducing other factors.
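The snake traversal described above can be sketched as a small index generator (an illustrative reading of Fig. 4(b), not code from the ACO release):

```python
def snake_order(n_reads, read_len):
    """Yield (row, column) indices that traverse the quality-score matrix
    column by column, reversing direction on every other column, so that
    the last symbol of one column is adjacent to the first symbol coded
    in the next column."""
    for col in range(read_len):
        rows = range(n_reads) if col % 2 == 0 else range(n_reads - 1, -1, -1)
        for row in rows:
            yield row, col

# a 3x3 toy matrix: down column 0, up column 1, down column 2
print(list(snake_order(3, 3)))
```

Feeding symbols to the arithmetic coder in this order avoids the end-of-line jump of raster scanning, since consecutive indices always differ by one step in the matrix.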
Fig. 4: Comparison of traditional scanning and ACO scanning: (a) traditional traversal method; (b) adaptive scan order

Compound context modeling
As the Declaration section explains, the compression of quality scores should mine the potential 2D spatial correlations among the reads, so we use a compound context model to express the extra relevance in the quality score data. The ACO context model contains two additional aspects, the first of which is the global average of each read. From the example in the Declaration, it can be seen that adjusting the data order so that similar symbols cluster together yields good compression results. As shown in Fig. 1, the distribution curves of the four reads are very similar; only some singular points differ. Clustering and coding rows with similar distributions is thus an improved strategy, but computing the distribution of each row takes many steps, and clustering similar rows also costs time and space. Instead, we compute the mean of each row to reflect its distribution, and group rows with the same mean value. For row information, the mean is a measure of stationarity: rows with the same mean can be approximately regarded as having basically the same distribution over the whole row, although some singular points may make the distribution curves not coincide completely. Compared with calculating the Kullback-Leibler divergence between rows, using the row mean saves a lot of computation and time without discarding the correlation between rows. The row clustering method needs to transmit extra information to the decoder to record the change of row order; facing the same problem, using the mean also requires transmitting the mean of each row to the decoder. In practice, we compare the extra coding cost with the actual gain brought by the row mean; when the gain outweighs the cost, we add the row mean information to the context.
Specifically, building the context model raises the problem of context dilution, so we need to design a suitable quantization method for the mean value to alleviate it. This is a dynamic programming problem, and the optimization objective of quantizing a discretely distributed random variable is to minimize the distortion. The expression for the objective is

min_q D = Σ_i p(x_i) d(x_i, q(x_i)),

where x_i is a value of X that has nonzero probability, q(x_i) is the quantization value of x_i, and d(·,·) is a specific distortion measure. We can define a condition set to indicate that each specific x_i corresponds to a specific quantization value. Define a quantized set
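As a concrete instance of this objective (under assumed choices the text leaves open: squared-error distortion, a fixed number k of contiguous bins, and bin means as quantization values), the minimization can be solved by dynamic programming over bin boundaries:

```python
def quantizer_distortion(values, probs, k):
    """Minimal expected distortion when partitioning the sorted values of a
    discrete random variable into k contiguous quantization bins, with
    squared error as d(.,.) and each bin represented by its probability-
    weighted mean."""
    n = len(values)

    def bin_cost(i, j):  # distortion of one bin covering values[i:j]
        p = sum(probs[i:j])
        if p == 0:
            return 0.0
        mean = sum(probs[t] * values[t] for t in range(i, j)) / p
        return sum(probs[t] * (values[t] - mean) ** 2 for t in range(i, j))

    INF = float("inf")
    dp = [[INF] * (k + 1) for _ in range(n + 1)]  # dp[j][b]: first j values, b bins
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for b in range(1, k + 1):
            for i in range(b - 1, j):  # last bin covers values[i:j]
                c = dp[i][b - 1] + bin_cost(i, j)
                if c < dp[j][b]:
                    dp[j][b] = c
    return dp[n][k]

values = [0, 1, 2, 10, 11, 12]
probs = [1 / 6] * 6
print(quantizer_distortion(values, probs, 2))  # optimal split {0,1,2} | {10,11,12}
```

With more bins the distortion shrinks but the context table grows, which is exactly the context-dilution trade-off the quantizer is meant to balance.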