| Value | Type | Name |
| :--- | :--- | :--- |
| 0x7 | itf8 | codec id |
| 0x2 | itf8 | number of bytes to follow |
| 0x0 | itf8 | offset |
| 0x1 | itf8 | K parameter |
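
The four bytes in the table above (`0x07 0x02 0x00 0x01`) can be decoded with a small ITF8 reader. The following is a minimal Python sketch, not a reference implementation. It assumes the ITF8 layout defined elsewhere in the specification (the number of leading set bits in the first byte gives the number of extra bytes, and the fifth byte contributes only its low 4 bits). The function names `read_itf8` and `parse_subexp_encoding` are hypothetical.

```python
from typing import Dict, Tuple


def read_itf8(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one ITF8 integer at buf[pos]; return (value, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:        # 0xxxxxxx: 1 byte
        value, size = b0, 1
    elif b0 < 0xC0:      # 10xxxxxx: 2 bytes
        value, size = ((b0 & 0x3F) << 8) | buf[pos + 1], 2
    elif b0 < 0xE0:      # 110xxxxx: 3 bytes
        value = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        size = 3
    elif b0 < 0xF0:      # 1110xxxx: 4 bytes
        value = (((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
                 | (buf[pos + 2] << 8) | buf[pos + 3])
        size = 4
    else:                # 1111xxxx: 5 bytes, last byte gives only 4 bits
        value = (((b0 & 0x0F) << 28) | (buf[pos + 1] << 20)
                 | (buf[pos + 2] << 12) | (buf[pos + 3] << 4)
                 | (buf[pos + 4] & 0x0F))
        size = 5
    if value >= 1 << 31:  # reinterpret as a signed 32-bit integer
        value -= 1 << 32
    return value, pos + size


def parse_subexp_encoding(buf: bytes) -> Dict[str, int]:
    """Parse codec id, parameter length, then the SUBEXP offset and K."""
    codec_id, pos = read_itf8(buf, 0)
    n_bytes, pos = read_itf8(buf, pos)
    params = buf[pos:pos + n_bytes]
    offset, p = read_itf8(params, 0)
    k, _ = read_itf8(params, p)
    return {"codec_id": codec_id, "offset": offset, "k": k}
```

Applied to the example bytes, this yields codec id 7 (SUBEXP), offset 0 and K = 1.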

| size in bytes | N | key 1 | value 1 | key... | value ... | key N | value N |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

| Codec | ID | Parameters | Comment |
| :--- | :--- | :--- | :--- |
| NULL | 0 | none | series not preserved |
| EXTERNAL | 1 | int block content id | the block content identifier used to associate external data blocks with data series |
| Deprecated (GOLOMB) | 2 | int offset, int M | Golomb coding |
| HUFFMAN | 3 | array<int>, array<int> | coding with int/byte values |
| BYTE_ARRAY_LEN | 4 | encoding<int> array length, encoding<byte> bytes | coding of byte arrays with array length |
| BYTE_ARRAY_STOP | 5 | byte stop, int external block content id | coding of byte arrays with a stop value |
| BETA | 6 | int offset, int number of bits | binary coding |
| SUBEXP | 7 | int offset, int K | subexponential coding |
| Deprecated (GOLOMB_RICE) | 8 | int offset, int $\log_2 m$ | Golomb-Rice coding |
| GAMMA | 9 | int offset | Elias gamma coding |
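
See section 13 for a more detailed description of all the above coding algorithms and their parameters.

4 Checksums

Checksums are used to ensure data integrity. The following checksum algorithms are used in CRAM.

4.1 CRC32

This is a 32-bit cyclic redundancy check using the polynomial 0x04C11DB7. See ITU-T V.42 for more details. The value of the CRC32 hash function is written as an integer.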
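
As an illustration, the checksum can be computed with Python's standard `zlib.crc32`, which implements the common CRC-32 based on this polynomial. This is a sketch under two assumptions that this section does not state: the integer is stored as 4 bytes in little-endian order, and the helper names `crc32_field` and `crc32_matches` are hypothetical.

```python
import struct
import zlib


def crc32_field(preceding: bytes) -> bytes:
    """Return the 4-byte CRC32 field for the given preceding bytes.

    Assumption: the value is serialised as a little-endian 32-bit integer.
    """
    checksum = zlib.crc32(preceding) & 0xFFFFFFFF
    return struct.pack("<I", checksum)


def crc32_matches(preceding: bytes, stored: bytes) -> bool:
    """Check a stored 4-byte CRC32 field against the bytes it covers."""
    return crc32_field(preceding) == stored
```

A decoder would call `crc32_matches` with all bytes of the structure that precede the checksum field, plus the 4 stored bytes.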
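
Containers consist of one or more blocks. The first container, called the CRAM header container, stores the textual header described in the SAM specification (see section 7.1). This container may carry additional padding bytes so that the SAM header can be rewritten in place when its size changes slightly. The content of these padding bytes is undefined, but we recommend filling them with zeros. The padding can take the form of explicit uncompressed block structures, or of unallocated extra space, in which case the container size is larger than the combined size of the blocks it holds.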

| Data type | Name | Value |
| :--- | :--- | :--- |
| byte[4] | format magic number | CRAM (0x43 0x52 0x41 0x4d) |
| unsigned byte | major format number | 3 (0x3) |
| unsigned byte | minor format number | 1 (0x1) |
| byte[20] | file id | CRAM file identifier (e.g. file name or SHA1 checksum) |
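
The 26-byte layout in the table above maps directly onto a fixed `struct` format. The sketch below is illustrative only; the `FileDefinition` type and `parse_file_definition` function are hypothetical names, not part of any CRAM library.

```python
import struct
from typing import NamedTuple


class FileDefinition(NamedTuple):
    major: int
    minor: int
    file_id: bytes


def parse_file_definition(data: bytes) -> FileDefinition:
    """Parse the magic number, version bytes and 20-byte file id."""
    if len(data) < 26:
        raise ValueError("file definition needs 26 bytes")
    # 4-byte magic, two unsigned bytes, 20-byte identifier.
    magic, major, minor, file_id = struct.unpack("<4sBB20s", data[:26])
    if magic != b"CRAM":
        raise ValueError("not a CRAM file: bad magic number")
    return FileDefinition(major, minor, file_id)
```

For example, `parse_file_definition(b"CRAM" + bytes([3, 1]) + bytes(20))` returns major version 3 and minor version 1.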

| Data type | Name | Value |
| :--- | :--- | :--- |
| int32 | length | the sum of the lengths of all blocks in this container (headers and data) and any padding bytes (CRAM header container only); equal to the total byte length of the container minus the byte length of this header structure |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> All slices in this container must have a reference sequence id matching this value. |
| itf8 | starting position on the reference | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | number of records in the container |
| ltf8 | record counter | 1-based sequential index of records in the file/stream |
| ltf8 | bases | number of read bases |
| itf8 | number of blocks | the total number of blocks in this container |
| array<itf8> | landmarks | the locations of slices in this container as byte offsets from the end of this container header, used for random access indexing. For sequence data containers, the landmark count must equal the slice count. <br> Since the block before the first slice is the compression header, landmarks[0] is equal to the byte length of the compression header. |
| int | crc32 | CRC32 hash of all the preceding bytes in the container |
| byte[ ] | blocks | the blocks contained within the container |
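
Because landmarks are offsets from the end of the container header, turning them into absolute file offsets is a single addition. The sketch below assumes the caller already knows the file offset at which the container header ends; `header_end` and both function names are hypothetical.

```python
from typing import List


def slice_file_offsets(header_end: int, landmarks: List[int]) -> List[int]:
    """Absolute offset of each slice, given where the container header ends."""
    return [header_end + landmark for landmark in landmarks]


def compression_header_length(landmarks: List[int]) -> int:
    """landmarks[0] is the byte length of the compression header block."""
    if not landmarks:
        raise ValueError("container has no slices")
    return landmarks[0]
```

A random-access reader can seek straight to `slice_file_offsets(...)[i]` to decode slice `i` without reading the slices before it.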

| Data type | Name | Value |
| :--- | :--- | :--- |
| byte | method | the block compression method (and first CRAM version): <br> 0: raw (none)* <br> 1: gzip <br> 2: bzip2 (v2.0) <br> 3: lzma (v3.0) <br> 4: rans4x8 (v3.0) <br> 5: rans4x16 (v3.1) <br> 6: adaptive arithmetic coder (v3.1) <br> 7: fqzcomp (v3.1) <br> 8: name tokeniser (v3.1) |
| byte | block content type id | the block content type identifier |
| itf8 | block content id | the block content identifier used to associate external data blocks with data series |
| itf8 | size in bytes* | size of the block data after applying block compression |
| itf8 | raw size in bytes* | size of the block data before applying block compression |
| byte[ ] | block data | the data stored in the block: <br> • bit stream of CRAM records (core data block) <br> • byte stream (external data block) <br> • additional fields (header blocks) |
| byte[4] | CRC32 | CRC32 hash value for all preceding bytes in the block |

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| RN | bool | read names included | true if read names are preserved for all reads |
| AP | bool | AP data series delta | true if AP data series is delta, false otherwise |
| RR | bool | reference required | true if reference sequence is required to restore the data completely |
| SM | byte[5] | substitution matrix | substitution matrix |
| TD | array<byte> | tag ids dictionary | a list of lists of tag ids, see tag encoding section |

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| BF | encoding<int> | BAM bit flags | see separate section |
| CF | encoding<int> | CRAM bit flags | see specific section |
| RI | encoding<int> | reference id | record reference id from the SAM file header |
| RL | encoding<int> | read lengths | read lengths |
| AP | encoding<int> | in-seq positions | if AP-Delta = true: 0-based alignment start delta from the AP value in the previous record. Note this delta may be negative, for example when switching references in a multi-reference slice. When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices). <br> if AP-Delta = false: encodes the alignment start position directly |
| RG | encoding<int> | read groups | read groups. Special value '-1' stands for no group. |
| RN$^{a}$ | encoding<byte[ ]> | read names | read names |
| MF | encoding<int> | next mate bit flags | see specific section |
| NS | encoding<int> | next fragment reference sequence id | reference sequence ids for the next fragment |
| NP | encoding<int> | next mate alignment start | alignment positions for the next fragment |
| TS | encoding<int> | template size | template sizes |
| NF | encoding<int> | distance to next fragment | number of records to skip to the next fragment$^{b}$ |
| TL$^{c}$ | encoding<int> | tag ids | list of tag ids, see tag encoding section |
| FN | encoding<int> | number of read features | number of read features in each record |
| FC | encoding<byte> | read features codes | see separate section |
| FP | encoding<int> | in-read positions | positions of the read features; a positive delta to the last position (starting with zero) |
| DL | encoding<int> | deletion lengths | base-pair deletion lengths |
| BB | encoding<byte[ ]> | stretches of bases | bases |
| QQ | encoding<byte[ ]> | stretches of quality scores | quality scores |
| BS | encoding<byte> | base substitution codes | base substitution codes |
| IN | encoding<byte[ ]> | insertion | inserted bases |
| RS | encoding<int> | reference skip length | number of skipped bases for the 'N' read feature |
| PD | encoding<int> | padding | number of padded bases |
| HC | encoding<int> | hard clip | number of hard clipped bases |
| SC | encoding<byte[ ]> | soft clip | soft clipped bases |
| MQ | encoding<int> | mapping qualities | mapping quality scores |
| BA | encoding<byte> | bases | bases |
| QS | encoding<byte> | quality scores | quality scores |
| TC$^{d}$ | N/A | legacy field | to be ignored |
| TN$^{d}$ | N/A | legacy field | to be ignored |

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| TAG ID 1:TAG TYPE 1 | encoding<byte[ ]> | read tag 1 | tag values (names and types are available in the data series code) |
| ... | | ... | ... |
| TAG ID N:TAG TYPE N | encoding<byte[ ]> | read tag N | ... |
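
Slices with the multiple-reference flag (-2) set as the sequence id in the header may contain reads mapped to multiple external references, including unmapped$^{3}$ reads (placed on these references or unplaced), but multiple embedded references cannot be combined in this way. When multiple references are used, the RI data series is used to determine the reference sequence id for each record. This data series is not present when only a single reference is used within a slice.

The unmapped (-1) sequence id in the header is for slices containing only unplaced unmapped$^{3}$ reads.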

| Data type | Name | Value |
| :--- | :--- | :--- |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> This value must match that of its enclosing container. |
| itf8 | alignment start | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | the number of records in the slice |
| ltf8 | record counter | 1-based sequential index of records in the file/stream |
| itf8 | number of blocks | the number of blocks in the slice |
| itf8[ ] | block content ids | block content ids of the blocks in the slice |
| itf8 | embedded reference bases block content id | block content id for the embedded reference sequence bases or -1 for none |
| byte[16] | reference md5 | MD5 checksum of the reference bases within the slice boundaries. If this slice has reference sequence id of -1 (unmapped) or -2 (multi-ref) the MD5 should be 16 bytes of \0. For embedded references, the MD5 can either be all-zeros or the MD5 of the embedded sequence. |
| byte[ ] | optional tags | a series of tag,type,value tuples encoded as per BAM auxiliary fields |
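
The alignment start and alignment span values are only used in decoding if the slice has mapped data aligned to a single reference (reference sequence id >= 0). For multi-reference slices or those with unmapped data, it is recommended to fill these fields with the value 0.

The MD5 checksum should not be validated if the stored checksum is all-zero. Embedded references should follow the same capitalisation and alphabet rules as applied to external references prior to MD5 checksum calculation. If an embedded reference is used, it is not required to match exactly the reference used for sequence alignment. For example, it may contain 'N' bases where coverage is absent, or have different base calls for SNP variants. Hence, when an embedded sequence is used, the MD5 checksum refers to the checksum of the embedded sequence and should not be validated against any external reference file.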
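
The validation rule above can be sketched as follows. This is an illustrative helper, not a prescribed implementation. It assumes `reference_bases` already holds the bases within the slice boundaries (external or embedded, as appropriate) after the capitalisation and alphabet normalisation has been applied; `check_reference_md5` is a hypothetical name.

```python
import hashlib
from typing import Optional


def check_reference_md5(stored_md5: bytes, reference_bases: bytes) -> Optional[bool]:
    """Return None if no validation applies, otherwise whether the MD5 matches."""
    if len(stored_md5) != 16:
        raise ValueError("reference md5 field must be 16 bytes")
    if stored_md5 == bytes(16):
        # All-zero checksum: unmapped, multi-ref, or deliberately unset.
        return None
    return hashlib.md5(reference_bases).digest() == stored_md5
```

Returning `None` rather than `True` for the all-zero case keeps "not checked" distinguishable from "checked and valid".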

| Data type | Name | Value |
| :--- | :--- | :--- |
| bit[ ] | CRAM record 1 | The first CRAM record |
| ... | ... | ... |
| bit[ ] | CRAM record N | The Nth CRAM record |

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | BF | BAM bit flags | see BAM bit flags below |
| int | CF | CRAM bit flags | see CRAM bit flags below |
| - | - | Positional data | See section 10.2 |
| - | - | Read names | See section 10.3 |
| - | - | Mate records | See section 10.4 |
| - | - | Auxiliary tags | See section 10.5 |
| - | - | Sequences | See sections 10.6 and 10.7 |
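
BAM bit flags (BF data series)

The following flags are duplicated from the SAM and BAM specifications, with identical meaning. Note, however, that some of these flags can be derived during decoding. They may therefore be omitted from the CRAM file, with the bits computed from both reads of a paired-end library residing within the same slice.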

| Bit flag | Comment | Description |
| :--- | :--- | :--- |
| 0x1 | | template having multiple segments in sequencing |
| 0x2 | | each segment properly aligned according to the aligner |
| 0x4 | | segment unmapped$^{a}$ |
| 0x8 | calculated$^{b}$ or stored in the mate's info | next segment in template unmapped |
| 0x10 | | SEQ being reverse complemented |
| 0x20 | calculated$^{b}$ or stored in the mate's info | SEQ of the next segment in the template being reverse complemented |
| 0x40 | | the first segment in the template$^{c}$ |
| 0x80 | | the last segment in the template$^{c}$ |
| 0x100 | | secondary alignment |
| 0x200 | | not passing quality controls |
| 0x400 | | PCR or optical duplicate |
| 0x800 | | supplementary alignment |
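
$^{a}$ Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about bits 0x2, 0x100 and 0x800.

$^{b}$ For segments within the same slice.

$^{c}$ Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used, which may be independent of the actual mapping orientation. If 0x40 and 0x80 are both set, the read is part of a linear template (one whose template sequence is expected to be in linear order), but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template (such as one constructed by stitching together other templates) or when this information was lost during data processing.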
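
The interpretation rules in notes a and c can be expressed as two small predicates. This is a sketch only; the function names and the returned labels are hypothetical.

```python
def is_unmapped(bf: int) -> bool:
    """Bit 0x4 is the only reliable indicator that a read is unmapped."""
    return bool(bf & 0x4)


def template_position(bf: int) -> str:
    """Classify a read's position in its template from bits 0x40 and 0x80."""
    first = bool(bf & 0x40)
    last = bool(bf & 0x80)
    if first and last:
        return "middle"   # part of a linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"      # non-linear template or information lost
```

Note that when `is_unmapped` is true, a consumer should not draw conclusions from bits 0x2, 0x100 or 0x800.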

| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | quality scores stored as array | quality scores can be stored as read features or as an array similar to read bases. |
| 0x2 | detached | mate information is stored verbatim (e.g. because the pair spans multiple slices or the fields differ to the CRAM computed method) |
| 0x4 | has mate downstream | tells if the next segment should be expected further in the stream |
| 0x8 | decode sequence as "*" | informs the decoder that the sequence is unknown and that any encoded reference differences are present only to recreate the CIGAR string. |
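
The four CRAM bit flags in the table above fit naturally into an `enum.IntFlag`. The member names below are hypothetical labels chosen for this sketch, not identifiers defined by the specification.

```python
from enum import IntFlag
from typing import List


class CramFlag(IntFlag):
    QUALITY_AS_ARRAY = 0x1      # quality scores stored as array
    DETACHED = 0x2              # mate information stored verbatim
    HAS_MATE_DOWNSTREAM = 0x4   # next segment expected further in the stream
    SEQUENCE_UNKNOWN = 0x8      # decode sequence as "*"


def describe_cram_flags(cf: int) -> List[str]:
    """List the names of the flags set in a CF value, in bit order."""
    flags = CramFlag(cf)
    return [member.name for member in CramFlag if member in flags]
```

For instance, a CF value of 0x3 describes a detached record whose quality scores are stored as an array.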

procedure DecodeRecord
    BAM_flags ← ReadItem(BF, Integer)
    CRAM_flags ← ReadItem(CF, Integer)
    DecodePositions                  ▷ See section 10.2
    DecodeNames                      ▷ See section 10.3
    DecodeMateData                   ▷ See section 10.4
    DecodeTagData                    ▷ See section 10.5
    if (BF AND 4) = 0 then           ▷ Unmapped flag
        DecodeMappedRead             ▷ See section 10.6
    else
        DecodeUnmappedRead           ▷ See section 10.7
    end if
end procedure
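
For reads stored in a position-sorted slice, the AP-delta flag in the compression header preservation map should be set, and the AP data series will be delta encoded, using the slice alignment-start value as the first position to delta against. Note that for multi-reference slices this may mean that the AP series includes negative values, such as when moving from an alignment at the end of one reference sequence to the start of the next, or to unmapped unplaced data. When the AP-delta flag is not set, the AP data series is stored as a normal integer value.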
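
The delta rule described above amounts to a running sum seeded with the slice alignment start. A minimal Python sketch follows, assuming the AP values of one slice have already been read into a list; `decode_alignment_starts` is a hypothetical name.

```python
from typing import Iterable, List


def decode_alignment_starts(ap_values: Iterable[int],
                            ap_delta: bool,
                            slice_alignment_start: int) -> List[int]:
    """Turn the AP data series of one slice into alignment start positions."""
    if not ap_delta:
        # AP-delta unset: values are the positions themselves.
        return list(ap_values)
    positions = []
    last_position = slice_alignment_start  # seed for the first record
    for delta in ap_values:
        last_position += delta             # delta may be negative
        positions.append(last_position)
    return positions
```

With a slice alignment start of 100 and AP values `[0, 5, 10]`, the decoded positions are 100, 105 and 115.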
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | RI | ref id | reference sequence id (only present in multiref slices) |
| int | RL | read length | the length of the read |
| int | AP | alignment start | the alignment start position |
| int | RG | read group | the read group identifier expressed as the N-th record in the header, starting from 0 with -1 for no group |
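The AP delta scheme described above can be sketched in Python as follows. This is a minimal illustration, assuming the raw AP values have already been read from the data series; the function and argument names are hypothetical and not part of the specification.

```python
def decode_alignment_positions(ap_values, slice_alignment_start, ap_delta):
    """Turn raw AP data series values into absolute alignment positions.

    ap_values: integers as read from the AP data series, in record order.
    slice_alignment_start: alignment start taken from the slice header.
    ap_delta: True when the AP-delta flag is set in the preservation map.
    """
    if not ap_delta:
        # Values are stored as plain integers.
        return list(ap_values)
    positions = []
    last_position = slice_alignment_start  # first delta is against the slice start
    for delta in ap_values:
        last_position += delta             # deltas may be negative in multi-ref slices
        positions.append(last_position)
    return positions
```

With a slice alignment start of 100 and stored deltas `[0, 5, 10, -3]`, the records would decode to positions 100, 105, 115 and 112.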
procedure DECODEPOSITIONS
    if slice_header.reference_sequence_id = -2 then
        reference_id ← READITEM(RI, Integer)
    else
        reference_id ← slice_header.reference_sequence_id
    end if
    read_length ← READITEM(RL, Integer)
    if container_pmap.AP_delta ≠ 0 then
        if first_record_in_slice then
            last_position ← slice_header.alignment_start
        end if
        alignment_position ← READITEM(AP, Integer) + last_position
        last_position ← alignment_position
    else
        alignment_position ← READITEM(AP, Integer)
    end if
    read_group ← READITEM(RG, Integer)
end procedure

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[] | RN | read names | read names |

procedure DECODENAMES
    if container_pmap.read_names_included = 1 then
        read_name ← READITEM(RN, Byte[])
    else
        read_name ← GENERATENAME
    end if
end procedure

| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | NF | the number of records to skip to the next fragment |
In the above case, the NS (mate reference name), NP (mate position) and TS (template size) fields for both records should be derived once the mate has also been decoded. Mate reference name and position are obvious and are simply copied from the mate. The template size is computed using the method described in the SAM specification: the inclusive distance from the leftmost to the rightmost mapped bases, with the sign being positive for the leftmost record and negative for the rightmost record.

| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | MF | next mate bit flags, see table below |
| byte[] | RN | the read name (if and only if not known already) |
| int | NS | mate reference sequence identifier |
| int | NP | mate alignment start position |
| int | TS | the size of the template (insert size) |
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | mate negative strand bit | the bit is set if the mate is on the negative strand |
| 0x2 | mate unmapped bit | the bit is set if the mate is unmapped |
procedure DECODEMATEDATA
    if CF AND 2 then                                        ▷ Detached from mate
        mate_flags ← READITEM(MF, Integer)
        if mate_flags AND 1 then
            bam_flags ← bam_flags OR 0x20                   ▷ Mate is reverse-complemented
        end if
        if mate_flags AND 2 then
            bam_flags ← bam_flags OR 0x08                   ▷ Mate is unmapped
        end if
        if container_pmap.read_names_included ≠ 1 then
            read_name ← READITEM(RN, Byte[])
        end if
        mate_ref_id ← READITEM(NS, Integer)
        mate_position ← READITEM(NP, Integer)
        template_size ← READITEM(TS, Integer)
    else if CF AND 4 then                                   ▷ Mate is downstream
        if next_frag.bam_flags AND 0x10 then
            this.bam_flags ← this.bam_flags OR 0x20         ▷ next segment reverse complemented
        end if
        if next_frag.bam_flags AND 0x04 then
            this.bam_flags ← this.bam_flags OR 0x08         ▷ next segment unmapped
        end if
        next_frag ← READITEM(NF, Integer)
        next_record ← this_record + next_frag + 1
        Resolve mate_ref_id for this_record and next_record once both have been decoded
        Resolve mate_position for this_record and next_record once both have been decoded
        Find leftmost and rightmost mapped coordinate in records this_record and next_record.
        For leftmost of this_record and next_record: template_size ← rightmost - leftmost + 1
        For rightmost of this_record and next_record: template_size ← -(rightmost - leftmost + 1)
    end if
end procedure
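Note that, as with the SAM specification, a template may be permitted to have more than two alignment records. In this case the "mate" of each record is considered to be the next record, with the mate of the last record being the first, forming a circular list. The above algorithm is a simplification that does not deal with this scenario. The full method needs to observe when record this + NF is also labelled as having an additional mate downstream. One recommended approach is to resolve the mate information in a second pass, once the entire slice has been decoded. The final segment in the mate chain needs to set bam_flags fields 0x20 and 0x08 accordingly, based on the first segment. This is also not listed in the above algorithm, for brevity.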
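The template size rule for the simple two-record case (inclusive distance between the leftmost and rightmost mapped bases, positive for the leftmost record and negative for the rightmost) can be sketched in Python. The function name, its arguments and the tie-breaking choice for mates that start at the same position are assumptions made for illustration and are not defined by this section.

```python
def template_sizes(start_a, end_a, start_b, end_b):
    """Return (template_size_a, template_size_b) for two mapped mates.

    Coordinates are inclusive alignment start/end positions on the same
    reference. The leftmost record receives the positive size and the
    rightmost record the negative size.
    """
    leftmost = min(start_a, start_b)
    rightmost = max(end_a, end_b)
    span = rightmost - leftmost + 1
    # Assumption: when both records start at the same position,
    # record A is treated as the leftmost one.
    if start_a <= start_b:
        return span, -span
    return -span, span
```

For mates aligned at 100..149 and 300..349 this gives a span of 250, reported as +250 for the first record and -250 for the second. Templates with more than two records would need the second-pass approach mentioned in the note above.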
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | TL | tag line | an index into the tag dictionary (TD) |
| * | ??? | tag name/type | 3 character key (2 tag identifier and 1 tag type), as specified by the tag dictionary |
procedure DECODETAGDATA
    tag_line ← READITEM(TL, Integer)
    for all ele ∈ container_pmap.tag_dict(tag_line) do
        name ← first two characters of ele
        tag(type) ← last character of ele
        tag(name) ← READITEM(ele, Byte[])
    end for
end procedure
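The tag decoding loop above can be illustrated with a short Python sketch. It assumes the tag dictionary is held in memory as a list of lists of 3-character strings, and that `read_item` is a hypothetical callback standing in for READITEM that returns the raw bytes of the value for a given key; neither representation is prescribed by the specification.

```python
def decode_tag_line(tag_dict, tag_line, read_item):
    """Decode all tags of one record.

    tag_dict: list of tag lines, each a list of 3-character keys
              (2-character tag identifier followed by a 1-character type).
    tag_line: index into tag_dict, as read from the TL data series.
    read_item: hypothetical callable returning the value bytes for a key.
    """
    tags = {}
    for key in tag_dict[tag_line]:
        name, tag_type = key[:2], key[2]
        tags[name] = (tag_type, read_item(key))
    return tags
```

A record whose TL value points at the line `["NMc", "MDZ"]` would yield one entry for `NM` with type `c` and one for `MD` with type `Z`.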
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | FN | number of read features | the number of read features |
| int | FP | in-read-position $^{\mathrm{a}}$ | delta-position of the read feature |
| byte | FC | read feature code | See feature codes below |
| * | * | read feature data $^{\mathrm{a}}$ | See feature codes below |
| int | MQ | mapping qualities | mapping quality score |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
| Feature code | Id | Data series type | Data series name | Description |
| :---: | :---: | :---: | :---: | :---: |
| Bases | b (0x62) | byte[] | BB | a stretch of bases |
| Scores | q (0x71) | byte[] | QQ | a stretch of scores |
| Read base | B (0x42) | byte,byte | BA,QS | a base and associated quality score |
| Substitution | X (0x58) | byte | BS | base substitution codes, SAM operators X, M and = |
| Insertion | I (0x49) | byte[] | IN | inserted bases, SAM operator I |
| Deletion | D (0x44) | int | DL | number of deleted bases, SAM operator D |
| Insert base | i (0x69) | byte | BA | single inserted base, SAM operator I |
| Quality score | Q (0x51) | byte | QS | single quality score |
| Reference skip | N (0x4E) | int | RS | number of skipped bases, SAM operator N |
| Soft clip | S (0x53) | byte[] | SC | soft clipped bases, SAM operator S |
| Padding | P (0x50) | int | PD | number of padded bases, SAM operator P |
| Hard clip | H (0x48) | int | HC | number of hard clipped bases, SAM operator H |
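For compatibility with BAM, all base comparisons should be done in a case-insensitive manner, and all bases written to the SC, IN and BA data series should be upper case.

Base substitution codes (BS data series)

A base substitution is defined as a change from one nucleotide base (the reference base) to another (the read base), including N as an unknown or missing base. Five reference bases are supported (ACGTN), each with 4 possible substitutions. Any other base types, such as ambiguity codes, must be written verbatim using the BA data series.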
| Ref. base | BS code 0 | BS code 1 | BS code 2 | BS code 3 |
| :--- | :---: | :---: | :---: | :---: |
| A | T | C | G | N |
| C | G | A | T | N |
| G | C | T | A | N |
| T | A | G | C | N |
| N | A | C | G | T |
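Looking up a substitution code amounts to indexing into the row for the reference base. The Python sketch below assumes the code-to-base mapping shown in the table above; the names `SUBSTITUTION_TABLE`, `decode_substitution` and `encode_substitution` are hypothetical.

```python
# Rows copied from the table above: reference base -> read bases for codes 0..3.
SUBSTITUTION_TABLE = {
    "A": "TCGN",
    "C": "GATN",
    "G": "CTAN",
    "T": "AGCN",
    "N": "ACGT",
}


def decode_substitution(ref_base, bs_code):
    """Return the read base for a reference base and a BS code (0..3)."""
    return SUBSTITUTION_TABLE[ref_base.upper()][bs_code]


def encode_substitution(ref_base, read_base):
    """Return the BS code that turns ref_base into read_base."""
    return SUBSTITUTION_TABLE[ref_base.upper()].index(read_base.upper())
```

Bases are upper-cased before lookup, in line with the case-insensitive comparison rule. A reference base outside ACGTN raises `KeyError` in this sketch, reflecting that such bases must be stored via the BA data series instead.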
procedure DECODEMAPPEDREAD
    feature_number ← READITEM(FN, Integer)
    last_feature_position ← 0
    for i ← 1 to feature_number do
        DECODEFEATURE
    end for
    mapping_quality ← READITEM(MQ, Integer)
    if CF AND 1 then                              ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Integer)
        end for
    end if
end procedure

procedure DECODEFEATURE
    feature_code ← READITEM(FC, Integer)
    feature_position ← READITEM(FP, Integer) + last_feature_position
    last_feature_position ← feature_position
    if feature_code = 'B' then
        base ← READITEM(BA, Byte)
        quality_score ← READITEM(QS, Byte)
    else if feature_code = 'X' then
        substitution_code ← READITEM(BS, Byte)
    else if feature_code = 'I' then
        inserted_bases ← READITEM(IN, Byte[])
    else if feature_code = 'S' then
        softclip_bases ← READITEM(SC, Byte[])
    else if feature_code = 'H' then
        hardclip_length ← READITEM(HC, Integer)
    else if feature_code = 'P' then
        pad_length ← READITEM(PD, Integer)
    else if feature_code = 'D' then
        deletion_length ← READITEM(DL, Integer)
    else if feature_code = 'N' then
        ref_skip_length ← READITEM(RS, Integer)
    else if feature_code = 'i' then
        base ← READITEM(BA, Byte)
    else if feature_code = 'b' then
        bases ← READITEM(BB, Byte[])
    else if feature_code = 'q' then
        quality_scores ← READITEM(QQ, Byte[])
    else if feature_code = 'Q' then
        quality_score ← READITEM(QS, Byte)
    end if
end procedure
10.7 Unmapped reads
The CRAM record structure for unmapped reads has the following additional fields:
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[read length] | BA | bases | the read bases |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
procedure DECODEUNMAPPEDREAD
    for i ← 1 to read_length do
        base ← READITEM(BA, Byte)
    end for
    if CF AND 1 then                              ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Byte)
        end for
    end if
end procedure
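The EXTERNAL encoding simply stores the data verbatim in the external block with the given ID. If the type is Byte the data is stored as-is; otherwise, for Integer types, the data is stored in ITF8 format.

Parameters

The CRAM format defines the following parameters for the EXTERNAL encoding: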
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | external id | id of an external block containing the byte stream |
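Reading integers from an external block therefore reduces to decoding consecutive ITF8 values. The Python sketch below is illustrative only: the byte layout (the number of leading 1 bits in the first byte gives the number of following bytes, with the fifth byte contributing only its low 4 bits) is taken from the ITF8 definition elsewhere in the CRAM specification and should be checked against it, and the function names are hypothetical.

```python
def read_itf8(data, pos=0):
    """Decode one ITF8 integer from data starting at pos.

    Returns (value, next_pos); value is interpreted as a signed 32-bit integer.
    """
    b0 = data[pos]
    if b0 < 0x80:                       # 0xxxxxxx: 1 byte
        value, size = b0, 1
    elif b0 < 0xC0:                     # 10xxxxxx: 2 bytes
        value, size = ((b0 & 0x3F) << 8) | data[pos + 1], 2
    elif b0 < 0xE0:                     # 110xxxxx: 3 bytes
        value = ((b0 & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2]
        size = 3
    elif b0 < 0xF0:                     # 1110xxxx: 4 bytes
        value = (((b0 & 0x0F) << 24) | (data[pos + 1] << 16)
                 | (data[pos + 2] << 8) | data[pos + 3])
        size = 4
    else:                               # 1111xxxx: 5 bytes
        value = (((b0 & 0x0F) << 28) | (data[pos + 1] << 20)
                 | (data[pos + 2] << 12) | (data[pos + 3] << 4)
                 | (data[pos + 4] & 0x0F))
        size = 5
    if value >= 1 << 31:                # reinterpret as signed 32-bit
        value -= 1 << 32
    return value, pos + size


def read_external_ints(block, count):
    """Read count ITF8 integers from the start of an external block."""
    pos, out = 0, []
    for _ in range(count):
        value, pos = read_itf8(block, pos)
        out.append(value)
    return out
```

Byte-typed data series need no such step, since their bytes are taken from the block unchanged.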
| Symbol | Code length | Codeword |
| :--- | :--- | :--- |
| A | 1 | 0 |
| B | 3 | 100 |
| C | 3 | 101 |
| D | 3 | 110 |
| E | 4 | 1110 |
| F | 4 | 1111 |
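The codewords in the example table are consistent with canonical Huffman code assignment, in which symbols are sorted by code length and then by value, and each codeword is the previous one plus one, shifted left whenever the length increases. The Python sketch below reproduces the table under that assumption; the function name is hypothetical and every bit length is assumed to be at least 1.

```python
def canonical_codes(alphabet, bit_lengths):
    """Assign canonical Huffman codewords from symbols and their code lengths.

    Returns a dict mapping each symbol to its codeword as a string of bits.
    """
    pairs = sorted(zip(bit_lengths, alphabet))   # by length, then by symbol
    codes = {}
    code = 0
    prev_len = pairs[0][0]
    for length, symbol in pairs:
        code <<= length - prev_len               # widen when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes
```

Because the codewords follow from the lengths alone, only the alphabet and the bit lengths need to be stored.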
Parameters

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8[] | alphabet | list of all encoded symbols (values) |
| itf8[] | bit-lengths | array of bit-lengths for each symbol in the alphabet |
| Data type | Name | Comment |
| :--- | :--- | :--- |
| encoding<int> | lengths encoding | an encoding describing how the array lengths are captured |
| encoding<byte> | values encoding | an encoding describing how the values are captured |
| Data type | Name | Comment |
| :--- | :--- | :--- |
| byte | stop byte | a special byte treated as a delimiter |
| itf8 | external id | id of an external block containing the byte stream |
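With these two parameters, decoding a byte array means scanning the external block until the stop byte is found. A minimal Python sketch, assuming the block is available as a `bytes` object and using a hypothetical function name:

```python
def read_byte_array_stop(block, pos, stop_byte):
    """Read one byte array from block, starting at pos.

    The array ends just before the first occurrence of stop_byte.
    Returns (array, next_pos), where next_pos is just past the stop byte.
    Raises ValueError if the stop byte is not found.
    """
    end = block.index(stop_byte, pos)
    return bytes(block[pos:end]), end + 1
```

For instance, read names stored back to back with a zero stop byte can be pulled out one at a time by feeding the returned `next_pos` into the following call.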
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | length | the number of bits used |
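Given these parameters, a value is decoded by reading `length` bits and subtracting `offset`. The Python sketch below assumes bits are consumed most significant bit first within each byte; that bit order and the helper names are assumptions for illustration, not statements from this section.

```python
def bits_msb_first(data):
    """Yield the bits of data one at a time, most significant bit first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def decode_beta(bits, offset, length):
    """Read length bits from the bit iterator and subtract offset."""
    value = 0
    for _ in range(length):
        value = (value << 1) | next(bits)
    return value - offset
```

The offset allows a range that does not start at zero to be stored in a fixed number of bits.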
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | k | the order of the subexponential coding |
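13.7 Gamma coding: codec ID 9
Can encode values of type Integer.
Definition
The Elias gamma code is a prefix encoding of positive integers. It is a combination of unary coding and beta coding: the former is used to capture the number of bits that the beta coding needs in order to capture the value.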
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset to subtract from each value after decode |
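The combination of unary and beta coding can be sketched in Python using the standard Elias gamma layout: a run of n zero bits, a 1 bit, then n further bits of the value. The most-significant-bit-first bit order and the function names are assumptions for illustration.

```python
def bits_msb_first(data):
    """Yield the bits of data one at a time, most significant bit first."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def decode_gamma(bits, offset):
    """Decode one Elias gamma value from a bit iterator, then subtract offset."""
    zeros = 0
    while next(bits) == 0:       # unary part: count zeros up to the first 1
        zeros += 1
    value = 1                    # the terminating 1 is the top bit of the value
    for _ in range(zeros):       # beta part: the remaining low-order bits
        value = (value << 1) | next(bits)
    return value - offset
```

Since gamma codes only represent positive integers, the offset makes it possible to store zero or negative values by shifting them into the positive range before encoding.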
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is added to each value |
| itf8 | M | the Golomb parameter (number of bins) |