
Sequence Alignment/Map Format Specification

The SAM/BAM Format Specification Working Group

16 Nov 2023

Abstract

The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 346a94a from that repository, last modified on the date shown above.

1 The SAM Format Specification

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAM file may optionally specify the version being used via the @HD VN tag. For full version history see Appendix B.
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8. Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.
Where it makes a difference, SAM file contents should be read and written using the POSIX / C locale. For example, floating-point values in SAM always use '.' for the decimal-point character.
The regular expressions in this specification are written using the POSIX / IEEE Std 1003.1 extended syntax.

1.1 An example

Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment.
The corresponding SAM format is:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

1.2 Terminologies and Concepts

Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
Segment: A contiguous sequence or subsequence.
Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e., one portion of the alignment on forward strand and another portion of alignment on reverse strand). A linear alignment can be represented in a single SAM record.
Chimeric alignment: An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the "representative" alignment, and the others are called "supplementary" and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.
Read alignment: A linear alignment or a chimeric alignment that is the complete representation of the alignment of the read.
Multiple mapping: The correct placement of a read may be ambiguous, e.g., due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.
1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is [3, 7]. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is [2, 7). The BAM, BCFv2, BED, and PSL formats use the 0-based coordinate system.
Phred scale: Given a probability $0 < p \le 1$, the phred scale of $p$ equals $-10 \log_{10} p$, rounded to the closest integer.
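The coordinate-system and Phred-scale definitions above can be illustrated with a short Python sketch. The function names are hypothetical and are not part of this specification. Python's built-in round() is assumed to be an acceptable reading of "rounded to the closest integer"; the specification does not state how exact ties are resolved.

```python
import math


def phred(p: float) -> int:
    """Phred-scale a probability 0 < p <= 1 as -10*log10(p), rounded."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    return round(-10.0 * math.log10(p))


def one_based_closed_to_zero_based_half_open(start: int, end: int):
    """Convert a 1-based closed interval [start, end] to 0-based [start-1, end)."""
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start: int, end: int):
    """Convert a 0-based half-open interval [start, end) to 1-based [start+1, end]."""
    return start + 1, end
```

For the region between the 3rd and 7th bases, the first conversion maps (3, 7) to (2, 7), and the second maps it back.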

1.2.1 Character set restrictions

Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF. To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
Query or read names may contain any printable ASCII characters in the range [!-~] apart from '@', so that SAM alignment lines can be easily distinguished from header lines. (They are also limited in length.)
Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets (i.e., apart from '\,"`'()[]{}<>') and may not start with '*' or '='.
Thus they match the following regular expression:

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

For clarity, elsewhere in this specification we write this set of allowed characters as a character class [:rname:] and extend the POSIX regular expression notation to use ∧ to indicate the omission of '*' and '=' from the character class. Thus this regular expression can be written more clearly as [:rname:∧*=][:rname:]*.
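As a minimal sketch, the reference-name rule can be checked with Python's re module. The pattern below is the expanded form of [:rname:∧*=][:rname:]* written out character by character; the helper name is_valid_rname is an assumption made for illustration.

```python
import re

# First character: the rname set without '*' and '='; remaining characters: the full rname set.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_rname(name: str) -> bool:
    """Return True if the whole string is a permitted reference sequence name."""
    return RNAME_RE.fullmatch(name) is not None
```

Under this check, names such as "chr1" or "HLA-A*01:01" pass, while "*chr1", "=chr1" and "chr1,chr2" are rejected.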

1.3 The header section

Each header line begins with the character '@' followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format 'TAG:VALUE' where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.
The following table describes the header record types that may be used and their predefined tags. Tags listed with '*' are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.
Tag Description
@HD File-level metadata. Optional. If present, there must be only one @HD line and it must be the first line of the file.
VN* Format version. Accepted format: /^[0-9]+\.[0-9]+$/.
SO Sorting order of alignments. Valid values: unknown (default), unsorted, queryname and coordinate. For coordinate sort, the major sort key is the RNAME field, with order defined by the order of @SQ lines in the header. The minor sort key is the POS field. For alignments with equal RNAME and POS, order is arbitrary. All alignments with '*' in RNAME field follow alignments with some other value but otherwise are in arbitrary order. For queryname sort, no explicit requirement is made regarding the ordering other than that it be applied consistently throughout the entire file.
GO Grouping of alignments, indicating that similar alignment records are grouped together but the file is not necessarily sorted overall. Valid values: none (default), query (alignments are grouped by QNAME), and reference (alignments are grouped by RNAME/POS).
SS Sub-sorting order of alignments. Valid values are of the form sort-order:sub-sort, where sort-order is the same value stored in the SO tag and sub-sort is an implementation-dependent colon-separated string further describing the sort order, but with some predefined terms defined in Section 1.3.1. For example, if an algorithm relies on a coordinate sort that, at each coordinate, is further sorted by query name then the header could contain @HD SO:coordinate SS:coordinate:queryname. If the primary sort is not one of the predefined primary sort orders, then unsorted should be used and the sub-sort is effectively the major sort. For example, if sorted by an auxiliary tag MI then by coordinate then the header could contain @HD SO:unsorted SS:unsorted:MI:coordinate. Regular expression: (coordinate|queryname|unsorted)(:[A-Za-z0-9_-]+)+
@SQ Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
SN* Reference sequence name. The SN tags and all individual AN names in all @SQ lines must be distinct. The value of this field is used in the alignment records in RNAME and RNEXT fields. Regular expression: [:rname:∧*=][:rname:]*
LN* Reference sequence length. Range: $[1, 2^{31}-1]$
AH Indicates that this sequence is an alternate locus. The value is the locus in the primary assembly for which this sequence is an alternative, in the format 'chr:start-end', 'chr' (if known), or '*' (if unknown), where 'chr' is a sequence in the primary assembly. Must not be present on sequences in the primary assembly.
AN Alternative reference sequence names. A comma-separated list of alternative names that tools may use when referring to this reference sequence. These alternative names are not used elsewhere within the SAM file; in particular, they must not appear in alignment records' RNAME or RNEXT fields. Regular expression: name(,name)* where name is [:rname:∧*=][:rname:]*
AS Genome assembly identifier.
DS Description. UTF-8 encoding may be used.
M5 MD5 checksum of the sequence. See Section 1.3.2.
SP Species.
TP Molecule topology. Valid values: linear (default) and circular.
UR URI of the sequence. This value may start with one of the standard protocols, e.g., 'http:' or 'ftp:'. If it does not start with one of these protocols, it is assumed to be a file-system path.
@RG Read group. Unordered multiple @RG lines are allowed.
ID* Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
BC Barcode sequence identifying the sample or library. This value is the expected barcode bases as read by the sequencing machine in the absence of errors. If there are several barcodes for the sample/library (e.g., one on each end of the template), the recommended implementation concatenates all the barcodes separating them with hyphens ('-').
CN Name of sequencing center producing the read.
DS Description. UTF-8 encoding may be used.
DT Date the run was produced (ISO8601 date or date/time).
FO Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read. Multi-base flows are encoded in IUPAC format, and non-nucleotide flows by various other characters. Format: /\*|[ACMGRSVTWYHKDBN]+/
KS The array of nucleotide bases that correspond to the key sequence of each read.
LB Library.
PG Programs used for processing the read group.
PI Predicted median insert size, rounded to the nearest integer.
PL Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SINGULAR, SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.
PM Platform model. Free-form text providing further details of the platform/technology used.
PU Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.
SM Sample. Use pool name where a pool is being sequenced.
@PG Program.
ID* Program record identifier. Each @PG line must have a unique ID. The value of ID is used in the alignment PG tag and PP tags of other @PG lines. PG IDs may be modified when merging SAM files in order to handle collisions.
PN Program name.
CL Command line. UTF-8 encoding may be used.
PP Previous @PG-ID. Must match another @PG header's ID tag. @PG records may be chained using PP tag, with the last record in the chain having no PP tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e., the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. The PG ID on a SAM record is not required to refer to the newest PG record in a chain. It may refer to any PG record in a chain, implying that the SAM record has been operated on by the program in that PG record, and the program(s) referred to via the PP tag.
DS Description. UTF-8 encoding may be used.
VN Program version.
@CO One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be used.
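The header rules in this section (TAB-delimited TAG:VALUE fields, no repeated tag within a line, and the required tags VN, SN, LN and ID) can be sketched in Python as follows. This is an illustrative parser under stated assumptions, not a reference implementation: parse_header_line and REQUIRED_TAGS are hypothetical names, and the value pattern is restricted to printable ASCII as in the general header-line expression, so UTF-8 values in DS or CL fields would need a looser pattern.

```python
import re

HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")

# Tags listed with '*' in the table of header record types.
REQUIRED_TAGS = {"HD": {"VN"}, "SQ": {"SN", "LN"}, "RG": {"ID"}, "PG": {"ID"}}


def parse_header_line(line: str):
    """Return (record_type, tags) for a header line, or ("CO", text) for a comment."""
    line = line.rstrip("\n")
    if COMMENT_RE.match(line):
        return "CO", line[4:]  # everything after "@CO" and the TAB
    if not HEADER_RE.match(line):
        raise ValueError(f"malformed header line: {line!r}")
    first, *fields = line.split("\t")
    record_type = first[1:]
    tags = {}
    for field in fields:
        tag, value = field[:2], field[3:]  # field[2] is the ':' separator
        if tag in tags:
            raise ValueError(f"tag {tag} appears more than once")
        tags[tag] = value
    missing = REQUIRED_TAGS[record_type] - set(tags)
    if missing:
        raise ValueError(f"@{record_type} line lacks required tag(s): {sorted(missing)}")
    return record_type, tags
```

Because field order is not significant, the tags are returned as a dictionary rather than a list.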

1.3.1 Defined sub-sort terms

While the SS sub-sort field allows implementation-defined keywords, some terms are predefined with specific meanings.

lexicographical sort order is defined as a character-based dictionary sort with the character order as defined by the POSIX C locale. For example "abc", "abc17", "abc5", "abc59" and "abcd" are in lexicographical order.

natural sort order is similar to lexicographical order except that runs of adjacent digits are considered to be numbers embedded within the text string, ordered numerically when compared to each other and ordered as single digits when compared to the surrounding non-digit characters. Runs that differ only in the number of leading zeros (thus are numerically tied) are ordered by more-zeros coming before fewer-zeros. The characters '-' and '.' are considered as ordinary characters, so apparently negative or fractional values are not treated as part of an embedded number. For example, "abc", "abc+5", "abc-5", "abc.d", "abc03", "abc5", "abc008", "abc08", "abc8", "abc17", "abc17.+", "abc17.2", "abc17.d", "abc59" and "abcd" are in natural order.

umi is a lexicographical sort by the UMI tag. The MI tag should be used for comparing UMIs. The RX tag may be used in its absence but is not guaranteed to be unique across multiple libraries.
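One possible way to realise the natural sort order described above is a Python sort key. This is a sketch, and natural_key is a hypothetical name. It assumes ASCII input: each digit run becomes (rank of a digit, numeric value, negated run length), so that numerically tied runs with more leading zeros sort first, and every other character is ranked by its code point.

```python
import re

_TOKEN_RE = re.compile(r"[0-9]+|[^0-9]")


def natural_key(text: str):
    """Build a sort key implementing natural order for ASCII strings."""
    key = []
    for token in _TOKEN_RE.findall(text):
        if "0" <= token[0] <= "9":
            # A digit run ranks as a digit against non-digit characters; runs
            # compare numerically, with longer (more-zeros) runs first on ties.
            key.append((ord("0"), int(token), -len(token)))
        else:
            key.append((ord(token), 0, 0))
    return key
```

Plain sorted() on Python strings compares by code point, which matches the lexicographical order defined above for ASCII input; passing key=natural_key gives the natural order instead.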

1.3.2 Reference MD5 calculation

The M5 tag on @SQ lines allows reference sequences to be uniquely identified through the MD5 digest of the sequence itself. As the digest is based on the sequence and nothing else, it can help resolve ambiguities with reference naming. For example, it allows a quick way of checking that references named '1', 'Chr1' and 'chr1' in different files are in fact the same.
The reference sequence must be in the 7-bit US-ASCII character set. All valid reference bases can be represented in this set, and it avoids the problem of determining exactly which 8-bit representation may have been used. Padding characters (see Section 3.2) must be represented only using the '*' character.
The digest is calculated as follows:
  • All characters outside of the inclusive range 33 ('!') to 126 ('~') are stripped out. This removes all unprintable and whitespace characters including spaces and new lines. Everything else is retained, even if not a legal nucleotide code.
  • All lowercase characters are converted to uppercase. This operation is equivalent to calling toupper() on characters in the POSIX locale.
  • The MD5 digest is calculated as described in RFC 1321 and presented as a 32 character lowercase hexadecimal number.
As an example, if the reference contains the following characters (including spaces):

ACGT ACGT ACGT
acgt acgt acgt
... 12345 !!!
then the digest is that of the string ACGTACGTACGTACGTACGTACGT...12345!!! and the resulting tag would be M5:dfabdbb36e239a6da88957841f32b8e4.
In padded SAM files, the padding bases should be inserted into the reference as '*' characters. Taking the example in Section 3.2, the padded version of the reference is
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
and the corresponding tag is M5:caad65b937c4bc0b33c08f62a9fb5411.
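The three steps of the digest calculation can be sketched with Python's standard hashlib module. The function names normalise_reference and reference_md5 are assumptions made for this illustration.

```python
import hashlib


def normalise_reference(sequence: str) -> str:
    """Drop characters outside 33..126, then convert lowercase to uppercase."""
    kept = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126)
    return kept.upper()


def reference_md5(sequence: str) -> str:
    """Return the 32-character lowercase hexadecimal MD5 used for the M5 tag."""
    normalised = normalise_reference(sequence)
    return hashlib.md5(normalised.encode("ascii")).hexdigest()
```

Applied to the three-line example above, normalise_reference yields ACGTACGTACGTACGTACGTACGT...12345!!!, the string whose digest is quoted in the text.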

1.4 The alignment section: mandatory fields

In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line consists of 11 or more TAB-separated fields. The first eleven fields are always present and in the order shown below; if the information represented by any of these fields is unavailable, that field's value will be a placeholder, either '0' or '*' as determined by the field's type. The following table gives an overview of these mandatory fields in the SAM format:
Col  Field  Type    Regexp/Range                 Brief description
1    QNAME  String  [!-?A-~]{1,254}              Query template NAME
2    FLAG   Int     $[0, 2^{16}-1]$              bitwise FLAG
3    RNAME  String  \*|[:rname:∧*=][:rname:]*    Reference sequence NAME
4    POS    Int     $[0, 2^{31}-1]$              1-based leftmost mapping POSition
5    MAPQ   Int     $[0, 2^{8}-1]$               MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+      CIGAR string
7    RNEXT  String  \*|=|[:rname:∧*=][:rname:]*  Reference name of the mate/next read
8    PNEXT  Int     $[0, 2^{31}-1]$              Position of the mate/next read
9    TLEN   Int     $[-2^{31}+1, 2^{31}-1]$      observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+               segment SEQuence
11   QUAL   String  [!-~]+                       ASCII of Phred-scaled base QUALity+33
All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.
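As a minimal sketch of the eleven mandatory fields, an alignment line can be split on TAB characters and the integer columns converted. SamRecord and parse_alignment_line are hypothetical names; the sketch does not validate the per-field patterns or ranges, and it leaves the optional fields as raw strings.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]


def parse_alignment_line(line: str) -> SamRecord:
    """Split one alignment line into its 11 mandatory fields plus optional fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, found {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )
```

For instance, the first alignment line of the example in Section 1.1, with its fields joined by TABs, parses to a record with pos 7, tlen 39 and no optional fields.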
  1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME '*' indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines when its alignment is chimeric or when multiple mappings are given.
  2. FLAG: Combination of bitwise FLAGs. Each bit is explained in the following table:
Bit            Description
1      0x1     template having multiple segments in sequencing
2      0x2     each segment properly aligned according to the aligner
4      0x4     segment unmapped
8      0x8     next segment in the template unmapped
16     0x10    SEQ being reverse complemented
32     0x20    SEQ of the next segment in the template being reverse complemented
64     0x40    the first segment in the template
128    0x80    the last segment in the template
256    0x100   secondary alignment
512    0x200   not passing filters, such as platform/vendor quality controls
1024   0x400   PCR or optical duplicate
2048   0x800   supplementary alignment
  • For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies 'FLAG & 0x900 == 0'. This line is called the primary line of the read.
  • Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
  • Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called a supplementary line.
  • Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.
  • Bit 0x10 indicates whether SEQ has been reverse complemented and QUAL reversed. When bit 0x4 is unset, this corresponds to the strand to which the segment has been mapped: bit 0x10 unset indicates the forward strand, while set indicates the reverse strand. When 0x4 is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine.
  • Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used. If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or when this information is lost during data processing.
  • If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.
  • Bits that are not listed in the table are reserved for future use. They should not be set when writing and should be ignored on reading by current software.
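The FLAG bits can be decoded with ordinary bitwise tests, as in the sketch below. The short bit names in FLAG_BITS are invented for this example and do not come from the specification; is_primary_line applies the 'FLAG & 0x900 == 0' test described above.

```python
# Hypothetical short names for the bits listed in the FLAG table.
FLAG_BITS = {
    0x1: "multiple_segments",
    0x2: "properly_aligned",
    0x4: "unmapped",
    0x8: "next_unmapped",
    0x10: "reverse",
    0x20: "next_reverse",
    0x40: "first_segment",
    0x80: "last_segment",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int):
    """List the names of the bits set in FLAG, lowest bit first; unlisted bits are ignored."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def is_primary_line(flag: int) -> bool:
    """True when neither the secondary (0x100) nor supplementary (0x800) bit is set."""
    return (flag & 0x900) == 0
```

In the example of Section 1.1, FLAG 99 decodes to a properly aligned first segment whose mate is reverse complemented, and FLAG 2064 marks a supplementary line on the reverse strand.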
  3. RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not '*') must be present in one of the SQ-SN tags. An unmapped segment without coordinate has a '*' at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is '*', no assumptions can be made about POS and CIGAR.
  4. POS: 1-based leftmost mapping POSition of the first CIGAR operation that "consumes" a reference base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.
  5. MAPQ: MAPping Quality. It equals $-10 \log_{10} \Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
  6. CIGAR: CIGAR string. The CIGAR operations are given in the following table (set '*' if unavailable):
Op  BAM  Description                                            Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)  yes             yes
I   1    insertion to the reference                             yes             no
D   2    deletion from the reference                            no              yes
N   3    skipped region from the reference                      no              yes
S   4    soft clipping (clipped sequences present in SEQ)       yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)   no              no
P   6    padding (silent deletion from padded reference)        no              no
=   7    sequence match                                         yes             yes
X   8    sequence mismatch                                      yes             yes
  • "Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
  • H can only be present as the first and/or last operation.
  • S may only have H operations between them and the ends of the CIGAR string.
  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
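The "consumes query" and "consumes reference" columns translate directly into two length calculations. The following Python sketch uses hypothetical names (parse_cigar, query_length, reference_length) and checks only the overall CIGAR pattern, not the placement rules for H and S.

```python
import re

CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
OP_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")

CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str):
    """Return a list of (length, operation) pairs; '*' gives an empty list."""
    if not CIGAR_RE.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in OP_RE.findall(cigar)]


def query_length(cigar: str) -> int:
    """Sum of M/I/S/=/X lengths, which should equal the length of SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned, i.e. the sum of M/D/N/=/X lengths."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
```

For 8M2I4M1D3M from the example in Section 1.1, the query length is 17, matching the 17 bases of SEQ, and the alignment spans 16 reference bases starting at POS.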
  7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the next read is the first read in the template. If @SQ header lines are present, RNEXT (if not '*' or '=') must be present in one of the SQ-SN tags. This field is set as '*' when the information is unavailable, and set as '=' if RNEXT is identical to RNAME. If not '=' and the next read in the template has one primary mapping (see also bit 0x100 in FLAG), this field is identical to RNAME at the primary line of the next read. If RNEXT is '*', no assumptions can be made on PNEXT and bit 0x20.
  8. PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
  9. TLEN: signed observed Template LENgth. For primary reads where the primary alignments of all reads in the template are mapped to the same reference sequence, the absolute value of TLEN equals the distance between the mapped end of the template and the mapped start of the template, inclusively (i.e., end - start + 1). Note that a mapped base is defined to be one that aligns to the reference as described by CIGAR, hence excludes soft-clipped bases. The TLEN field is positive for the leftmost segment of the template, negative for the rightmost, and the sign for any middle segment is undefined. If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing signs. It is set as 0 for a single-segment template or when the information is unavailable (e.g., when the first or last segment of a multi-segment template is unmapped or when the two are mapped to different reference sequences).

The intention of this field is to indicate where the other end of the template has been aligned without needing to read the remainder of the SAM file. Unfortunately there has been no clear consensus on the definitions of the template mapped start and end. Thus the exact definitions are implementation-defined.
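Since the exact definitions of the template's mapped start and end are implementation-defined, the following Python sketch shows only one possible reading: start is the leftmost mapped base and end is the rightmost mapped base of a two-segment template whose segments map to the same reference. The function name and its parameters are assumptions; each reference span is the number of reference bases consumed by that segment's CIGAR.

```python
def template_lengths(pos1: int, ref_span1: int, pos2: int, ref_span2: int):
    """Return (TLEN of segment 1, TLEN of segment 2) for two mapped segments.

    pos1/pos2 are 1-based POS values; ref_span1/ref_span2 are the counts of
    reference bases covered by each alignment. One interpretation only.
    """
    end1 = pos1 + ref_span1 - 1
    end2 = pos2 + ref_span2 - 1
    start = min(pos1, pos2)
    end = max(end1, end2)
    length = end - start + 1
    # The leftmost segment gets the positive sign; a tie is broken arbitrarily
    # in favour of segment 1, but the two signs always differ.
    if pos1 <= pos2:
        return length, -length
    return -length, length
```

With the read pair from Section 1.1 (POS 7 spanning 16 reference bases, and POS 37 spanning 9), this gives 39 and -39, the TLEN values shown in the example.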

  10. SEQ: segment SEQuence. This field can be a '*' when the sequence is not stored. If not a '*', the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR. An '=' denotes the base is identical to the reference base. No assumptions can be made on the letter cases.