bam 文件介绍
https://samtools.github.io/hts-specs/SAMv1.pdf
http://bio-bwa.sourceforge.net/bwa.shtml
首先看看bam
Col | Field | Description |
---|---|---|
1 | QNAME | Query (pair) NAME |
2 | FLAG | bitwise FLAG |
3 | RNAME | Reference sequence NAME |
4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
5 | MAPQ | MAPping Quality (Phred-scaled) |
6 | CIAGR | extended CIGAR string |
7 | MRNM | Mate Reference sequence NaMe (‘=’ if same as RNAME) |
8 | MPOS | 1-based Mate POSistion |
9 | ISIZE | Inferred insert SIZE |
10 | SEQ | query SEQuence on the same strand as the reference |
11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
介绍如下:
-
QNAME 序列name,来源于fastq文件
-
flag 见:samtools flags 有解释
SE测序,flag为0好像也表示匹配
https://www.biostars.org/p/7374/
When the flag field is 0, it means none of the bitwise flags specified in the SAM spec (on page 4) are set. That means that your reads with flag 0 are unpaired (because the first flag, 0x1, is not set), successfully mapped to the reference (because 0x4 is not set) and mapped to the forward strand (because 0x10 is not set). Summarizing your data, the reads with flag 4 are unmapped, the reads with flag 0 are mapped to the forward strand and the reads with flag 16 are mapped to the reverse strand.
SE 的flag:
samtools view HG2492W/HG2492W.bam | awk '{print $2}' | sort | uniq -c
1597504 0
1599538 16
5794 256
5756 272
219660 4
samtools view HG2492H/HG2492H.bam | awk '{print $2}' | sort | uniq -c
424935 0
429939 16
1278 256
1114 272
3316 4
PE的flag 比较复杂,一般是组合的情况:
cat flag | sort | uniq -c
2912 113
415 117
679 121
2836 129
700 133
373 137
2357 141
5073 145
1188332 147
5035 161
1172704 163
2912 177
679 181
415 185
132 321
29 323
38 329
294 337
99 339
278 353
62 355
184 369
39 371
42 377
146 385
47 387
45 393
292 401
69 403
259 417
70 419
193 433
50 435
37 441
2836 65
373 69
700 73
2357 77
5035 81
1172704 83
5073 97
1188332 99
查flags网站:https://broadinstitute.github.io/picard/explain-flags.html
参考解释:https://www.cnblogs.com/xudongliang/p/5437850.html
3.RNAME 表示参考序列的名称,比如基因组的染色体编号等,如果没有比对上则显示为*
4.POS 表示比对的起始位置,以1开始计数,如果没有比对上则显示为0;从5’开始
5.MAPQ
该值的计算方法是mapping的错误率的-10log10值,之后四舍五入得到的整数,如果值为255表示mapping值是不可用的,如果是unmapped read则MAPQ为0,一般在使用bwa mem或bwa aln(bwa 0.7.12-r1039版本)生成的sam文件,第五列为60表示mapping率最高,一般结果是这一列的数值是从0到60,且0和60这两个数字出现次数最多。该值越大说明越可靠。
R 计算MAPQ
已知P计算MAPQ
-10log10(0.01)
[1] 20
-10log10(0.1)
[1] 10
已知MAPQ,计算 Pr
10(0/-10)
[1] 1
10(1/-10)
[1] 0.7943282
10(2/-10)
[1] 0.6309573
10(3/-10)
[1] 0.5011872
10(4/-10)
[1] 0.3981072
10(5/-10)
[1] 0.3162278
10(6/-10)
[1] 0.2511886
10(7/-10)
[1] 0.1995262
10(8/-10)
[1] 0.1584893
10(9/-10)
[1] 0.1258925
10**(10/-10)
[1] 0.1</pre>
Pr(mapping position is wrong) | MAPQ |
---|---|
1 | 0 |
0.7943282 | 1 |
0.3162278 | 5 |
0.1 | 10 |
0.01 | 20 |
0.001 | 30 |
0.0001 | 40 |
6.CIGAR 即比对的详细情况,简要比对信息表达式(Compact Idiosyncratic Gapped Alignment Report),其以参考序列为基础,使用数字加字母表示比对结果,比如3S6M1P1I4M,前三个碱基被剪切去除了,然后6个比对上了,然后打开了一 个缺口,有一个碱基插入,最后是4个比对上了,是按照顺序的;
7.RNEXT PE测序中下一个reads比对的参考系列的名称,如果没有则用 " * " 表示,如果和前一个reads比对到同一个参考序列则用" = "表示,SE 是" * "
8.PNEXT PE测序下一个reads比对到参考序列上的位置,如果没有则用0表示;
9.TLEM query序列的模板长度或者插入长度,Template的长度,最左边得为正+,最右边的为负 -,中间的不用 定义正负,不分区段(single-segment)的比对上,或者不可用时,表示为0;
-
SEQ reads的序列信息;
-
QUAL reads的序列质量信息,同FASTQ。
-
可选字段(optional fields),格式如:TAG:TYPE:VALUE,其中TAG有两个大写字母组成,每个TAG代表一类信息,每一行一个TAG只能出现一次,TYPE表示TAG对应值的类型,可以是字符串、整数、字节、数组等。
TPNB500301:64:HM3T7AFXY:3:21404:5781:1306 16 chr1 9996 1 4S37M * 0 0 CTCTTCCGATAACCCTAACCCTAACCCTAACCCTAGCCCTA E42DC3=3=@@2DD26@>DD6>@>DD3@@>CD6>D9@><<< BD:Z:KKJJJJJKJJJKIKJJJKIKJJJKIKJJJKIKJKMLJKJJJ MD:Z:31A5 PG:Z:MarkDuplicates RG:Z:FAMILY_HG2493W BI:Z:OPOPPOOPPOOPNOOOOPNOOOOPNOOOOPNOOPPPNOOOO NM:i:1 AS:i:32 XS:i:30
TPNB500301:64:HM3T7AFXY:3:21401:22875:18010 16 chr1 10547 0 11S64M * 0 0 TCCCCCCGACCTCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACG ?44D4D=3<41CE=D3EC@DDC3D2@>=EEDEECD=EDDEC=E=CD=EECECD=CCDCE=D=EE<ECCECB<77< BD:Z:JIIIIJJKKKKKLJJLLKJJJKJJKJKJLKLMKJJJKIKKKJJJKJLMKKKJJJIKLKLJJLMLJJJKKKLJKJJ MD:Z:13C50 PG:Z:MarkDuplicates RG:Z:FAMILY_HG2493W BI:Z:PNNNNOOPPOOPQNOPPPONPONOPOOOPPPPOPOOPNOOOOOOPOPPOPOPOOOPPPQNOPPQPNPONOPOOOO NM:i:1 AS:i:59 XS:i:59
网友评论