mpileup是samtools的一个命令,用来生存bcf文件,然后再用bcftools进行SNP和Indel的分析。另外,bcftools是samtools的附带软件。
1用法
samtools mpileup [options] in1.bam [in2.bam [...]]
Input options:
-6, --illumina1.3+ quality is in the Illumina-1.3+ encoding
-A, --count-orphans do not discard anomalous read pairs
-b, --bam-list FILE list of input BAM filenames, one per line
-B, --no-BAQ disable BAQ (per-Base Alignment Quality)
-C, --adjust-MQ INT adjust mapping quality; recommended:50, disable:0 [0]
-d, --max-depth INT max per-file depth; avoids excessive memory usage [8000]
-E, --redo-BAQ recalculate BAQ on the fly, ignore existing BQs
-f, --fasta-ref FILE faidx indexed reference sequence file
-G, --exclude-RG FILE exclude read groups listed in FILE
-l, --positions FILE skip unlisted positions (chr pos) or regions (BED)
-q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
-Q, --min-BQ INT skip bases with baseQ/BAQ smaller than INT [13]
-r, --region REG region in which pileup is generated
-R, --ignore-RG ignore RG tags (one BAM = one sample)
--rf, --incl-flags STR|INT required flags: skip reads with mask bits unset []
--ff, --excl-flags STR|INT filter flags: skip reads with mask bits set
[UNMAP,SECONDARY,QCFAIL,DUP]
-x, --ignore-overlaps disable read-pair overlap detection
Output options:
-o, --output FILE write output to FILE [standard output]
-O, --output-BP output base positions on reads
-s, --output-MQ output mapping quality
--output-QNAME output read names
-a output all positions (including zero depth)
-a -a (or -aa) output absolutely all positions, including unused ref. sequences
最常用的是
最常用的参数有两个:
-f
用samtools faidx对参考序列建index.fai
文件,其他软件也可以
-g
输出到bcf格,否则生成文本格式文件。用法和最简单的例子如下
u
输出不压缩的bcf文件
$ samtools mpileup -f genome.fasta abc.bam > abc.txt
$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf
$ samtools mpileup -guSDf genome.fasta abc.bam |bcftools view -cvNg - > abc.vcf
2结果解释
2.1结果共6列
[mpileup] 1 samples in 1 input files
chr1 10105 N 8 AAAAcAAA kuuu>6uK
chr1 10106 N 10 CCCCcCCCC^$c uuuukAuuAK
chr1 10107 N 10 CCCCcCCCCc upukaFfuKF
chr1 10108 N 9 CCCCcCCCC ukup7K\uK
chr1 10109 N 10 AA$AAaAAAA$a kpuuBKppFA
chr1 10110 N 7 AA$A$aAAA kupkFkp
chr1 10111 N 6 CcCCCc QuApuK
chr1 10112 N 6 CcCCCc \uKppK
chr1 10113 N 6 C$cCCCc \uKkpK
chr1 10114 N 5 tTTTt k7fpF
分别是
- 参考序列名(染色体)
- 位置
- 参考序列的碱基
- 比对上的read数目
- 比对的情况
- 比对上的碱基的质量
2.2其中第五列相对复杂,具体解释如下:
chr1 10056 N 7 AAAA*AA kfuufKK
chr1 2003928 N 5 CCC+6GGGCCGC+6GGGCCGC uupu<
chr1 10106 N 10 CCCCcCCCC^$c uuuukAuuAK
chr1 737247 N 3 Cc^5c KKK
chr1 1197931 N 3 T-6NNNNNNT-6NNNNNNt-6nnnnnn uuK
- 1
*
表示模糊碱基 - 2 大写表示在正链不匹配
- 3 小写表示在负链不匹配
- 4
^
表示匹配的碱基是一个reads的开始,^
后紧跟的ascii码减去33代表比对质量,修饰的是后面的碱基,后面紧跟的碱基代表该read的第一个碱基 - 5
$
代表一个read的结束,该符号修饰前面的碱基 - 6 正则表达式式
+[0-9]+[ACGTNacgtn]+
代表在该位点后插入的碱基。举例中chr1的2003928A后面有个+6GGGCCG
,很可能是indel - 7 正则表达式’-[0-9]+[ACGTNacgtn]+’代表在该位点后缺失的碱基;
网友评论