71. 《Bioinformatics Data Skills》

作者: DataScience | 来源:发表于2021-10-07 21:26 被阅读0次

71. 《Bioinformatics Data Skills》
【shell笔记>生信|专项】生信数据处理技能手札(3)：
Bioinformatics Data Skills
《Bioinformatics Data Skills 2015
《Bioinformatics Data Skills 2015
Bioinformatics Data Skills - Pip
28.《Bioinformatics-Data-Skills》之
18.《Bioinformatics-Data-Skills》之
19.《Bioinformatics-Data-Skills》之
17.《Bioinformatics-Data-Skills》之

现在我们已经具备采用pysam模块统计比对信息的知识了，这里我们编写一个程序来统计BAM/SAM文件的信息并使用samtools flagstat进行验证。

NA12891_CEU_sample.bam文件下载

程序代码如下：

import pysam
import sys
from collections import Counter ## 1

if len(sys.argv) < 2: ## 2
    sys.exit("usage: alnsta.py bamfile")

bamfile = pysam.AlignmentFile(sys.argv[1]) ## 3
stat = Counter()
for read in bamfile:
    stat['total'] += 1
    stat['qcfail'] += int(read.is_qcfail) ## 4
    stat['paired'] += int(read.is_paired)
    stat['read1'] += int(read.is_read1)
    stat['read2'] += int(read.is_read2)

    if read.is_unmapped:
        stat['unmapped'] += 1
        continue ## 5
    
    stat['mapped'] += 1
    if read.mapping_quality <= 30:
        stat['mapping quality <= 30'] += 1
    stat["proper pair"] += int(read.is_proper_pair)

output_order = ("total", "mapped", "unmapped", "paired", "read1", "read2", "proper pair", "qcfail", "mapping quality <= 30") ## 6
for key in output_order:
    format = (key, stat[key], 100*float(stat[key]/stat['total']))
    sys.stdout.write(("%s: %d (%0.2f%%)\n") % format) ## 7

解释：

## 1，Counter对象在统计信息的时候很常用，类似于一个字典

## 2，简单判断是否存在输入文件名

## 3，pysam.AlignmentFile用于读取BAM/SAM文件，自动判断文件类型不需要指定

## 4，根据read的属性判断read是否符合条件，使用int函数将逻辑值转为0,1值

## 5，如果一个read没有比对到参考基因组上，跳过下面的内容统计

## 6，由于Counter对象存储的元素没有顺序，这里指定输出顺序

## 7，通过控制台以给定的格式顺序输出内容

该代码的运行结果为：

$ python3 alnsta.py NA12891_CEU_sample.bam

total: 636207 (100.00%)
mapped: 630875 (99.16%)
unmapped: 5332 (0.84%)
paired: 435106 (68.39%)
read1: 217619 (34.21%)
read2: 217487 (34.18%)
proper pair: 397247 (62.44%)
qcfail: 0 (0.00%)
mapping quality <= 30: 90982 (14.30%)

使用samtools flagstat进行检验：

$ samtools flagstat NA12891_CEU_sample.bam
636207 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
29826 + 0 duplicates
630875 + 0 mapped (99.16% : N/A)
435106 + 0 paired in sequencing
217619 + 0 read1
217487 + 0 read2
397247 + 0 properly paired (91.30% : N/A)
424442 + 0 with itself and mate mapped
5332 + 0 singletons (1.23% : N/A)
5383 + 0 with mate mapped to a different chr
2190 + 0 with mate mapped to a different chr (mapQ>=5)

可见我们的统计结果基本都是正确的，不过mapping quality <= 30数目没有统计到。我们可以这样验证：mapping quality <= 30数目 = 能够比对到参考基因组的read数目 - 比对质量大于30的read数目。

由于unmap的flag数字为4：

$ samtools flags UNMAP
0x4     4       UNMAP

所以mapped read数目为（注：这里-F表示取反）：

$ samtools view -c -F 4 NA12891_CEU_sample.bam
630875

通过-q参数可以查看比对质量大于30的read数目：

$ samtools view -c -q 31 NA12891_CEU_sample.bam
539893

综合这两个信息，可知read比对质量小于等于30的read数目为：630875 - 539893 = 90982，与脚本的结果一致。

网友评论

生物信息数据科学

本文标题：71. 《Bioinformatics Data Skills》

本文链接：https://www.haomeiwen.com/subject/gdaznltx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

71. 《Bioinformatics Data Skills》

相关文章