美文网首页
cutadapt -- 结果文件

cutadapt -- 结果文件

作者: QXPLUS | 来源:发表于2022-05-12 14:54 被阅读0次

以A01样本为例:

1_rawdata/A01_R1.fq.gz 1_rawdata/A01_R2.fq.gz

  • os.popen(command)
    python 借助于os.popen()运行shell命令
from os import popen
from os.path import join
import os

comm = '/opt/software/python/Python-3.7.0/bin/cutadapt \
--pair-filter=both -j 8  \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o 1_rawdata/A01_cut_R1.fq \
-p 1_rawdata/A01_cut_R2.fq \
1_rawdata/A01_R1.fq.gz \
1_rawdata/A01_R2.fq.gz --quiet'

pid=popen(comm).read().strip()
print(pid)
# '28311'
  • 结果文件
1_rawdata/A01_cut_R1.fq 1_rawdata/A01_cut_R2.fq summary28311.xls
  • summary28311.xls
    在处理的时候可以将reads切分成许多部分,同时进行处理,可以提高效率。所有,我们得到的也是各段的统计值,需要对这些值进行求和,才是整个样本A01的统计情况。
[login TESTA01]$ less -S summary28311.xls

ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 67      1630039 3146494 3001147
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 26      1626928 3153637 3008651
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 41      1631128 3149262 3000306
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 36      1631434 3148880 3001510
ReadsCount      BasesCount      N       GC      Q20     Q30
1400    210000  2       105028  203581  193802
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 36      1631763 3145608 2994869
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 27      1620728 3093790 2890052
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 19      1625818 3122916 2945342
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 32      1626506 3140477 2982700
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 21      1627278 3141294 2987421
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 32      1629499 3142636 2990793
  • 用Excel打开 summary28311.xls


    summary28311.xls
  • summary28311.xls 按列求和

    with open("summary{}.xls".format(pid)) as s:
        summary = [0] * 6                       # [0, 0, 0, 0, 0, 0]
        for line in s:
            try:
                nums = int(line[0])             # reads count:21710
                row = line.strip().split("\t")  # [21710 3256500   67    1630039 3146494 3001147]
                row = [int(value) for value in row]
                # [21710 3256500   67    1630039 3146494 3001147]
                summary = [x+y for x, y in zip(summary, row)] 
            except Exception as e:
                continue
image.png
>>> summary
[218500, 32775000, 339, 16386149, 31588575, 29996593]
  • 基于输出结果,计算N GC Q20 Q30的占比
N_percent = 1.0*summary[2]/summary[1]*100
GC_percent = 1.0*summary[3]/(summary[1]-summary[2])*100
Q20_percent = 1.0*summary[4]/summary[1]*100
Q30_percent = 1.0*summary[5]/summary[1]*100 

with open(join(argv[4], argv[3] + "_summary1.xls"), "w") as out1:
    out1.write("SampleID\tLib\tReadsCount\tBasesCount(bp)\tN(%)\tGC(%)\tQ20(%)\tQ30(%)\n")
    out1.write("%s\tSPE\t%d\t%d\t%.5f\t%.2f\t%.2f\t%.2f\n" % ('A01', summary[0], summary[1], N_percent, GC_percent, Q20_percent, Q30_percent))
[login TESTA01]$ less -S 2_cleandata/A01_summary1.xls

SampleID        Lib     ReadsCount      BasesCount(bp)  N(%)    GC(%)   Q20(%)  Q30(%)
A01     SPE     218500  32775000        0.00103 50.00   96.38   91.52

相关文章

网友评论

      本文标题:cutadapt -- 结果文件

      本文链接:https://www.haomeiwen.com/subject/bygsurtx.html