美文网首页
cutadapt -- 结果文件

cutadapt -- 结果文件

作者: QXPLUS | 来源:发表于2022-05-12 14:54 被阅读0次

    以A01样本为例:

    1_rawdata/A01_R1.fq.gz 1_rawdata/A01_R2.fq.gz

    • os.popen(command)
      python 借助于os.popen()运行shell命令
    from os import popen
    from os.path import join
    import os
    
    comm = '/opt/software/python/Python-3.7.0/bin/cutadapt \
    --pair-filter=both -j 8  \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -o 1_rawdata/A01_cut_R1.fq \
    -p 1_rawdata/A01_cut_R2.fq \
    1_rawdata/A01_R1.fq.gz \
    1_rawdata/A01_R2.fq.gz --quiet'
    
    pid=popen(comm).read().strip()
    print(pid)
    # '28311'
    
    • 结果文件
    1_rawdata/A01_cut_R1.fq 1_rawdata/A01_cut_R2.fq summary28311.xls
    
    • summary28311.xls
      在处理的时候可以将reads切分成许多部分,同时进行处理,可以提高效率。所有,我们得到的也是各段的统计值,需要对这些值进行求和,才是整个样本A01的统计情况。
    [login TESTA01]$ less -S summary28311.xls
    
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 67      1630039 3146494 3001147
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 26      1626928 3153637 3008651
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 41      1631128 3149262 3000306
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 36      1631434 3148880 3001510
    ReadsCount      BasesCount      N       GC      Q20     Q30
    1400    210000  2       105028  203581  193802
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 36      1631763 3145608 2994869
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 27      1620728 3093790 2890052
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 19      1625818 3122916 2945342
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 32      1626506 3140477 2982700
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 21      1627278 3141294 2987421
    ReadsCount      BasesCount      N       GC      Q20     Q30
    21710   3256500 32      1629499 3142636 2990793
    
    • 用Excel打开 summary28311.xls


      summary28311.xls
    • summary28311.xls 按列求和

        with open("summary{}.xls".format(pid)) as s:
            summary = [0] * 6                       # [0, 0, 0, 0, 0, 0]
            for line in s:
                try:
                    nums = int(line[0])             # reads count:21710
                    row = line.strip().split("\t")  # [21710 3256500   67    1630039 3146494 3001147]
                    row = [int(value) for value in row]
                    # [21710 3256500   67    1630039 3146494 3001147]
                    summary = [x+y for x, y in zip(summary, row)] 
                except Exception as e:
                    continue
    
    image.png
    >>> summary
    [218500, 32775000, 339, 16386149, 31588575, 29996593]
    
    • 基于输出结果,计算N GC Q20 Q30的占比
    N_percent = 1.0*summary[2]/summary[1]*100
    GC_percent = 1.0*summary[3]/(summary[1]-summary[2])*100
    Q20_percent = 1.0*summary[4]/summary[1]*100
    Q30_percent = 1.0*summary[5]/summary[1]*100 
    
    with open(join(argv[4], argv[3] + "_summary1.xls"), "w") as out1:
        out1.write("SampleID\tLib\tReadsCount\tBasesCount(bp)\tN(%)\tGC(%)\tQ20(%)\tQ30(%)\n")
        out1.write("%s\tSPE\t%d\t%d\t%.5f\t%.2f\t%.2f\t%.2f\n" % ('A01', summary[0], summary[1], N_percent, GC_percent, Q20_percent, Q30_percent))
    
    [login TESTA01]$ less -S 2_cleandata/A01_summary1.xls
    
    SampleID        Lib     ReadsCount      BasesCount(bp)  N(%)    GC(%)   Q20(%)  Q30(%)
    A01     SPE     218500  32775000        0.00103 50.00   96.38   91.52
    

    相关文章

      网友评论

          本文标题:cutadapt -- 结果文件

          本文链接:https://www.haomeiwen.com/subject/bygsurtx.html