cutadapt -- 结果文件

作者: QXPLUS | 来源:发表于2022-05-12 14:54 被阅读0次

cutadapt -- 结果文件
【生信】cutadapt①基本使用方法
三代HiFi测序数据去除接头序列
cutadapt
利用cutadapt去除miRNA-seq接头
安装cutadapt
宏基因组--简单流程(代码)
【生物学信息笔记】cutadapt 使用
cutadapt 使用详解
【RNA-Seq 实战】二、软件安装

以A01样本为例：

1_rawdata/A01_R1.fq.gz 1_rawdata/A01_R2.fq.gz

os.popen(command)
python 借助于os.popen()运行shell命令

from os import popen
from os.path import join
import os

comm = '/opt/software/python/Python-3.7.0/bin/cutadapt \
--pair-filter=both -j 8  \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o 1_rawdata/A01_cut_R1.fq \
-p 1_rawdata/A01_cut_R2.fq \
1_rawdata/A01_R1.fq.gz \
1_rawdata/A01_R2.fq.gz --quiet'

pid=popen(comm).read().strip()
print(pid)
# '28311'

结果文件

1_rawdata/A01_cut_R1.fq 1_rawdata/A01_cut_R2.fq summary28311.xls

summary28311.xls
在处理的时候可以将reads切分成许多部分，同时进行处理，可以提高效率。所有，我们得到的也是各段的统计值，需要对这些值进行求和，才是整个样本A01的统计情况。

[login TESTA01]$ less -S summary28311.xls

ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 67      1630039 3146494 3001147
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 26      1626928 3153637 3008651
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 41      1631128 3149262 3000306
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 36      1631434 3148880 3001510
ReadsCount      BasesCount      N       GC      Q20     Q30
1400    210000  2       105028  203581  193802
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 36      1631763 3145608 2994869
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 27      1620728 3093790 2890052
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 19      1625818 3122916 2945342
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 32      1626506 3140477 2982700
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 21      1627278 3141294 2987421
ReadsCount      BasesCount      N       GC      Q20     Q30
21710   3256500 32      1629499 3142636 2990793

用Excel打开 summary28311.xls

summary28311.xls
summary28311.xls 按列求和

    with open("summary{}.xls".format(pid)) as s:
        summary = [0] * 6                       # [0, 0, 0, 0, 0, 0]
        for line in s:
            try:
                nums = int(line[0])             # reads count:21710
                row = line.strip().split("\t")  # [21710 3256500   67    1630039 3146494 3001147]
                row = [int(value) for value in row]
                # [21710 3256500   67    1630039 3146494 3001147]
                summary = [x+y for x, y in zip(summary, row)] 
            except Exception as e:
                continue

image.png

>>> summary
[218500, 32775000, 339, 16386149, 31588575, 29996593]

基于输出结果，计算N GC Q20 Q30的占比

N_percent = 1.0*summary[2]/summary[1]*100
GC_percent = 1.0*summary[3]/(summary[1]-summary[2])*100
Q20_percent = 1.0*summary[4]/summary[1]*100
Q30_percent = 1.0*summary[5]/summary[1]*100 

with open(join(argv[4], argv[3] + "_summary1.xls"), "w") as out1:
    out1.write("SampleID\tLib\tReadsCount\tBasesCount(bp)\tN(%)\tGC(%)\tQ20(%)\tQ30(%)\n")
    out1.write("%s\tSPE\t%d\t%d\t%.5f\t%.2f\t%.2f\t%.2f\n" % ('A01', summary[0], summary[1], N_percent, GC_percent, Q20_percent, Q30_percent))

[login TESTA01]$ less -S 2_cleandata/A01_summary1.xls

SampleID        Lib     ReadsCount      BasesCount(bp)  N(%)    GC(%)   Q20(%)  Q30(%)
A01     SPE     218500  32775000        0.00103 50.00   96.38   91.52