以A01样本为例:
1_rawdata/A01_R1.fq.gz 1_rawdata/A01_R2.fq.gz
- os.popen(command)
python 借助于os.popen()运行shell命令
from os import popen
from os.path import join
import os
comm = '/opt/software/python/Python-3.7.0/bin/cutadapt \
--pair-filter=both -j 8 \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o 1_rawdata/A01_cut_R1.fq \
-p 1_rawdata/A01_cut_R2.fq \
1_rawdata/A01_R1.fq.gz \
1_rawdata/A01_R2.fq.gz --quiet'
pid=popen(comm).read().strip()
print(pid)
# '28311'
- 结果文件
1_rawdata/A01_cut_R1.fq 1_rawdata/A01_cut_R2.fq summary28311.xls
- summary28311.xls
在处理的时候可以将reads切分成许多部分,同时进行处理,可以提高效率。所有,我们得到的也是各段的统计值,需要对这些值进行求和,才是整个样本A01的统计情况。
[login TESTA01]$ less -S summary28311.xls
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 67 1630039 3146494 3001147
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 26 1626928 3153637 3008651
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 41 1631128 3149262 3000306
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 36 1631434 3148880 3001510
ReadsCount BasesCount N GC Q20 Q30
1400 210000 2 105028 203581 193802
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 36 1631763 3145608 2994869
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 27 1620728 3093790 2890052
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 19 1625818 3122916 2945342
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 32 1626506 3140477 2982700
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 21 1627278 3141294 2987421
ReadsCount BasesCount N GC Q20 Q30
21710 3256500 32 1629499 3142636 2990793
-
用Excel打开 summary28311.xls
summary28311.xls -
summary28311.xls 按列求和
with open("summary{}.xls".format(pid)) as s:
summary = [0] * 6 # [0, 0, 0, 0, 0, 0]
for line in s:
try:
nums = int(line[0]) # reads count:21710
row = line.strip().split("\t") # [21710 3256500 67 1630039 3146494 3001147]
row = [int(value) for value in row]
# [21710 3256500 67 1630039 3146494 3001147]
summary = [x+y for x, y in zip(summary, row)]
except Exception as e:
continue
image.png
>>> summary
[218500, 32775000, 339, 16386149, 31588575, 29996593]
- 基于输出结果,计算N GC Q20 Q30的占比
N_percent = 1.0*summary[2]/summary[1]*100
GC_percent = 1.0*summary[3]/(summary[1]-summary[2])*100
Q20_percent = 1.0*summary[4]/summary[1]*100
Q30_percent = 1.0*summary[5]/summary[1]*100
with open(join(argv[4], argv[3] + "_summary1.xls"), "w") as out1:
out1.write("SampleID\tLib\tReadsCount\tBasesCount(bp)\tN(%)\tGC(%)\tQ20(%)\tQ30(%)\n")
out1.write("%s\tSPE\t%d\t%d\t%.5f\t%.2f\t%.2f\t%.2f\n" % ('A01', summary[0], summary[1], N_percent, GC_percent, Q20_percent, Q30_percent))
[login TESTA01]$ less -S 2_cleandata/A01_summary1.xls
SampleID Lib ReadsCount BasesCount(bp) N(%) GC(%) Q20(%) Q30(%)
A01 SPE 218500 32775000 0.00103 50.00 96.38 91.52
网友评论