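Overview
fastp detects and removes adapters, corrects mismatched bases in the overlapping region of paired-end (PE) reads, trims read heads and tails with a sliding window, trims polyG/polyX tails, and preprocesses UMIs. It combines these functions in one tool, runs fast, gives good results, and generates readable reports. fastp can take over the tasks usually split across Trimmomatic, FastQC, Cutadapt, AfterQC, and SOAPnuke.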
The fastp paper
Title: fastp: an ultra-fast all-in-one FASTQ preprocessor
Journal: Bioinformatics
Citations: 2414 (Google Scholar, 2021-11-18)
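Workflow
Fast
More accurate and efficient adapter removal
Fewest mismatched bases, clipped reads, and single-read mappings when aligned to the hg19 human reference genome
Efficient UMI preprocessing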
fastp source
GitHub: https://github.com/OpenGene/fastp
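Installing fastp
conda create -n readqc
# activate the new environment first so fastp is installed into it
conda activate readqc
conda install -c bioconda fastp
fastp --version
# fastp 0.23.1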
Running fastp
conda activate readqc
time fastp \
--in1 ./input/E100032181_L01_29_1.fq.gz \
--in2 ./input/E100032181_L01_29_2.fq.gz \
--out1 ./fastp/E100032181_L01_29_1.fq.gz \
--out2 ./fastp/E100032181_L01_29_2.fq.gz \
--json ./fastp/fastp.json \
--html ./fastp/fastp.html \
--trim_poly_g --poly_g_min_len 10 \
--trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 \
--qualified_quality_phred 15 \
--low_complexity_filter \
--complexity_threshold 30 \
--length_required 30 \
--thread 4
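To apply the same settings to several samples, the command above can be wrapped in a loop. The following is a minimal sketch rather than part of the original run: it assumes paired files named <sample>_1.fq.gz and <sample>_2.fq.gz under ./input, and the function name run_fastp and the DRY_RUN switch are hypothetical helpers introduced here for illustration.

```shell
# Sketch: run fastp with the settings shown above on one paired-end sample.
# run_fastp and DRY_RUN are illustrative names, not part of fastp.
run_fastp() {
    local sample=$1
    local indir=${2:-./input}
    local outdir=${3:-./fastp}
    mkdir -p "$outdir"    # make sure the output directory exists
    # With DRY_RUN set to a non-empty value the command is printed, not executed.
    ${DRY_RUN:+echo} fastp \
        --in1 "$indir/${sample}_1.fq.gz" \
        --in2 "$indir/${sample}_2.fq.gz" \
        --out1 "$outdir/${sample}_1.fq.gz" \
        --out2 "$outdir/${sample}_2.fq.gz" \
        --json "$outdir/${sample}.fastp.json" \
        --html "$outdir/${sample}.fastp.html" \
        --trim_poly_g --poly_g_min_len 10 \
        --trim_poly_x --poly_x_min_len 10 \
        --cut_front --cut_tail --cut_window_size 4 \
        --qualified_quality_phred 15 \
        --low_complexity_filter \
        --complexity_threshold 30 \
        --length_required 30 \
        --thread 4
}

# Loop over every read-1 file and derive the sample name from it.
for r1 in ./input/*_1.fq.gz; do
    [ -e "$r1" ] || continue    # skip when the glob matches nothing
    run_fastp "$(basename "$r1" _1.fq.gz)"
done
```

Writing one JSON and one HTML report per sample keeps later runs from overwriting earlier reports; setting DRY_RUN=1 before the loop prints each command so it can be inspected first.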
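Parameters
--trim_poly_g trim polyG tails
--poly_g_min_len 10 minimum polyG length to detect is 10 bp
--trim_poly_x trim polyX tails
--poly_x_min_len 10 minimum polyX length to detect is 10 bp
--cut_front slide the window from the 5' end toward the 3' end and drop low-quality windows
--cut_tail slide the window from the 3' end toward the 5' end and drop low-quality windows
--cut_window_size 4 window size of 4 bp
--cut_mean_quality 20 minimum mean base quality within the window is 20 (the fastp default; not set explicitly in the command above)
--qualified_quality_phred 15 a base counts as qualified when its quality is at least 15
--low_complexity_filter enable the low-complexity filter
--complexity_threshold 30 complexity threshold of 30%
--length_required 30 discard reads shorter than 30 bp after trimming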
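The window options compare the mean Phred quality of --cut_window_size consecutive bases with --cut_mean_quality. As a rough illustration only (this is not fastp's own code), the helper below computes the mean quality of a quality string, assuming Phred+33 encoding; mean_phred is a hypothetical name.

```shell
# Mean Phred score of a FASTQ quality string, assuming Phred+33 encoding.
# mean_phred is an illustrative helper, not a fastp command.
mean_phred() {
    printf '%s\n' "$1" | awk '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
        {
            n = length($0)
            if (n == 0) next
            sum = 0
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            printf "%.2f\n", sum / n
        }'
}

mean_phred 'IIII'   # prints 40.00: a 4-base window above a threshold of 20
mean_phred 'I###'   # prints 11.50: a 4-base window below a threshold of 20
```

With a threshold of 20, a window like the second one would be treated as low quality, while the first would stop the trimming.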
Run log
Read1 before filtering:
total reads: 68871423
total bases: 6887142300
Q20 bases: 6788565208(98.5687%)
Q30 bases: 6516393608(94.6168%)
Read2 before filtering:
total reads: 68871423
total bases: 6887142300
Q20 bases: 6752497708(98.045%)
Q30 bases: 6459072061(93.7845%)
Read1 after filtering:
total reads: 68870151
total bases: 6579451130
Q20 bases: 6490255475(98.6443%)
Q30 bases: 6233038928(94.7349%)
Read2 after filtering:
total reads: 68870151
total bases: 6570653779
Q20 bases: 6449906989(98.1623%)
Q30 bases: 6173217216(93.9513%)
Filtering result:
reads passed filter: 137740302
reads failed due to low quality: 32
reads failed due to too many N: 936
reads failed due to too short: 1480
reads failed due to low complexity: 96
reads with adapter trimmed: 24272074
bases trimmed due to adapters: 604687721
reads with polyX in 3' end: 698520
bases trimmed in polyX tail: 6954246
Duplication rate: 69.3962%
Insert size peak (evaluated by paired-end reads): 141
JSON report: ./fastp/fastp.json
HTML report: ./fastp/fastp.html
fastp --in1 ./input/E100032181_L01_29_1.fq.gz --in2 ./input/E100032181_L01_29_2.fq.gz --out1 ./fastp/E100032181_L01_29_1.fq.gz --out2 ./fastp/E100032181_L01_29_2.fq.gz --json ./fastp/fastp.json --html ./fastp/fastp.html --trim_poly_x --poly_x_min_len 10 --cut_front --cut_tail --cut_window_size 4 --qualified_quality_phred 15 --low_complexity_filter --complexity_threshold 30 --length_required 30 --thread 4
fastp v0.23.1, time used: 567 seconds
real 9m28.522s
user 39m31.517s
sys 0m37.690s
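The counts in the log are enough to check how much data survived filtering. Below is a minimal sketch, assuming the log text was captured to a file (for example with fastp ... 2> fastp.log, if the summary goes to standard error); fastp_retention and fastp.log are hypothetical names.

```shell
# Summarize read and base retention from a saved fastp log.
# fastp_retention is an illustrative helper; usage: fastp_retention fastp.log
fastp_retention() {
    awk '
        /before filtering:/  { stage = "before" }
        /after filtering:/   { stage = "after" }
        /^Filtering result:/ { stage = "" }
        # Sum Read1 and Read2 counts for each stage.
        stage != "" && /total reads:/ { reads[stage] += $NF }
        stage != "" && /total bases:/ { bases[stage] += $NF }
        END {
            if (reads["before"] > 0 && bases["before"] > 0) {
                printf "reads kept: %.4f%%\n", 100 * reads["after"] / reads["before"]
                printf "bases kept: %.2f%%\n", 100 * bases["after"] / bases["before"]
            }
        }' "$1"
}
```

For the log above this reports 99.9982% of reads and 95.47% of bases kept, so almost all of the loss comes from trimming (adapters, polyX, low-quality ends) rather than from discarded reads.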
fastp results
Example HTML report: http://opengene.org/fastp/fastp.html
Example JSON report: http://opengene.org/fastp/fastp.json