Fastp: Filtering Next-Generation Sequencing Data

Author: 胡童远 | Published: 2021-11-18 17:34

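    Overview

    Fastp detects and removes adapters, corrects bases in the overlap region of paired-end (PE) reads, trims the head and tail of reads with a sliding window, trims polyG/polyX tails, and preprocesses UMIs. It combines these functions in a single fast tool and writes readable HTML and JSON reports. For routine FASTQ quality control and preprocessing, fastp can stand in for Trimmomatic, FastQC, Cutadapt, AfterQC, and SOAPnuke.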

    The Fastp Paper

    Title: fastp: an ultra-fast all-in-one FASTQ preprocessor
    Journal: Bioinformatics
    Citations: 2414 (Google Scholar, 2021-11-18)

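    Workflow

    Speed

    More accurate and efficient adapter removal

    Fewest mismatched bases, clipped reads, and single-read mappings when aligned to the hg19 human reference genome

    Efficient UMI preprocessing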

    Fastp Repository

    GitHub: https://github.com/OpenGene/fastp

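    Installing Fastp

    # create the environment, activate it, then install fastp from Bioconda
    conda create -n readqc
    conda activate readqc
    conda install -c bioconda -c conda-forge fastp
    fastp --version
    # fastp 0.23.1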
    

    Running Fastp

    conda activate readqc
    mkdir -p ./fastp  # make sure the output directory exists
    time fastp \
    --in1 ./input/E100032181_L01_29_1.fq.gz \
    --in2 ./input/E100032181_L01_29_2.fq.gz \
    --out1 ./fastp/E100032181_L01_29_1.fq.gz \
    --out2 ./fastp/E100032181_L01_29_2.fq.gz \
    --json ./fastp/fastp.json \
    --html ./fastp/fastp.html \
    --trim_poly_g --poly_g_min_len 10 \
    --trim_poly_x --poly_x_min_len 10 \
    --cut_front --cut_tail --cut_window_size 4 \
    --qualified_quality_phred 15 \
    --low_complexity_filter \
    --complexity_threshold 30 \
    --length_required 30 \
    --thread 4
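
    To apply the same settings to several samples, the command above can be generated once per sample. The following is a minimal sketch, assuming every sample follows the `<sample>_1.fq.gz` / `<sample>_2.fq.gz` naming seen here. The helper names `fastp_cmd` and `fastp_batch` and the per-sample report names are illustrative, not part of fastp, and paths containing spaces are not handled.

```shell
# Options shared by all samples (copied from the command above).
FASTP_OPTS="--trim_poly_g --poly_g_min_len 10 --trim_poly_x --poly_x_min_len 10 \
--cut_front --cut_tail --cut_window_size 4 --qualified_quality_phred 15 \
--low_complexity_filter --complexity_threshold 30 --length_required 30 --thread 4"

# Print the fastp command for one sample (hypothetical helper).
# $1 = path to the read 1 file, $2 = output directory
fastp_cmd() {
    local r1="$1" outdir="$2"
    local r2="${r1%_1.fq.gz}_2.fq.gz"       # mate file, assumed naming scheme
    local sample
    sample="$(basename "$r1" _1.fq.gz)"
    printf 'fastp --in1 %s --in2 %s --out1 %s --out2 %s --json %s --html %s %s\n' \
        "$r1" "$r2" \
        "$outdir/${sample}_1.fq.gz" "$outdir/${sample}_2.fq.gz" \
        "$outdir/${sample}.json" "$outdir/${sample}.html" \
        "$FASTP_OPTS"
}

# Print one command per read 1 file found in the input directory.
# $1 = input directory, $2 = output directory
fastp_batch() {
    local indir="$1" outdir="$2" r1
    for r1 in "$indir"/*_1.fq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        fastp_cmd "$r1" "$outdir"
    done
}
```

    Because the functions only print commands, the list can be reviewed first and then executed, for example with `mkdir -p ./fastp && fastp_batch ./input ./fastp | bash`.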
    

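    Parameters

    --trim_poly_g  trim polyG tails
    --poly_g_min_len 10  minimum polyG length to detect: 10 bp
    --trim_poly_x  trim polyX tails
    --poly_x_min_len 10  minimum polyX length to detect: 10 bp
    --cut_front  slide the window from the 5' end
    --cut_tail  slide the window from the 3' end
    --cut_window_size 4  window size: 4 bp
    --cut_mean_quality 20  minimum mean base quality within the window: 20 (not set in the command above; 20 is the fastp default)
    --qualified_quality_phred 15  a base counts as qualified at Phred quality >= 15
    --low_complexity_filter  enable the low-complexity filter
    --complexity_threshold 30  complexity threshold: 30%
    --length_required 30  discard reads shorter than 30 bp after trimming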
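
    The complexity threshold is easier to interpret with the definition in hand. The fastp README defines complexity as the percentage of bases that differ from the next base (base[i] != base[i+1]). The sketch below re-implements that definition with awk purely for illustration; the function name `seq_complexity` is hypothetical and this is not fastp's own code.

```shell
# Print the complexity of a sequence as a percentage:
# the share of positions whose base differs from the following base.
# $1 = nucleotide sequence
seq_complexity() {
    awk -v s="$1" 'BEGIN {
        n = length(s)
        diff = 0
        for (i = 1; i < n; i++)
            if (substr(s, i, 1) != substr(s, i + 1, 1))
                diff++
        # n - 1 adjacent pairs exist in a sequence of length n
        if (n > 1) printf "%.1f\n", 100 * diff / (n - 1)
        else       printf "0.0\n"
    }'
}
```

    For example, `seq_complexity AAAAAAAA` prints 0.0, so a homopolymer read like this falls below the 30% threshold, while `seq_complexity ACGTACGT` prints 100.0.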
    

    Run Log

    Read1 before filtering:
    total reads: 68871423
    total bases: 6887142300
    Q20 bases: 6788565208(98.5687%)
    Q30 bases: 6516393608(94.6168%)
    
    Read2 before filtering:
    total reads: 68871423
    total bases: 6887142300
    Q20 bases: 6752497708(98.045%)
    Q30 bases: 6459072061(93.7845%)
    
    Read1 after filtering:
    total reads: 68870151
    total bases: 6579451130
    Q20 bases: 6490255475(98.6443%)
    Q30 bases: 6233038928(94.7349%)
    
    Read2 after filtering:
    total reads: 68870151
    total bases: 6570653779
    Q20 bases: 6449906989(98.1623%)
    Q30 bases: 6173217216(93.9513%)
    
    Filtering result:
    reads passed filter: 137740302
    reads failed due to low quality: 32
    reads failed due to too many N: 936
    reads failed due to too short: 1480
    reads failed due to low complexity: 96
    reads with adapter trimmed: 24272074
    bases trimmed due to adapters: 604687721
    reads with polyX in 3' end: 698520
    bases trimmed in polyX tail: 6954246
    
    Duplication rate: 69.3962%
    
    Insert size peak (evaluated by paired-end reads): 141
    
    JSON report: ./fastp/fastp.json
    HTML report: ./fastp/fastp.html
    
    fastp --in1 ./input/E100032181_L01_29_1.fq.gz --in2 ./input/E100032181_L01_29_2.fq.gz --out1 ./fastp/E100032181_L01_29_1.fq.gz --out2 ./fastp/E100032181_L01_29_2.fq.gz --json ./fastp/fastp.json --html ./fastp/fastp.html --trim_poly_x --poly_x_min_len 10 --cut_front --cut_tail --cut_window_size 4 --qualified_quality_phred 15 --low_complexity_filter --complexity_threshold 30 --length_required 30 --thread 4
    fastp v0.23.1, time used: 567 seconds
    
    real    9m28.522s
    user    39m31.517s
    sys     0m37.690s
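
    The counts in this log can be checked against each other with a few lines of shell. The sketch below assumes the log was saved to a file, for example by appending `2> fastp.log` to the fastp command; the file name and the helper names `log_value` and `percent` are illustrative.

```shell
# Print the number that follows "<label>:" on the first matching log line.
# $1 = label, e.g. "reads passed filter"; $2 = log file
log_value() {
    awk -F': *' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }                  # ignore leading indentation
        $1 == key { printf "%.0f\n", $2 + 0; exit } # "+ 0" drops a trailing "(98.5%)"
    ' "$2"
}

# Print part / whole as a percentage with two decimals.
# $1 = part, $2 = whole
percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}
```

    With the numbers above, 2 x 68871423 input reads minus 137740302 passing reads leaves 2544 discarded reads, which matches the four failure counts (32 + 936 + 1480 + 96 = 2544). Likewise, `percent 13150104909 13774284600` (bases after filtering over bases before filtering, both mates summed) shows that about 95.5% of bases were retained, so most of the loss comes from adapter and tail trimming rather than from discarded reads.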
    

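    Fastp Results

    Example HTML report: http://opengene.org/fastp/fastp.html
    Example JSON report: http://opengene.org/fastp/fastp.json

    Further reading:
    fastp, with 2000+ citations, gets a major update and doubles its speed again!
    Bioinformatics software tools: fastp
    Sequencing data QC and preprocessing with fastp
    Processing UMIs
    UMI: unique molecular identifiers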
