美文网首页
文章复现-全外显子数据分析学习4 去接头与质控

文章复现-全外显子数据分析学习4 去接头与质控

作者: jiarf | 来源:发表于2022-04-21 08:52 被阅读0次

    教程在:肿瘤外显子数据处理系列教程(二)质控与去接头 (qq.com)

    去接头

    1.安装TrimGalore

    curl https://codeload.github.com/FelixKrueger/TrimGalore/zip/0.6.1 TrimGalore.zip
    unzip TrimGalore.zip
    ln -s ~/tools/TrimGalore-0.6.1/trim_galore ~/tools/bin/trim_galore
    

    在查找这个软件的时候发现了一个博主介绍的很好,此处复制一点自己觉得有用的

    Trim Galore是对FastQCCutadapt的包装。适用于所有高通量测序,包括RRBS(Reduced Representation Bisulfite-Seq ), Illumina、Nextera 和smallRNA测序平台的双端和单端数据。主要功能包括两步:
    第一步首先去除低质量碱基,然后去除3' 末端的adapter, 如果没有指定具体的adapter,程序会自动检测前1million的序列,然后对比前12-13bp的序列是否符合以下类型的adapter:

    • Illumina: AGATCGGAAGAGC
    • Small RNA: TGGAATTCTCGG
    • Nextera: CTGTCTCTTATA

    但是这个作者并没有用conda安装成功

    (wes) 16:45:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    conda install trim-galore
    Collecting package metadata (current_repodata.json): -
     fontconfig         conda-forge/linux-64::fontconfig-2.14.0-h8e229c2_0
      freetype           conda-forge/linux-64::freetype-2.10.4-h0708190_1
      libnsl             conda-forge/linux-64::libnsl-2.0.0-h7f98852_0
      libpng             conda-forge/linux-64::libpng-1.6.37-h21135ba_2
      libuuid            conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000
      openjdk            conda-forge/linux-64::openjdk-8.0.312-h7f98852_0
      perl               conda-forge/linux-64::perl-5.32.1-2_h7f98852_perl5
      pigz               conda-forge/linux-64::pigz-2.6-h27826a3_0
      trim-galore        bioconda/noarch::trim-galore-0.6.7-hdfd78af_0
      xopen              bioconda/noarch::xopen-0.7.3-py_0
    
    
    Proceed ([y]/n)? y
    Downloading and Extracting Packages
    pigz-2.6             | 87 KB     | ##################################### | 100%
    fontconfig-2.14.0    | 305 KB    | ##################################### | 100%
    xopen-0.7.3          | 11 KB     | ##################################### | 100%
    fastqc-0.11.9        | 9.7 MB    | ##################################### | 100%
    perl-5.32.1          | 14.4 MB   | ##################################### | 100%
    freetype-2.10.4      | 890 KB    | ##################################### | 100%
    libnsl-2.0.0         | 31 KB     | ##################################### | 100%
    libpng-1.6.37        | 306 KB    | ##################################### | 100%
    cutadapt-1.18        | 199 KB    | ##################################### | 100%
    expat-2.4.8          | 187 KB    | ##################################### | 100%
    openjdk-8.0.312      | 97.6 MB   | ##################################### | 100%
    bz2file-0.98         | 9 KB      | ##################################### | 100%
    libuuid-2.32.1       | 28 KB     | ##################################### | 100%
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    

    可以看到依赖的环境很多,要一个一个下载

    (wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    trim_galore --help
    
    ···
     within the other read).
                            NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.
    
    --retain_unpaired       If only one of the two paired-end reads became too short, the longer
                            read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
                            output files. The length cutoff for unpaired single-end reads is
                            governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.
    
    -r1/--length_1 <INT>    Unpaired single-end read length cutoff needed for read 1 to be written to
                            '.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
                            Default: 35 bp.
    
    -r2/--length_2 <INT>    Unpaired single-end read length cutoff needed for read 2 to be written to
                            '.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
                            Default: 35 bp.
    
    Last modified on 07 October 2020.
    

    安装成功

    (wes) 17:26:08 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    cutadapt --help
    cutadapt version 1.18
    
    Copyright (C) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>
    
    cutadapt removes adapter sequences from high-throughput sequencing reads.
    
    Usage:
        cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
    
    For paired-end reads:
        cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
    
    Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
    characters are supported. The reverse complement is *not* automatically
    searched. All reads from input.fastq will be written to output.fastq with the
    adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
    sequences can be given (use further -a options), but only the best-matching
    adapter will be removed.
    
    Input may also be in FASTA format. Compressed input and output is supported and
    auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
    standard input/output. Without the -o option, output is sent to standard output.
    
    Citation:
    
    Marcel Martin. Cutadapt removes adapter sequences from high-throughput
    sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
    http://dx.doi.org/10.14806/ej.17.1.200
    
    Run "cutadapt --help" to see all command-line options.
    See https://cutadapt.readthedocs.io/ for full documentation.
    
    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      --debug               Print debugging information.
      ···
    

    但是这个版本不得行

    (wes) 17:28:33 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    cutadapt --version
    1.18
    

    作者那时候2019年都2.6了,现在会更新
    更新一下

    (wes) 17:29:28 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    pip3 install --user --upgrade cutadapt
    Looking in indexes: http://pypi.douban.com/simple
    

    更新失败了,算了算了,我的pip总感觉这个环境的有点问题

    term_galore运行的参数


    image.png

    2.去接头

    这是教程的代码:

    cd 1.raw_fq
    touch ../3.clean/trimgalore.log
    
    ## trim_galore.sh
    cat ../tmp | while read id; do
        fq1=${id}_1.fastq.gz
        fq2=${id}_2.fastq.gz
        trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../3.clean  $fq1  $fq2 >> trimgalore.log 2>&1
    done
    
    nohup bash trim_galore.sh &
    
    multiqc ./ -n trim_galore_report -p -i " Trim REPORT OF SRP070662" -o multiqc
    
    nohup fastqc --outdir ../2.qc/post --threads 16 *.fq.gz > ../2.qc/post/fastqc.log 2>&1
    
    multiqc ./ -n post_qc_report -p -i " QC REPORT OF SRP070662" -o multiqc
    

    这是教程的,自己改脚本

    cat trim_galore.sh
    cd /data1/jiarongf/learning/cancer-WES/0.sra/raw_data
    
    ## trim_golore.sh
    cat ../../data/case | while read id; do
        fq1=${id}_1.fastq.gz
        fq2=${id}_2.fastq.gz
        trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  $fq1  $fq2 > ../../log/trimgalore.log 2>&1
    done
    
    

    这里报错了,试一下跑一个,发现试python的问题,之前创建的wes环境是2的版本,所以跑完之后会报错

    Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
    
    
      >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
    ERROR: Running in parallel is not supported on Python 2
    
    
    Cutadapt terminated with exit signal: '256'.
    Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
    

    现在更改已经创建的环境的python版本


    image.png
    conda install python==版本号
    

    但是有一个奇怪的就是再这个环境里面的conda show config是python是3.8.8的,为什么默认的python的版本就是2的,哎可能是创建环境的问题吧
    网有点不好,慢慢等待


    image.png
    Proceed ([y]/n)? y
    
    
    Downloading and Extracting Packages
    cutadapt-4.0         | 208 KB    | ##################################### | 100%
    pbzip2-1.1.13        | 114 KB    | ##################################### | 100%
    certifi-2021.10.8    | 145 KB    | ##################################### | 100%
    dnaio-0.8.1          | 76 KB     | ##################################### | 100%
    python-isal-0.11.1   | 141 KB    | ##################################### | 100%
    setuptools-62.1.0    | 1.2 MB    | ##################################### | 100%
    xopen-1.5.0          | 25 KB     | ##################################### | 100%
    isa-l-2.30.0         | 192 KB    | ##################################### | 100%
    python-3.8.8         | 26.1 MB   | ##################################### | 100%
    pip-22.0.4           | 1.5 MB    | ##################################### | 100%
    python_abi-3.8       | 4 KB      | ##################################### | 100%
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    

    下载成功
    并且也更新了cutadapt

    (wes) 14:46:05 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
    $
    cutadapt -v
    This is cutadapt 4.0 with Python 3.8.8
    Command line parameters: -v
    Run "cutadapt --help" to see command-line options.
    See https://cutadapt.readthedocs.io/ for full documentation.
    
    cutadapt: error: unrecognized arguments: -v
    (wes) 14:46:11 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
    $
    python --version
    Python 3.8.8
    

    再试着运行一下

    (wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
    $
    trim_galore  --paired -q 28 --phred33 --length 30 --stringency 3 --gzip --cores 8 -o ../../3.clean  case3_techrep_2_WES_1.fastq.gz  case3_techrep_2_WES_2.fastq.gz
    
    Number of cores used for trimming: 8
    Quality Phred score cutoff: 28
    Quality encoding type selected: ASCII+33
    Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
    Maximum trimming error rate: 0.1 (default)
    Minimum required adapter overlap (stringency): 3 bp
    Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
    Output file(s) will be GZIP compressed
    
    Cutadapt seems to be fairly up-to-date (version 4.0). Setting -j 8
    Writing final adapter and quality trimmed output to case3_techrep_2_WES_1_trimmed.fq.gz
    
    
      >>> Now performing quality (cutoff '-q 28') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file case3_techrep_2_WES_1.fastq.gz <<<
    10000000 sequences processed
    ^C
    (wes) 14:47:29 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/0.sra/raw_data
    
    

    全部拿去跑试试
    这里提醒,每次拿去跑的时候都要删掉之前的,不然不知道新生成的是哪个

    (wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    nohup sh trim_galore.sh > trim_galore.sh.log 2>&1 &
    (wes) 14:50:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    top
    
    
    image.png

    好了再后台跑起来了
    去看看log有没有问题,这里我生成了两个log,一个是运行sh的log一个是运行trim_galore的log
    先去检查sh的log


    image.png

    嗯呢果然很简洁,没啥
    在检查另一个log


    image.png
    成功的标志啊,哈哈哈哈哈,下面有再检查每一个adapter了

    也要跑很久,趁着跑的这段时间,去学习一下term_galore 发现这里不同的测序平台会有不同的adaper,那么如何知道这批数据是什么平台呢,生成fq文件之后不是做了一次fastqc质控嘛,随意打开一个


    image.png

    就可以看到这个样本的encoding了,

    学习trim_galore

    教程:使用trim_galore对数据进行质量控制-过滤 - 简书 (jianshu.com)

    image.png image.png
    参数说明

    --quality/-q<INT>:设定Phred quality score阈值,默认为Phred 20 切除质量得分低于设定值的序列
    --phred33/64:使用ASCII+33/64质量得分作为Phred得分选择-phred33或者-phred64,表示-测序平台使用的Phred quality score。
    (需要确认:anger/Illumina 1.9+ encoding为33; Illumina 1.5 encoding为64)
    --adapter/-a :输入adapter序列。也可以不输入,Trim Galore!会自动寻找可能性最高的平台对应的adapter。自动搜选的平台三个,也直接显式输入这三种平台,即--illumina、--nextera和--small_rna。
    -s/--stringency<INT>:设定可以忍受的前后adapter重叠的碱基数,默认为1(非常苛刻)。可以适度放宽,因为后一个adapter几乎不可能被测序仪读到。
    --length<INT>:设定输出reads长度阈值小于设定值会被抛弃,默认值20bp;即小于20bp的被去除。注意,在pe150下,不要设置太高,可以50或36(默认20)
    --max_length : 设置长度大于此值被丢弃
    -e <ERROR RATE> :默认0.1
    --paired:对于双端测序结果,一对reads中,如果有一个被剔除,那么另一个会被同样抛弃,而不管是否达到标准。
    --retain_unpaired:对于双端测序结果,一对reads中,如果一个read达到标准,但是对应的另一个要被抛弃,达到标准的read会被单独保存为一个文件。
    --gzip和--dont_gzip:清洗后的数据zip打包或者不打包。
    -o/--output_dir:输入目录 [需要提前建立目录,否则运行会报错]。
    -- trim-n : 移除read一端的reads
    -j :使用线程数, 注意假设已使用Python 3并安装了Pigz,那么内核设置为4,实际使用内核是15,因此,最高设为4.
    --fastqc #当分析结束后,使用默认选项对结果文件进行fastqc分析

    fasqc的结果

    image.png

    Per base sequence quality : 每个read各位置碱基的测序质量。
    横轴碱基的位置,纵轴是质量分数,
    Quality score= -10log10p
    (p代表错误率),所以当质量分数为40的时候,p就是0.0001,质量算高了。
    [红色线]代表中位数,
    [蓝色线]代表平均数
    [黄色]是25%-75%区间,
    [权]是10%-90%区间。
    若任一位置的下 [四分位数] 低于10或者 [中位数] 低于25---->出现 “警告”
    若任一位置的下 [四分位数] 低于5或者 [中位数] 低于20,出现“失败,Fail”

    跑完啦

    昨天跑了一下去接头和质控,质控的脚本:

    (wes) 08:38:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    cat post_fastqc.sh
    nohup fastqc --outdir ../2.qc/post --threads 16 ../3.clean/*.fq.gz > ../log/post.fastqc.log 2>&1 &
    
    image.png

    multi一下

    (wes) 08:41:12 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    cat post_qc_multiqc.sh
    multiqc /data1/jiarongf/learning/cancer-WES/2.qc/post  -n post_qc_report -p -i " QC REPORT OF SRP070662" -o /data1/jiarongf/learning/cancer-WES/2.qc/post/
    (wes) 08:41:21 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    sh post_qc_multiqc.sh
    
      /// MultiQC 🔍 | v1.13.dev0
    
    |           multiqc | Report title:  QC REPORT OF SRP070662
    |           multiqc | Search path : /data1/jiarongf/learning/cancer-WES/2.qc/post
    |         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 120/120
    |            fastqc | Found 60 reports
    |           multiqc | Compressing plot data
    |           multiqc | Report      : ../2.qc/post/post_qc_report.html
    |           multiqc | Data        : ../2.qc/post/post_qc_report_data
    |           multiqc | Plots       : ../2.qc/post/post_qc_report_plots
    |           multiqc | MultiQC complete
    (wes) 08:42:07 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/run
    $
    
    image.png

    结果分析:

    去接头前qc
    去接头后qc

    有看到dup的比例确实有减少
    M seqs也有相应的降低




    看到质量有明显的提升











    相关文章

      网友评论

          本文标题:文章复现-全外显子数据分析学习4 去接头与质控

          本文链接:https://www.haomeiwen.com/subject/bmosertx.html