SOAPnuke: A Guide to the Fastq Filtering Program


Author: 大号在这里 | Published: 2020-09-04 15:44

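    1. Introduction

    SOAPnuke is a fastq filtering tool developed in-house by BGI. Its main functions are adapter filtering, low-quality filtering, and filtering of reads with a high proportion of N. The basic filtering functions live in the filter module, which suits most raw sequencer output in fastq format. For specific data types, use the filtersRNA, filterDGE, or filterMeta module.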

    2. Scope

    Sequencing platforms: HiSeq 2000, HiSeq 2500, HiSeq 4000, HiSeq X Ten, Zebra
    Sequencing strategy: PE/SE
    Data types:
    filter: filters raw fastq data produced by RNA-seq, RNA-ref, BS, MeDIP, CHIP, RNAdenovo, and DNA sequencing.
    filtersRNA: small RNA from short-read SE sequencing (mature length usually around 21-23 nt, tag length 18-30 nt); also applicable to the small RNA degradome pipeline.
    filterDGE: not covered in detail here
    filterMeta: not covered in detail here

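    3. Feature list

    3.1 filter features

    1) Remove reads that contain adapter (matched with up to one mismatch and a minimum match ratio), or cut the adapter sequence out of the reads

    2) Remove reads in which bases with a quality value below 10 (5) make up more than 50% of the read

    3) Remove reads whose proportion of N is greater than 5% (default)

    4) Remove poly A reads (RNA), that is, sequences that are 100% A

    5) Remove the index (from the sequence ID)

    6) Keep only a specified amount of data

    7) Output clean data and raw data (raw data is output only when data is cut)

    8) Remove reads whose average quality value is too low

    9) Remove reads that come from PCR duplicate fragments

    10) Remove reads whose insert size is too small (overlap between read1 and read2 >= 10 bp, mismatch <= 10%); intended for DNA de novo and not done by default

    11) Convert the quality scoring system of the fastq file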
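
    As an illustration of items 2, 3, 8, and 11 above, here is a minimal Python sketch of the per-read decision. It is not SOAPnuke's implementation. The function names are hypothetical, the comparisons (strictly "below" and strictly "more than") follow the wording of the list, and mapping the "illumina" quality system to a Phred offset of 64 and "sanger" to 33 is an assumption.

```python
from typing import List, Optional


def phred_scores(qual: str, offset: int = 64) -> List[int]:
    """Decode a FASTQ quality string (assumed offsets: 64 = illumina, 33 = sanger)."""
    return [ord(ch) - offset for ch in qual]


def to_sanger(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (item 11)."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in qual)


def passes_filter(seq: str, qual: str, *, low_qual: int = 5,
                  qual_rate: float = 0.5, n_rate: float = 0.05,
                  mean_qual: Optional[float] = None, offset: int = 64) -> bool:
    """Return True if the read survives the N, low-quality, and mean-quality rules."""
    if not seq or len(seq) != len(qual):
        return False
    scores = phred_scores(qual, offset)
    # Item 3: discard when the proportion of N exceeds n_rate.
    if seq.upper().count("N") / len(seq) > n_rate:
        return False
    # Item 2: discard when low-quality bases exceed qual_rate of the read.
    low = sum(1 for q in scores if q < low_qual)
    if low / len(scores) > qual_rate:
        return False
    # Item 8: optional average-quality cut-off.
    if mean_qual is not None and sum(scores) / len(scores) < mean_qual:
        return False
    return True
```

    For paired-end data a pair would presumably be dropped when either mate fails; the sketch only covers a single read.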

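    3.2 filtersRNA features

    1. Check for low quality (discard reads with poor quality values).

    2. Detect and trim the 3' adapter (discard reads with a missing 3' end, where no 3' adapter is detected, and empty inserts, where the tag is too short after trimming).

    3. Detect the 5' adapter (discard 5' contamination).

    4. Detect polyA tags (discard tags that are 70% A; used for small RNA), or detect polyN tags (discard tags that are 70% any single base A, C, T, or G; used for small RNA degradome).

    5. Detect small fragments (discard fragments < 18 nt).

    6. If the output is in fq format, convert the quality values to the sanger system.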
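
    The order of steps 2, 4, and 5 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's algorithm: the adapter is matched exactly (no mismatches), the 5' adapter and quality checks are left out, the polyA comparison is assumed to be "at least 70%", and the function names and result labels are hypothetical. The adapter sequence is the default 3' adapter given in the filtersRNA option list.

```python
from typing import Optional

# Default 3' adapter taken from the filtersRNA option list.
ADAPTER3 = "TCGTATGCCGTCTTCTGCTTG"


def trim_3p_adapter(read: str, adapter: str = ADAPTER3,
                    min_align: int = 5) -> Optional[str]:
    """Return the tag upstream of the 3' adapter, or None if no adapter is found.

    Simplification: exact matching only; a partial adapter at the read end
    counts when at least min_align bases overlap.
    """
    for start in range(len(read) - min_align + 1):
        overlap = min(len(read) - start, len(adapter))
        if read[start:start + overlap] == adapter[:overlap]:
            return read[:start]
    return None


def classify_tag(read: str, *, min_size: int = 18, poly_a: float = 0.7) -> str:
    """Label a read as 'no_3p_adapter', 'too_short', 'polyA', or 'clean'."""
    tag = trim_3p_adapter(read)
    if tag is None:
        return "no_3p_adapter"          # step 2: 3' adapter not detected
    if len(tag) < min_size:
        return "too_short"              # steps 2 and 5: empty insert or small fragment
    if poly_a > 0 and tag.count("A") / len(tag) >= poly_a:
        return "polyA"                  # step 4: polyA tag
    return "clean"
```

    Only reads labelled "clean" would be written to the output in this sketch.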

    5. Parameters

    5.1 filter parameters

    SOAPnuke filter [OPTION]...

    -f, --adapter1  : <s> 3' adapter sequence of fq1 file

    -r, --adapter2  : <s> 5' adapter sequence of fq2 file [only for PE reads]

    -1, --fq1       : <s> fq1 file

    -2, --fq2       : <s> fq2 file, used for PE

    --tile          : <s> tile numbers whose reads are ignored, such as [1101-1104,1205]
    

    the next two options apply only to the adapter sequence:

    -M, --misMatch  : <i> the max mismatch number when matching the adapter (default: [1])

    -A, --matchRatio: <f> adapter's shortest match ratio (default: [0.5])

    -l, --lowQual   : <i> low quality threshold (default: [5])

    -q, --qualRate  : <f> low quality rate (default: [0.5])

    -n, --nRate     : <f> N rate threshold (default: [0.05])

    -m, --mean      : <f> filter reads with low average quality, (<)

    -p, --polyA     : <f> filter poly A, percent of A, 0 means do not filter (default: [ 0 ])

    -d, --rmdup     : <b> remove PCR duplications

    -i, --index     : <b> remove index

    -c, --cut       : <f> the read number you want to keep in each clean fq file (unit: 1024*1024, 0 means do not cut reads)

    -t, --trim      : <s> trim some bp of the read's head and tail, in this order:

    read1's head and tail and read2's head and tail (default: [ 0,0,0,0 ])

    -S, --small     : <b> filter the small insert size
    

    the next two options apply only to filtering the small insert size

    -O, --overlap   : <i> minimum match length (default: [ 10 ])

    -P, --mis       : <f> the maximum mismatch ratio (default: [ 0.1 ])

    -Q, --qualSys   : <i> quality system 1:illumina, 2:sanger (default: [ 1 ])

    -L, --read1Len  : <i> read1 max length (default: all read1 lengths are equal and detected automatically)

    -I, --read2Len  : <i> read2 max length (default: all read2 lengths are equal and detected automatically)

    -G, --sanger    : <b> set clean data quality system to sanger (default: illumina)

    -a, --append    : <s> the log's output place: console or file (default: [console])

    -o, --outDir    : <s> output directory, directory must exist (default: current directory)

    -C, --cleanFq1  : <s> clean fq1 file name

    -D, --cleanFq2  : <s> clean fq2 file name

    -E, --cutAdaptor: <i> cut sequence from adaptor index, performed only when -f/-r are also in use; discard the read when the adaptor index of the read is less than INT

    -b, --BaseNum   : <i> the base number you want to keep in each clean fq file, performed only when -E is also in use

    -R, --rawFq1    : <s> raw fq1 file name

    -W, --rawFq2    : <s> raw fq2 file name

    -5, --seqType   : <i> Sequence fq name type, 0->old fastq name, 1->new fastq name [default: 0]
    
    old fastq name: @FCD1PB1ACXX:4:1101:1799:2201#GAAGCACG/2

    new fastq name: @HISEQ:310:C5MH9ANXX:1:1101:3517:2043 2:N:0:TCGGTCAC
    
    -6, --polyAType : <i> filter poly A type, 0->both reads are poly A, 1->at least one read is poly A, then filter [default: 0]

    -7, --outType   : <i> Add /1, /2 at the end of fastq name, 0:not add, 1:add [default: 0]

    -h, --help      : <b> help

    -v, --version   : <b> show version
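
    To show how the options above combine, here is a hedged Python sketch that assembles a `SOAPnuke filter` command line as an argument list. It uses only flags from the list above; the helper name `build_filter_cmd`, the file names, and the adapter placeholders are hypothetical, and the sketch prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def build_filter_cmd(fq1: str, fq2: Optional[str] = None, *,
                     adapter1: Optional[str] = None,
                     adapter2: Optional[str] = None,
                     low_qual: int = 5, qual_rate: float = 0.5,
                     n_rate: float = 0.05, out_dir: str = ".",
                     clean_fq1: str = "clean_1.fq",
                     clean_fq2: str = "clean_2.fq") -> List[str]:
    """Assemble the argument list for a SOAPnuke filter run (SE or PE)."""
    cmd = ["SOAPnuke", "filter", "-1", fq1]
    if adapter1:
        cmd += ["-f", adapter1]
    if fq2:
        # -2, -r, and -D only make sense for paired-end input.
        cmd += ["-2", fq2]
        if adapter2:
            cmd += ["-r", adapter2]
    cmd += ["-l", str(low_qual), "-q", str(qual_rate), "-n", str(n_rate)]
    cmd += ["-o", out_dir, "-C", clean_fq1]
    if fq2:
        cmd += ["-D", clean_fq2]
    return cmd


if __name__ == "__main__":
    # Placeholder inputs; substitute real file names and adapter sequences.
    example = build_filter_cmd("raw_1.fq", "raw_2.fq",
                               adapter1="ADAPTER1SEQ", adapter2="ADAPTER2SEQ")
    print(shlex.join(example))
```

    The resulting list can be handed to `subprocess.run(cmd, check=True)`. Since `-o` requires an existing directory, the caller would need to create it first.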
    

    5.2 filtersRNA parameters

    SOAPnuke filtersRNA

    -f, --fq         <string> :  fastq file

    usual args:

    -m, --mrna       <switch> :  mrna filter (default: off)

    -n, --polyN      <float>  :  remove polyN [A, T, G, C], 0 means do not filter (default: 0.7)

    -F, --outfq      <string> :  prefix of the output original fq name, e.g. if -F XXX is set, XXX.fq.gz is written; otherwise nothing is written

    -3, --adapter3   <string> :  3' adaptor sequence (default: TCGTATGCCGTCTTCTGCTTG)

    -5, --adapter5   <string> :  5' adaptor sequence (default: GTTCAGAGTTCTACAGTCCGACGATC)

    --tile           <string> :  tile numbers whose reads are ignored, such as [1101-1104,1205]

    -o, --outDir     <string> :  out directory (default: current directory)

    -x, --outPfx     <string> :  out file prefix (default: clean)

    -s, --strict     <switch> :  filter low quality reads strictly (default: off)

    -z, --minSize    <int>    :  small insert size (default: 18)

    -p, --polyA      <float>  :  filter poly A, percent of A, 0 means do not filter (default: 0.7)

    -Q, --qualSys    <int>    :  quality system, 1:illumina, 2:sanger (default: 1)

    -q, --fastq      <switch> :  out file type: on:fastq, off:fasta (default: off)

    -i, --index      <switch> :  remove index

    -G, --sanger     <switch> :  output fq in the sanger quality score system (default: off, illumina)

    -u, --untrim     <switch> :  do not trim 3' adapter (default: off)

    -w, --unlowQ     <switch> :  do not filter low quality reads (default: off)

    -L, --readLen    <int>    :  Max read length in fq file (default: 49)

    -t, --trim       <string> :  trim some bp of the read's head and tail (default: [0,0])

    -c, --cut        <float>  :  the read number you want to keep in each original fq file.

    E.g.: if -c N is set, read number = N * 1024; default: N = 0, which keeps the whole original fq

    -y, --seqType    <int>    :  Sequence fq name type, 0->old fastq name, 1->new fastq name (HiSeq 4000) (default: 0)
    
    help args:

    -a, --append     <string> :  logger's appender: console or file (default: console)

    -h, --help       <switch> :  help

    -v, --version    <switch> :  version information
    
    unusual args:

    find 5' adapter

    -C, --continuous <int>    :  min 5' adapter continuous alignment length (default: 6)

    -A, --alignRate  <float>  :  min alignment rate when finding 5' adapter: alignment/tag (default: 0.8)

    find 3' adapter

    -l, --miniAlign  <int>    :  min alignment length when finding 3' adapter (default: 5)

    -E, --errorRate  <float>  :  Max error rate when finding 3' adapter (mismatch/match) (default: 0.4)

    -M, --misMatch   <int>    :  Max mismatch number when finding 3' adapter (default: 4)
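
    One possible reading of how the three 3' adapter thresholds (-l, -E, -M) interact is sketched below in Python. This is an assumption-laden illustration, not the program's actual search: it compares the read against the adapter prefix position by position (no indels), accepts the leftmost position that satisfies all three limits, and the function name is hypothetical.

```python
from typing import Optional


def find_3p_adapter(read: str, adapter: str, *, min_align: int = 5,
                    max_error_rate: float = 0.4,
                    max_mismatch: int = 4) -> Optional[int]:
    """Return the leftmost read position where the 3' adapter is accepted, else None."""
    for start in range(len(read) - min_align + 1):
        # Overlap is at least min_align bases because of the loop bound (-l).
        overlap = min(len(read) - start, len(adapter))
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        matches = overlap - mismatches
        if mismatches > max_mismatch:                          # -M limit
            continue
        if matches == 0 or mismatches / matches > max_error_rate:  # -E limit
            continue
        return start
    return None
```

    With the defaults, a 5-base overlap at the end of a read tolerates only one mismatch (1/4 = 0.25), while a full-length 21-base match tolerates up to four.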
    

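    7. Change log

    2015/11/11 lishengkang@genomics.cn: upgraded to version 1.5.3. The new version adds the --cutAdaptor option. With this option, SOAPnuke cuts the adapter sequence out of reads instead of discarding reads that contain adapter. A trimmed read must be at least INT bp long; otherwise the whole read is discarded.

    It also adds the --BaseNum option, whose value is the amount of data to keep when cutting data; it takes effect only when --cutAdaptor is selected.

    When --cutAdaptor is selected, the --cut option, which also cuts data, no longer takes effect. From this version on, --adapter1 and --adapter2 accept only an adapter sequence; adapter lists are no longer supported.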
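
    The --cutAdaptor behaviour described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and the position where the adapter starts is taken as an input rather than detected.

```python
from typing import Optional, Tuple


def cut_adaptor(seq: str, qual: str, adapter_index: Optional[int],
                min_len: int) -> Optional[Tuple[str, str]]:
    """Trim a read at the adapter start, or return None when it must be discarded.

    adapter_index is the 0-based position where the adapter begins
    (None when the read contains no adapter); min_len is the INT given to -E.
    """
    if adapter_index is None:
        return seq, qual                 # no adapter: keep the read unchanged
    if adapter_index < min_len:
        return None                      # trimmed read would be shorter than INT
    return seq[:adapter_index], qual[:adapter_index]
```

    For example, with INT = 30 a read whose adapter starts at position 40 is kept as its first 40 bases, while one whose adapter starts at position 12 is dropped.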

    Reference:

    https://weibo.com/p/1001603908643614550165
