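1. Introduction
SOAPnuke is a fastq filtering tool developed in-house by BGI. Its main functions are adapter filtering, low-quality filtering, and filtering of reads with a high proportion of N. The basic filtering functions live in the filter module, which suits most raw fastq data coming off the sequencer. For specific data types, use the filtersRNA, filterDGE, or filterMeta modules.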
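2. Scope
Sequencing platforms: HiSeq 2000, HiSeq 2500, HiSeq 4000, HiSeq X Ten, Zebra
Sequencing strategy: PE/SE
Data types:
filter: filters raw fastq data produced by RNA-seq, RNA-ref, BS, MeDIP, ChIP, RNA de novo, and DNA sequencing.
filtersRNA: small RNA from short-read SE sequencing (mature length is typically about 21-23 nt, tag length 18-30 nt); also applicable to the small RNA degradome pipeline.
filterDGE: not covered in detail here.
filterMeta: not covered in detail here.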
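3. Feature list
3.1 filter features
1) Remove reads that contain adapter (matched allowing one mismatch and a minimum match ratio), or trim the adapter sequence from the reads
2) Remove reads in which bases with a quality value below 10 (5) make up more than 50% of the read
3) Remove reads whose N ratio is greater than 5% (default)
4) Remove poly A reads (RNA): sequences that are 100% A
5) Remove the index (from the sequence ID)
6) Truncate the output to a specified data amount
7) Output clean data and raw data (raw data is output only when the data is truncated)
8) Remove reads whose average quality value is too low
9) Remove reads that come from PCR duplicate fragments
10) Remove reads whose insert size is too small (read1 and read2 overlap >= 10 bp with mismatch <= 10%); intended for DNA de novo data and not performed by default
11) Convert the quality-score system of fastq files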
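
The per-read thresholds in items 2, 3, 4 and 8 can be pictured with a small Python sketch. This is not SOAPnuke's source code: the function names are hypothetical, the Phred+64 offset for the illumina quality system is an assumption, and so is whether each comparison is strict or inclusive. The default values are the ones listed in this document.

```python
def low_quality_fraction(qual: str, threshold: int = 5, offset: int = 64) -> float:
    """Fraction of bases whose Phred score is below `threshold`."""
    if not qual:
        return 0.0
    low = sum(1 for ch in qual if ord(ch) - offset < threshold)
    return low / len(qual)


def mean_quality(qual: str, offset: int = 64) -> float:
    """Average Phred score of a quality string."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def keep_read(seq: str, qual: str, *, low_qual: int = 5, qual_rate: float = 0.5,
              n_rate: float = 0.05, poly_a: float = 0.0, mean=None,
              offset: int = 64) -> bool:
    """Return True if the read passes all of the sketched filters."""
    if not seq or len(seq) != len(qual):
        return False
    upper = seq.upper()
    # Item 2: too many low-quality bases.
    if low_quality_fraction(qual, low_qual, offset) > qual_rate:
        return False
    # Item 3: too many N bases.
    if upper.count("N") / len(upper) > n_rate:
        return False
    # Item 4: poly A filter, where 0 means the filter is off.
    if poly_a > 0 and upper.count("A") / len(upper) >= poly_a:
        return False
    # Item 8: average quality too low (only when a cutoff is given).
    if mean is not None and mean_quality(qual, offset) < mean:
        return False
    return True
```

For paired-end data the same check would presumably be applied to both reads of a pair, but how the two results are combined is not stated here.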
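3.2 filtersRNA features
- Detect low quality (remove reads with poor quality values).
- Detect and trim the 3' adapter (removing two cases: 3' adapter missing, i.e. no 3' adapter detected, and empty inserts, i.e. the tag is too short after trimming).
- Detect the 5' adapter (removing 5' adapter contamination).
- Detect polyA tags (removing tags whose A content reaches 70%; used for small RNA), or detect polyN tags (removing tags in which any one of A, C, T or G reaches 70%; used for the small RNA degradome).
- Detect small fragments (removing fragments < 18 nt).
- If the output is in fq format, convert the quality values to the sanger system.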
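
Three of the steps above are simple enough to sketch in Python: the polyA tag check, the small-fragment check, and the conversion of quality values to the sanger system. The function names are hypothetical, and the sketch assumes that the illumina system means Phred+64 and the sanger system means Phred+33; it illustrates the idea and is not SOAPnuke's implementation.

```python
def is_poly_a_tag(tag: str, min_fraction: float = 0.7) -> bool:
    """True if the share of A in the tag reaches `min_fraction`."""
    return bool(tag) and tag.upper().count("A") / len(tag) >= min_fraction


def is_small_fragment(tag: str, min_size: int = 18) -> bool:
    """True if the tag is shorter than `min_size` nt."""
    return len(tag) < min_size


def illumina_to_sanger(qual: str) -> str:
    """Convert a Phred+64 quality string to Phred+33."""
    converted = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)
```

The defaults 0.7 and 18 mirror the values given in the list; both would normally be configurable.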
4. Parameters
4.1 filter parameters
SOAPnuke filter [OPTION]…
-f, --adapter1 : <s> 3' adapter sequence of fq1 file
-r, --adapter2 : <s> 5' adapter sequence of fq2 file [only for PE reads]
-1, --fq1 : <s> fq1 file
-2, --fq2 : <s> fq2 file, used for PE
--tile : <s> tile numbers whose reads are ignored, such as [1101-1104,1205]
the next two options apply only to adapter matching:
-M, --misMatch : <i> the max mismatch number when matching the adapter (default: [1])
-A, --matchRatio: <f> adapter's shortest match ratio (default: [0.5])
-l, --lowQual : <i> low quality threshold (default: [5])
-q, --qualRate : <f> low quality rate (default: [0.5])
-n, --nRate : <f> N rate threshold (default: [0.05])
-m, --mean : <f> filter reads with low average quality, (<)
-p, --polyA : <f> filter poly A, percent of A, 0 means do not filter (default: [ 0 ])
-d, --rmdup : <b> remove PCR duplications
-i, --index : <b> remove index
-c, --cut : <f> the read number you want to keep in each clean fq file (unit: 1024*1024, 0 means do not cut reads)
-t, --trim : <s> number of bp to trim from the read's head and tail, in the order:
read1's head and tail, then read2's head and tail (default: [ 0,0,0,0 ])
-S, --small : <b> filter reads with a small insert size
the next two options apply only when filtering small insert sizes:
-O, --overlap : <i> minimum match length (default: [ 10 ])
-P, --mis : <f> the maximum mismatch ratio (default: [ 0.1 ])
-Q, --qualSys : <i> quality system 1:illumina, 2:sanger (default: [ 1 ])
-L, --read1Len : <i> read1 max length (default: all read1 lengths are assumed equal and are detected automatically)
-I, --read2Len : <i> read2 max length (default: all read2 lengths are assumed equal and are detected automatically)
-G, --sanger : <b> set the clean data quality system to sanger (default: illumina)
-a, --append : <s> where the log is written: console or file (default: [console])
-o, --outDir : <s> output directory, the directory must exist (default: current directory)
-C, --cleanFq1 : <s> clean fq1 file name
-D, --cleanFq2 : <s> clean fq2 file name
-E, --cutAdaptor: <i> cut the sequence from the adaptor index (only takes effect when -f/-r are also in use); discard the read when its adaptor index is less than INT
-b, --BaseNum : <i> the base number you want to keep in each clean fq file (only takes effect when -E is also in use)
-R, --rawFq1 : <s> raw fq1 file name
-W, --rawFq2 : <s> raw fq2 file name
-5, --seqType : <i> sequence fq name type, 0->old fastq name, 1->new fastq name [default: 0]
old fastq name: @FCD1PB1ACXX:4:1101:1799:2201#GAAGCACG/2
new fastq name: @HISEQ:310:C5MH9ANXX:1:1101:3517:2043 2:N:0:TCGGTCAC
-6, --polyAType : <i> poly A filter type, 0->filter when both reads are poly A, 1->filter when at least one read is poly A [default: 0]
-7, --outType : <i> add /1, /2 at the end of the fastq name, 0:do not add, 1:add [default: 0]
-h, --help : <b> help
-v, --version : <b> show version
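
As an illustration of how these options fit together, the following Python sketch assembles a `SOAPnuke filter` argument list using only short flags listed above. The helper name `build_filter_command`, the file names, and the adapter strings are placeholders invented for this example; real adapter sequences depend on the library kit.

```python
def build_filter_command(fq1, fq2=None, *, adapter1=None, adapter2=None,
                         low_qual=5, qual_rate=0.5, n_rate=0.05,
                         out_dir=".", clean_fq1=None, clean_fq2=None,
                         executable="SOAPnuke"):
    """Build an argument list for `SOAPnuke filter` (SE or PE)."""
    if fq2 is None and (adapter2 or clean_fq2):
        raise ValueError("adapter2 and clean_fq2 only make sense for PE reads")
    cmd = [executable, "filter", "-1", fq1]
    if fq2:
        cmd += ["-2", fq2]
    if adapter1:
        cmd += ["-f", adapter1]
    if adapter2:
        cmd += ["-r", adapter2]
    # Quality and N thresholds, passed explicitly even when left at the defaults.
    cmd += ["-l", str(low_qual), "-q", str(qual_rate), "-n", str(n_rate)]
    # The output directory must already exist.
    cmd += ["-o", out_dir]
    if clean_fq1:
        cmd += ["-C", clean_fq1]
    if clean_fq2:
        cmd += ["-D", clean_fq2]
    return cmd


# Placeholder inputs, not real data or real adapter sequences.
example = build_filter_command(
    "raw_1.fq.gz", "raw_2.fq.gz",
    adapter1="ACGTACGTACGT", adapter2="TGCATGCATGCA",
    out_dir="out", clean_fq1="clean_1.fq.gz", clean_fq2="clean_2.fq.gz",
)
print(" ".join(example))
```

The resulting list can be handed to `subprocess.run(example, check=True)` on a machine where SOAPnuke is installed; building a list rather than a single string avoids shell quoting problems.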
4.2 filtersRNA parameters
SOAPnuke filtersRNA
-f, --fq <string> : fastq file
usual args:
-m, --mrna <switch> : mrna filter (default: off)
-n, --polyN <float> : remove polyN [A, T, G, C], 0 means do not filter (default: 0.7)
-F, --outfq <string> : prefix of the output original fq name, e.g. if -F XXX is set, XXX.fq.gz is written; otherwise it is not written
-3, --adapter3 <string> : 3' adaptor sequence (default: TCGTATGCCGTCTTCTGCTTG)
-5, --adapter5 <string> : 5' adaptor sequence (default: GTTCAGAGTTCTACAGTCCGACGATC)
--tile <string> : tile numbers whose reads are ignored, such as [1101-1104,1205]
-o, --outDir <string> : output directory (default: current directory)
-x, --outPfx <string> : output file prefix (default: clean)
-s, --strict <switch> : filter low quality reads strictly (default: off)
-z, --minSize <int> : small insert size (default: 18)
-p, --polyA <float> : filter poly A, percent of A, 0 means do not filter (default: 0.7)
-Q, --qualSys <int> : quality system, 1:illumina, 2:sanger (default: 1)
-q, --fastq <switch> : output file type: on:fastq, off:fasta (default: off)
-i, --index <switch> : remove index
-G, --sanger <switch> : output fq in the sanger quality score system (default: off, i.e. illumina)
-u, --untrim <switch> : do not trim the 3' adapter (default: off)
-w, --unlowQ <switch> : do not filter low quality reads (default: off)
-L, --readLen <int> : max read length in the fq file (default: 49)
-t, --trim <string> : number of bp to trim from the read's head and tail (default: [0,0])
-c, --cut <float> : the read number you want to keep in each original fq file.
E.g.: if -c N is set, read number = N * 1024; default: N = 0, which keeps the whole original fq
-y, --seqType : <i> sequence fq name type, 0->old fastq name, 1->new fastq name (HiSeq 4000) [default: 0]
help args:
-a, --append <string> : logger's appender: console or file (default: console)
-h, --help <switch> : help
-v, --version <switch> : version information
unusual args:
find 5' adapter
-C, --continuous <int> : minimum continuous alignment length for the 5' adapter (default: 6)
-A, --alignRate <float> : minimum alignment rate when finding the 5' adapter: alignment/tag (default: 0.8)
find 3' adapter
-l, --miniAlign <int> : minimum alignment length when finding the 3' adapter (default: 5)
-E, --errorRate <float> : max error rate when finding the 3' adapter (mismatch/match) (default: 0.4)
-M, --misMatch <int> : max mismatch number when finding the 3' adapter (default: 4)
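
The three 3' adapter options can be read as constraints on an ungapped overlap between the read and the start of the adapter. The Python sketch below is one plausible interpretation, not SOAPnuke's actual algorithm: the function names, the left-to-right scan, and the choice of the first acceptable position are all assumptions. Only the three thresholds, the minimum tag size, and the default 3' adaptor sequence come from this document.

```python
DEFAULT_ADAPTER3 = "TCGTATGCCGTCTTCTGCTTG"  # default of -3 listed above


def find_3p_adapter(read, adapter=DEFAULT_ADAPTER3, min_align=5,
                    max_error_rate=0.4, max_mismatch=4):
    """Return the leftmost position where the adapter appears to start, or -1."""
    for start in range(len(read) - min_align + 1):
        overlap = min(len(read) - start, len(adapter))
        if overlap < min_align:
            break
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        matches = overlap - mismatches
        # Error rate is taken as mismatch/match, as in the option text.
        if (mismatches <= max_mismatch and matches > 0
                and mismatches / matches <= max_error_rate):
            return start
    return -1


def trim_3p_adapter(read, min_size=18, **kwargs):
    """Return the trimmed tag, or None if the read should be dropped."""
    pos = find_3p_adapter(read, **kwargs)
    if pos < 0:
        return None  # no 3' adapter detected
    tag = read[:pos]
    if len(tag) < min_size:
        return None  # empty insert: tag too short after trimming
    return tag
```

With thresholds as permissive as the defaults, a naive scan like this can accept chance matches inside the tag, so a real implementation presumably scores or orders candidates more carefully. Passing `max_mismatch=0` reduces the sketch to exact prefix matching.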
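5. Change log
2015/11/11 lishengkang@genomics.cn: upgraded to version 1.5.3. The new version adds the --cutAdaptor parameter. When it is selected, SOAPnuke trims the adapter sequence out of the reads instead of discarding reads that contain adapter. A trimmed read must be at least INT bp long; otherwise the whole read is discarded.
The --BaseNum parameter was also added. Its value is the amount of data to keep when truncating, and it takes effect only when --cutAdaptor is selected.
When --cutAdaptor is selected, the --cut parameter, which also truncates the data, no longer takes effect. From this version on, the values of --adapter1 and --adapter2 must be adapter sequences; adapter lists are no longer supported.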
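
The --cutAdaptor rule can be summarised in a few lines of Python. This is a sketch of the behaviour as described here, with a hypothetical function name; it assumes the adapter start position has already been found by some matching step and is passed in as `adapter_index` (-1 when no adapter was found).

```python
def cut_adaptor(seq, qual, adapter_index, min_len):
    """Trim a read at the adapter, or return None to discard it."""
    if adapter_index < 0:
        # No adapter found: the read is left untouched.
        return seq, qual
    if adapter_index < min_len:
        # The part before the adapter is shorter than INT bp: discard.
        return None
    # Keep only the bases (and qualities) in front of the adapter.
    return seq[:adapter_index], qual[:adapter_index]
```

Without --cutAdaptor, a read in which an adapter is found would simply be discarded, as in feature 1 of the filter module.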