Trimmomatic data filtering

Author: 线断木偶人 | Published: 2019-05-28 10:29

1. Data filtering

Trimmomatic
Website: http://www.usadellab.org/cms/index.php?page=trimmomatic

Manual (PDF): http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or: 
       -version
Explanation:
The example command (its trimming steps are shown in parentheses) performs the following:

1. Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10).
2. Remove leading low-quality or N bases, i.e. bases below quality 3 at the start of the read (LEADING:3).
3. Remove trailing low-quality or N bases, i.e. bases below quality 3 at the end of the read (TRAILING:3).
4. Scan the read from the 5' end with a 4-base-wide sliding window, cutting once the average quality per base in the window drops below 15 (SLIDINGWINDOW:4:15).
5. Drop reads that are shorter than 36 bases after trimming (MINLEN:36).
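
The five steps above can be assembled into a paired-end command line. The following is only a sketch: the function name and the input and output file names are hypothetical, and the `trimmomatic` executable name is taken from the commands later in this post. It builds the argument list in the order given by the PE usage text and does not run anything.

```python
def build_pe_command(forward, reverse, out_prefix, threads=1):
    """Build a Trimmomatic PE argument list (hypothetical helper, nothing is executed)."""
    outputs = [
        f"{out_prefix}_1P.fq.gz",  # forward reads whose mate also survived
        f"{out_prefix}_1U.fq.gz",  # forward reads whose mate was dropped
        f"{out_prefix}_2P.fq.gz",  # reverse reads whose mate also survived
        f"{out_prefix}_2U.fq.gz",  # reverse reads whose mate was dropped
    ]
    trimmers = [
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",  # step 1: adapter removal
        "LEADING:3",                           # step 2
        "TRAILING:3",                          # step 3
        "SLIDINGWINDOW:4:15",                  # step 4
        "MINLEN:36",                           # step 5
    ]
    return (["trimmomatic", "PE", "-threads", str(threads), "-phred33",
             forward, reverse] + outputs + trimmers)


command = build_pe_command("sample_R1.fq.gz", "sample_R2.fq.gz", "sample")
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where Trimmomatic is installed. The trimmers are applied in the order they appear on the command line.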

Descriptions of the available trimming steps

ILLUMINACLIP: Cut adapter, primer, and other Illumina-specific sequences from the read.
SLIDINGWINDOW: Perform sliding-window trimming from the 5' end of the read, cutting once the average quality within the window falls below a threshold.
LEADING: Cut bases off the start of a read if they are below a threshold quality.
TRAILING: Cut bases off the end of a read if they are below a threshold quality.
CROP: Cut the read to a specified length by removing bases from the end.
HEADCROP: Cut the specified number of bases from the start of the read.
AVGQUAL: Drop the read if its average quality is below the specified level.
MINLEN: Drop the read if, after trimming, it is shorter than a specified length.
TOPHRED33: Convert quality scores to Phred-33.
TOPHRED64: Convert quality scores to Phred-64.

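According to the V0.32 manual linked above, phred-64 is assumed when no quality encoding is specified, and a switch to an auto-detected encoding is announced for a future version. Passing -phred33 or -phred64 explicitly avoids depending on the default.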
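
To make the two encodings concrete: a FASTQ quality character stores the score as an ASCII code, offset by 33 in Phred-33 and by 64 in Phred-64. The sketch below decodes a quality string under either offset; the function name is made up for illustration.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(character) - offset for character in quality_string]


# The same score of 40 is written as 'I' in Phred-33 and as 'h' in Phred-64.
print(decode_qualities("I", offset=33))
print(decode_qualities("h", offset=64))
```

Decoding with the wrong offset shifts every score by 31, which is why thresholds such as LEADING:3 or SLIDINGWINDOW:4:15 only make sense once the encoding is known.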

Log file:

Specifying a trimlog file with -trimlog creates a log of all read trimmings, with the following five fields for each read:

1. the read name (read ID)
2. the surviving sequence length
3. the location of the first surviving base, i.e. the number of bases trimmed from the start
4. the location of the last surviving base in the original read
5. the number of bases trimmed from the end
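
A trimlog line can be split back into these five fields. The sketch below assumes that the fields are separated by single spaces and that the read name itself may contain spaces, so it splits from the right. The sample line is invented for illustration and is not taken from a real log.

```python
from typing import NamedTuple


class TrimLogEntry(NamedTuple):
    read_name: str
    surviving_length: int
    first_surviving_base: int
    last_surviving_base: int
    trimmed_from_end: int


def parse_trimlog_line(line):
    """Parse one trimlog line, assuming space-separated fields (an assumption)."""
    # Split from the right so that spaces inside the read name are preserved.
    name, length, first, last, tail = line.rstrip("\n").rsplit(" ", 4)
    return TrimLogEntry(name, int(length), int(first), int(last), int(tail))


# Hypothetical example line: read name followed by the four numeric fields.
entry = parse_trimlog_line("read_0001 1:N:0:1 90 5 95 5\n")
print(entry)
```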

Parameter details


ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>[:<minAdapterLength>:<keepBothReads>]

            fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc.,
            i.e. every sequence that is treated as contamination. The naming of the various sequences within this file
            determines how they are used (see the manual).

            seedMismatches: specifies the maximum mismatch count in the initial seed search which will still allow
            a full match to be performed, e.g. 2.

            palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads (R1 and R2)
            must be for PE palindrome read alignment, as a minimum alignment score, e.g. 30.

            simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be
            against a read, as a minimum alignment score, usually between 7 and 15.

            minAdapterLength: optional, PE palindrome mode only. Specifies the shortest adapter length that
            palindrome mode will clip. The default is 8 for historical reasons, but palindrome mode can remove
            adapter contamination as short as 1 bp, so this can be set to 1.

            keepBothReads: optional, PE palindrome mode only. After the adapter has been removed, what remains of
            R1 and R2 is exactly reverse-complementary. With the default, false, the R2 read is dropped entirely
            as redundant. Some workflows, for example a bowtie2 pipeline that needs paired reads, require setting
            this to true; otherwise some of the paired reads are lost.
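
The two clip thresholds are alignment scores, so they are easier to read as a number of matching bases. The sketch below rests on an assumption drawn from the Trimmomatic manual rather than from this post: each matching base adds log10(4), about 0.6, to the score. Mismatch penalties are ignored, so the result is the length of a perfect match needed to reach a threshold.

```python
import math

# Assumption (from the Trimmomatic manual, not from this post):
# every matching base contributes log10(4), roughly 0.602, to the alignment score.
MATCH_SCORE = math.log10(4)


def min_perfect_match_length(threshold):
    """Number of perfectly matching bases needed to reach a clip threshold."""
    return math.ceil(threshold / MATCH_SCORE)


for threshold in (7, 10, 15, 30):
    print(threshold, min_perfect_match_length(threshold))
```

Under that assumption a simple clip threshold of 7 to 15 corresponds to roughly 12 to 25 matching bases, and a palindrome clip threshold of 30 to about 50.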

SLIDINGWINDOW:<windowSize>:<requiredQuality>
       windowSize: specifies the window size, i.e. the number of bases to average across.
       requiredQuality: specifies the average quality required within the window.
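
The idea behind sliding-window trimming can be sketched as follows. This is a simplified illustration and not Trimmomatic's actual implementation: it cuts at the start of the first window whose mean quality falls below the threshold, and it keeps reads shorter than the window untouched, which is a choice made for this sketch only.

```python
def sliding_window_keep_length(qualities, window_size, required_quality):
    """Return how many bases to keep from the 5' end (simplified sketch)."""
    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < required_quality:
            return start  # cut where the first failing window begins
    return len(qualities)  # no window failed, keep the whole read


# Ten good bases followed by ten poor ones, with SLIDINGWINDOW:4:15 settings.
qualities = [30] * 10 + [5] * 10
print(sliding_window_keep_length(qualities, window_size=4, required_quality=15))
```

A wider window, such as the 30 used in the commands later in this post, tolerates isolated low-quality bases but reacts more slowly to a drop in quality.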

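LEADING:<quality>
        quality: Specifies the minimum quality required to keep a base.
        Bases at the start of the read are cut for as long as their quality is below this threshold.

TRAILING:<quality>
        quality: Specifies the minimum quality required to keep a base.
        Bases at the end of the read are cut for as long as their quality is below this threshold.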
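
LEADING and TRAILING amount to walking inward from each end of the read until a base meets the threshold. A minimal sketch, with an invented function name and made-up quality values:

```python
def trim_leading_trailing(sequence, qualities, leading, trailing):
    """Cut low-quality bases from both ends; returns the trimmed sequence and qualities."""
    start = 0
    while start < len(qualities) and qualities[start] < leading:
        start += 1  # drop bases from the start that are below the LEADING threshold
    end = len(qualities)
    while end > start and qualities[end - 1] < trailing:
        end -= 1  # drop bases from the end that are below the TRAILING threshold
    return sequence[start:end], qualities[start:end]


print(trim_leading_trailing("NACGTN", [2, 10, 30, 30, 25, 2], leading=20, trailing=20))
```

Only the ends are touched: a low-quality base in the middle of the read stays, which is what SLIDINGWINDOW is for.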

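CROP:<length>
        length: The number of bases to keep, from the start of the read.
        Regardless of base quality, the given number of bases is kept from the start of the read and everything after it is cut, so all reads are cut to the same length.

HEADCROP:<length>
        length: The number of bases to remove from the start of the read.
        Regardless of base quality, the given number of bases is cut from the start of the read.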

MINLEN:<length>
        length: Specifies the minimum length of reads to be kept.
        Reads shorter than this length are discarded; in other words, this is the shortest read length that is kept.
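
CROP, HEADCROP and MINLEN ignore base quality entirely, so they reduce to plain length operations. A small sketch with hypothetical function names:

```python
def crop(sequence, length):
    """CROP: keep at most `length` bases from the start of the read."""
    return sequence[:length]


def headcrop(sequence, length):
    """HEADCROP: remove `length` bases from the start of the read."""
    return sequence[length:]


def passes_minlen(sequence, length):
    """MINLEN: a read is kept only if it has at least `length` bases."""
    return len(sequence) >= length


read = "ACGTACGT"
print(crop(read, 5), headcrop(read, 3), passes_minlen(read, 36))
```

Because MINLEN judges the read as it is at that point in the chain, it is normally placed last, as in the commands in this post.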

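Effect of ILLUMINACLIP on the data

Process the same fastq file twice, once with ILLUMINACLIP and once without; all other parameters are identical.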

trimmomatic SE -phred33 -threads 10 -trimlog trimmomatic.log1 /workdir/test_soap2/out/359/HG2337H/HG2337H.raw.fastq.gz HG2337H1.fastq.gz ILLUMINACLIP:/workdir/Biosoft/anaconda/pkgs/trimmomatic-0.38-1/share/trimmomatic-0.38-1/adapters/TruSeq3-SE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:30:20 MINLEN:36

trimmomatic SE -phred33 -threads 10 -trimlog trimmomatic.log2 /workdir/test_soap2/out/359/HG2337H/HG2337H.raw.fastq.gz HG2337H2.fastq.gz LEADING:20 TRAILING:20 SLIDINGWINDOW:30:20 MINLEN:36

Compare the two results to see what differs.



A quick count: roughly 2-3K reads are affected.

[xmxjy@xmxjy HG2337H]$ wc -l zzzz_uniq 
2646 zzzz_uniq
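
The post does not show how zzzz_uniq was produced. One way such a list could be obtained, sketched here as an assumption, is to compare the two outputs read by read and collect whatever the ILLUMINACLIP run removed from the 3' end. The function names are hypothetical, and only reads whose clipped version is a strict prefix of the unclipped version are considered.

```python
import gzip


def read_fastq(path):
    """Return {read name: sequence} for a gzip-compressed FASTQ file."""
    reads = {}
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            reads[header.split()[0]] = sequence
    return reads


def clipped_suffixes(with_clip, without_clip):
    """Collect the distinct 3' fragments present only in the run without ILLUMINACLIP."""
    suffixes = set()
    for name, shorter in with_clip.items():
        longer = without_clip.get(name)
        if longer and len(shorter) < len(longer) and longer.startswith(shorter):
            suffixes.add(longer[len(shorter):])
    return suffixes


# Usage with the two output files from the commands above (not executed here):
# fragments = clipped_suffixes(read_fastq("HG2337H1.fastq.gz"),
#                              read_fastq("HG2337H2.fastq.gz"))
```

Reads dropped entirely by one of the runs, or trimmed differently at the 5' end, are skipped by this comparison.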

A sample of the fragments that were clipped:

GCAATACATATAGGAAGAGCACA
ACATCGGAAGAGCACACGTCACAACTCCAGACACAATACTCGATGTCGTAT
CCTGGGGTAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGTATGCCGTCTTC
AGATCGGAAGAGCACACGGTCTGAAC
AGATCGGAAGAGCACACGAAAGAACTCC
AGATCGGAAGAGCACACTGAAC
AGATCGGAAGAGCACACCCTCAGA
AGATCGGAAGAGCACACGTCAGCACTAAAGTCACATTACTCGCTGTCGTATGCCGTCTTCTG
ATATCGGAAGAGAAACAGTCTGAACTCCAGTC
AGATAGGAAGAGCACACGTCTGAAACT
CAGAACGGAAGAGC
AGATCGGAAGAGCACACGTCTCCGCCAACGCC
TGATCGGATGAGCACACGTCTGAAC
TGACAGATAGTAAGAGCACACGTCTGAAATCAATTCACATAACTCGATAAAGTATGCCGTCTTCAGCTT
AGATCGGAAGAGCACACTGA
ATATCGGAA
TAGATCGGAAGAGCACACCACAGA
TATATCGGTAGAGCACACGTCTGAACGACA
TTATGGG
AGCAGATCGGAACACACGTCTGAACTCCAGTAACATTACTCGATATCGT
AGATCGGAAGAGCACACGTCAGAAC
GATCCGGAAGAGCACACGTCTGAACTCCAAGT
TAGATCGGAAGAGCACACGTGAGACCTACAGTCACATTACTCGATGTCGTATGCCGTCTTCTGCTTGAAAAAA
GACCTCAAGGAGCACACGTCTGAACTCCAGT
AGATCGGAAGAGCACACGTCAGAA
TCAGATCGGAAGAGCACACGTCACCACTCCAGACACA
AGAGACCCGGTTTCACAGATCG
AGCTCGGAAGCGCACACGTCTG
TACATCTGAAGAGCACACGTCTGAA
GTACATCGGAA
AGATCGGAAGAGCACACCTCTGAACACCACAC
AGATAGGAAGAGCACACGTCTGAACAACACTCACATTACTCGAGCT
CTCTGGTCTAAGCACACGTCTGAACTCAAGTCACAT
GATCGGGAAGAGCACACGTCTGAAC
TAGATCGGAAGAGCACACTGAACTCCAGACACATTACTCGATAT
AAAACGGAAGAGC
TGATATCGGAAGAGCACACGTCAGAAATACAGTCACAT
GAAGA
AGATCCGAAGAGAACACGTCTGAACTCCAGAAACATAACTCGAT
AGATCTCAATAGCACACGTCTGAACTCCAGAAACATAAC
TAGAGAGCGGAAGCGAGCCCGTCTGAACTCCAGTCAC
ATATCGGAAGCTCACACGTCTGAACTCCAGGCACATTAC
AGATCGGAAGAGCACACGACTGAACGCCA
GAGATCGGAAGAGCACACTTCAGAACTCCATACAGAT
AGATCGGAAGAGCACACGTCACAAC
ACAGCGGAAGAGCACACGTC
TAGATC
TGATGGGAAGAGCACACG
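
Many of these fragments resemble the start of an Illumina TruSeq adapter with a few sequencing errors. The sketch below counts position-by-position mismatches between a fragment and an adapter. The adapter string is an assumption: it is the commonly published TruSeq indexed adapter prefix and was not read from the TruSeq3-SE.fa file used above. Insertions and deletions are not handled.

```python
# Assumption: commonly published TruSeq indexed adapter prefix,
# not taken from the adapter file referenced in this post.
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def mismatches_against_adapter(fragment, adapter=ADAPTER):
    """Count differing positions over the overlapping prefix (no indels considered)."""
    return sum(1 for base, expected in zip(fragment, adapter) if base != expected)


# Two fragments copied from the list above.
for fragment in ("AGATCGGAAGAGCACACGTCAGAAC", "ATATCGGAA"):
    print(fragment, mismatches_against_adapter(fragment))
```

Fragments that begin with one or two extra bases before the adapter start would score badly here, since the comparison is anchored at position 0.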

Article link: https://www.haomeiwen.com/subject/ojskzqtx.html