RNA-seq汇总（分析+工具+GATK流程）

作者: 晓佥 | 来源:发表于2019-06-03 10:49 被阅读239次

RNA-seq分析流程

不得不提的一篇文献
Sahraeian, Sayed Mohammad Ebrahim, et al. "Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis." Nature communications 8.1 (2017): 59.
文献地址：https://www.nature.com/articles/s41467-017-00050-4

简单描述：

15个样品（具有各种种系，癌症和干细胞数据集），39个分析工具，约120种组合，约490次分析。
针对各个工具，作者都描述了其性能表现，做出了庞大的比较分析。并在此基础上，作者构建了一个综合性的RNA-seq analysis protocol，即RNACocktail（https://bioinform.github.io/rnacocktail/），囊括了这些工具，免费提供给他人下载使用，以帮助研究人员更好地进行生物学分析。

文献流程：

1.比对
2.有参转录本组装
3.无参转录本组装
4.三代长reads获取转录本
5.转录本定量
6.差异表达
7.突变分析
8.RNA编辑
9.融合基因检测

再详细的内容可以看文献，这里就不在赘述了。

文献工具总结：

注：其中包含了一些软件或工具的重要参数设定

1.比对工具

1.TopHat2： –no-coverage-search http://ccb.jhu.edu/software/tophat/index.shtml

2.STAR: -twopassMode Basic –outFilterType BySJout https://github.com/alexdobin/STAR/releases

3.HISAT2 2.0.1-beta –dta (or –dta-cufflinks) http://www.ccb.jhu.edu/software/hisat/index.shtml

4.RASER 0.52 -b 0.03 https://www.ibp.ucla.edu/research/xiao/RASER.html

2.有参考转录本组装工具

1.Cufflinks 2.2.1 –frag-bias-correct http://cole-trapnell-lab.github.io/cufflinks/

2.StringTie 1.2.1 -v -B http://www.ccb.jhu.edu/software/stringtie/

3.无参考转录本组装工具

1.SOAPdenovoTrans 1.04 -K 25 https://github.com/aquaskyline/SOAPdenovo-Trans/

2.Oases 0.2.09(velveth haslength: 25) http://www.ebi.ac.uk/~zerbino/oases/

(需依赖软件Velvetv1.2.10) (velvetg options: -read trkg yes)

Trinity 2.1.1 –normalize reads https://github.com/trinityrnaseq/trinityrnaseq/wiki
Trimmomatic 0.35 http://www.usadellab.org/cms/index.php?page=trimmomatic

4.三代长read分析工具

1.LoRDEC 0.6 -k 23 -s 3 http://atgc.lirmm.fr/lordec/

2.GMAP 12/31/15 -f 1 http://research-pub.gene.com/gmap/

3. STARlong 2.5.1b https://github.com/alexdobin/STAR/releases

Followed the recommended options :

–outSAMattributes NH HI NM MD –readNameSeparator space

–outFilterMultimapScoreRange 1 –outFilterMismatchNmax 2000 –scoreGapNoncan -20

–scoreGapGCAG -4 –scoreGapATAC -8 –scoreDelOpen -1 –scoreDelBase -1

–scoreInsOpen -1 –scoreInsBase -1 –alignEndsType Local –seedSearchStartLmax 50

–seedPerReadNmax 100000 –seedPerWindowNmax 1000

–alignTranscriptsPerReadNmax 100000 –alignTranscriptsPerWindowNmax 10000

–outSAMstrandField intronMotif –outSAMunmapped Within

4. IDP 0.1.9 https://www.healthcare.uiowa.edu/labs/au/IDP/

5.定量工具

1. eXpress 1.5.1 (bowtie2 v2.2.7) https://pachterlab.github.io/eXpress/index.html

(bowtie2 options: -a -X 600 –rdg 6,5 –rfg 6,5 –score-min L,-.6,-.4 –no-discordant –no-mixed)

2. kallisto 0.42.4 http://pachterlab.github.io/kallisto/about.html

3. Sailfish 0.9.0 http://www.cs.cmu.edu/~ckingsf/software/sailfish/

4. Salmon-Aln 0.6.1 https://github.com/COMBINE-lab/salmon

5. Salmon-SMEM 0.6.1 index: –type fmd quant: -k,19 https://github.com/COMBINE-lab/salmon

6. Salmon-Quasi 0.6.1 index: –type quasi -k 31 https://github.com/COMBINE-lab/salmon

7. featureCounts 1.5.0-p1 -p -B -C http://subread.sourceforge.net/

6.差异表达分析工具

1. DESeq2 1.14.1 http://bioconductor.org/packages/release/bioc/html/DESeq2.html

2. edgeR 3.16.5 http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

3. limma 3.30.7 http://bioconductor.org/packages/release/bioc/html/limma.html

4. Cuffdiff 2.2.1 –frag-bias-correct –emit-count-tables http://cole-trapnell-lab.github.io/cufflinks/

Ballgown 2.6.0 https://github.com/alyssafrazee/ballgown

6. sleuth 0.28.1 https://github.com/pachterlab/sleuth

7. 变异分析工具

1. SAMtools 1.2 samtools mpileup -C50 -d 100000 https://github.com/samtools/samtools

(bcftools v1.2) bcftools filter -s LowQual -e ’%QUAL<20 -DP>10000’ https://github.com/samtools/bcftools

2.GATK v3.5-0-g36282e4 (picard 1.129) https://software.broadinstitute.org/gatk/download/

Picard AddOrReplaceReadGroups: SO=coordinate

Picard MarkDuplicates: CREATE INDEX=true VALIDATION STRINGENCY=SILENTGATK

SplitNCigarReads: -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW N CIGAR READSGATK

HaplotypeCaller: -stand call conf 20.0 -stand emit conf 20.0 -A StrandBiasBySample -A StrandAlleleCountsBySampleGATK

VariantFiltration: -window 35 -cluster 3 -filterName FS -filter“FS >30.0” -filterName QD -filter “QD <2.0”

8.RNA编辑

1. GIREMI 0.2.1 https://github.com/zhqingit/giremi

2. Varsim 0.5.1 https://github.com/bioinform/varsim http://bioinform.github.io/varsim/

9.基因融合

1.FusionCatcher 0.99.5a beta https://github.com/ndaniel/fusioncatcher

2.JAFFA 1.0.6 https://github.com/Oshlack/JAFFA

3.SOAPfuse 1.27 http://soap.genomics.org.cn/soapfuse.html

4.STAR-Fusion 0.7.0 https://github.com/STAR-Fusion/STAR-Fusion

5.TopHat-Fusion 2.0.14 http://ccb.jhu.edu/software/tophat/fusion_index.shtml

RNA-seq GATK Workflow

1. Mapping to the reference

评估了所有专门用于RNAseq比对的主要软件包，发现使用STAR能够对SNP和重要的indels实现最高的灵敏度.同时使用STAR 2-pass方法

下面是STAR 2-pass 比对步骤:

STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19\ mkdir $genomeDir
**STAR** --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

Alignment jobs were executed as follows:

runDir=/path/to/1pass\ mkdir $runDir \cd $runDir
**STAR** --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass\ mkdir $genomeDir
**STAR** --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
注：sjdbOverhang 100
int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)

The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass\ mkdir $runDir\ cd $runDir
**STAR** --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

上面的步骤生成一个SAM文件，然后我们通过常规的Picard处理步骤：添加read组信息，排序，标记重复和索引.

java -jar picard.jar **AddOrReplaceReadGroups**I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample
java -jar picard.jar **MarkDuplicates** I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

3. Split'N'Trim and reassign mapping qualities

接下来，我们使用专为RNAseq开发的名为SplitNCigarReads的新GATK工具，该工具将read分成外显子片段（除去Ns但保持分组信息）并硬切割任何突出到内含子区域的序列。

在这一步，我们还添加了一个重要的调整：我们需要重新分配映射质量，因为STAR指定了良好的对齐，MAPQ为255（技术上意味着“未知”，因此对GATK毫无意义）。所以我们使用GATK的ReassignOneMappingQuality读取过滤器将所有良好的比对重新分配到默认值60.这不是理想的，我们希望将来RNAseq的映射器会发出有意义的质量分数，但同时这是我们能做的最好的。实际上，我们通过将ReassignOneMappingQuality读取过滤器添加到splitter命令来实现此目的。

最后，请务必指定允许使用ALLOW_N_CIGAR_READS。目前这仍然被归类为“不安全”选项，但这种分类将改变以反映这一事实，现在这是RNAseq处理的支持选项.

java -jar GenomeAnalysisTK.jar -T **SplitNCigarReads** -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
注：-U,--unsafe <unsafe>
Enable unsafe operations: nothing will be checked at runtime (ALLOW_N_CIGAR_READS|ALLOW_UNINDEXED_BAM|ALLOW_UNSET_BAM_SORT_ORDER|NO_READ_ORDER_VERIFICATION|ALLOW_SEQ_DICT_INCOMPATIBILITY|LENIENT_VCF_PROCESSING|ALL)

4. Indel Realignment (optional)

可选项.
java -jar GenomeAnalysisTK.jar -T **RealignerTargetCreator** -I split.bam -Rref.fasta-o split.intervals
java -jar GenomeAnalysisTK.jar -T **IndelRealigner** -I split.bam-R ref.fasta-targetIntervals split.intervals -o split.intervals.bam

5. Base Recalibration

可选项，建议运行BQSR.
java -jar GenomeAnalysisTK.jar -T **BaseRecalibrator** -R ref.fasta-I split.intervals.bam -knownSites dbsnp_138.hg19.vcf -mte -nct 4 -o split.recal.table
java -jar GenomeAnalysisTK.jar -T **PrintReads -**R ref.fasta-BQSR split.recal.table -I split.intervals.bam -o split.intervals.real.bam

6. Variant calling

我们为变体调用代码添加了一些功能，它将智能地考虑由SplitNCigarReads嵌入在BAM文件中的内含子 - 外显子分裂区域的信息。简而言之，新代码将执行“悬空头合并”操作，并避免使用软剪辑基础（这是一个临时解决方案），以尽量减少假阳和假阴。要调用此新功能，只需将-dontUseSoftClippedBases添加到常规HC命令行即可。此外，我们发现如果我们设置最小phred-scaled置信度阈值20去Call突变，我们会得到更好的结果，但如果需要，可以降低此值以提高灵敏度。.

java -jar GenomeAnalysisTK.jar -T **HaplotypeCaller** -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

用hard filters过滤生成的突变，因为还没有RNAseq运行VQSR所需的training/truth资源.

java -jar GenomeAnalysisTK.jar -T **VariantFiltration** -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

如果更关心灵敏度并且愿意容忍更多假阳，可以选择不进行过滤（或使用限制较少的阈值）.

网友评论

本文标题：RNA-seq汇总（分析+工具+GATK流程）

本文链接：https://www.haomeiwen.com/subject/uywqxctx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

RNA-seq汇总（分析+工具+GATK流程）

RNA-seq分析流程

简单描述：

文献流程：

文献工具总结：

RNA-seq GATK Workflow

1. Mapping to the reference

2. Add read groups, sort, mark duplicates, and create index

3. Split'N'Trim and reassign mapping qualities

4. Indel Realignment (optional)

5. Base Recalibration

6. Variant calling

7. Variant filtering

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

生信学习笔记

生信工具

高质量生信文章收录

生信专题

转录组实战

BSA