GATK Best Practices — step4 转录组S

作者: 小灰灰不会飞那又怎样 | 来源:发表于2019-05-23 15:13 被阅读143次
一、转录组SNP/INDEL(RNAseq SNPs + Indels)的介绍

GATK官网介绍:RNAseq short variant discovery (SNPs + Indels)
这一步主要是对单个样品分别进行分析。

官方给出了一套流程,并指出这个流程并不完善,说明文档等还有待开发:
RNAseq short variant per-sample calling

git clone https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels.git
二、WDL中各个task的介绍

总共包含了16个task,这些task中有一些重复,展示了GATK3和GATK4对同一步骤的使用:

1. SamToFastq

功能:ubam2fq

${gatk_path} \
SamToFastq \
--INPUT ${unmapped_bam} \
--VALIDATION_STRINGENCY SILENT \
--FASTQ ${base_name}.1.fastq.gz \
--SECOND_END_FASTQ ${base_name}.2.fastq.gz

忽略

2、3两步是比对, 用到了STAR这个软件。STAR是GATK官方推荐转录组比对软件。
比对也可以替换成其他软件如bowite2,后面的具体执行步骤会省略STAR这一步,直接用bowite2得到的bam进行call变异。

2. StarGenerateReferences

功能:建立Index

STAR \
--runMode genomeGenerate \
--genomeDir STAR2_5 \
--genomeFastaFiles ${ref_fasta} \
--sjdbGTFfile ${annotations_gtf} \
${"--sjdbOverhang "+(read_length-1)} \
--runThreadN ${threads}

示例:

STAR --runMode genomeGenerate --genomeDir Genome --genomeFastaFiles ref.fa  --runThreadN 6  --genomeChrBinNbits 9 --genomeSAindexNbases 7
3. StarAlign

功能:比对

STAR \
--genomeDir STAR2_5 \
--runThreadN ${threads} \
--readFilesIn ${fastq1} ${fastq2} \
--readFilesCommand "gunzip -c" \
${"--sjdbOverhang "+(read_length-1)} \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--limitBAMsortRAM ${star_mem+"000000000"} \
--limitOutSJcollapsed ${default=1000000 star_limitOutSJcollapsed} \
--outFileNamePrefix ${base_name}.

示例:

cd Alignment/C1R1/ && STAR --genomeDir Genome --readFilesIn C1R1_1.fq C1R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
cd Alignment/C1R2/ && STAR --genomeDir Genome --readFilesIn C1R2_1.fq C1R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
cd Alignment/C2R1/ && STAR --genomeDir Genome --readFilesIn C2R1_1.fq C2R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
cd Alignment/C2R2/ && STAR --genomeDir Genome --readFilesIn C2R2_1.fq C2R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
cd Alignment/C3R1/ && STAR --genomeDir Genome --readFilesIn C3R1_1.fq C3R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
cd Alignment/C3R2/ && STAR --genomeDir Genome --readFilesIn C3R2_1.fq C3R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang  125  --outSAMstrandField intronMotif  --alignSoftClipAtReferenceEnds No  --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM  60000000000 --twopassMode Basic
4. MergeBamAlignment

功能:合并bam

${gatk_path} \
MergeBamAlignment \
--REFERENCE_SEQUENCE ${ref_fasta} \
--UNMAPPED_BAM ${unaligned_bam} \
--ALIGNED_BAM ${star_bam} \
--OUTPUT ${base_name}.bam \
--INCLUDE_SECONDARY_ALIGNMENTS false \
--PAIRED_RUN False \
--VALIDATION_STRINGENCY SILENT
5. MarkDuplicates

功能:dup的标记

${gatk_path} \
MarkDuplicates \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam  \
--CREATE_INDEX true \
--VALIDATION_STRINGENCY SILENT \
--METRICS_FILE ${base_name}.metrics

6、7步骤只是GATK3/4版本不同

6. SplitNCigarReads

功能:它会将落在内含子区间的 reads 片段直接切除,并对 MAPQ 进行调整。
参考:https://www.jianshu.com/p/b400dc7c5eea
官网介绍:Calling variants in RNAseq

java -jar /usr/gitc/GATK35.jar \
-T SplitNCigarReads \
-R ${ref_fasta} \
-I ${input_bam} \
-o ${base_name}.bam \
-rf ReassignOneMappingQuality \
-RMQF 255 \
-RMQT 60 \
-U ALLOW_N_CIGAR_READS

GATK3示例:

java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C1R1.bam.dedupped.bam -o C1R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C1R2.bam.dedupped.bam -o C1R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C2R1.bam.dedupped.bam -o C2R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C2R2.bam.dedupped.bam -o C2R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C3R1.bam.dedupped.bam -o C3R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C3R2.bam.dedupped.bam -o C3R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
7. SplitNCigarReads_GATK4
${gatk_path} \
SplitNCigarReads \
-R ${ref_fasta} \
-I ${input_bam} \
-O ${base_name}.bam

BQSR对应8、9两步:对于碱基质量较好的忽略此步骤。
如果学习可以看之前的介绍:GATK Best Practices — step1 数据预处理(Data Pre-processing)

8. BaseRecalibrator
${gatk_path} --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-Xloggc:gc_log.log -Xms4000m" \
BaseRecalibrator \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${recal_output_file} \
-known-sites ${dbSNP_vcf} \
-known-sites ${sep=" --known-sites " known_indels_sites_VCFs}
9. ApplyBQSR
${gatk_path} \
--java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:gc_log.log \
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
ApplyBQSR \
--add-output-sam-program-record \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${base_name}.bam \
--bqsr-recal-file ${recalibration_report}
10. HaplotypeCaller

功能:变异检测

java -jar /usr/gitc/GATK35.jar \
-T HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-dontUseSoftClippedBases \
-stand_call_conf ${default=20 stand_call_conf} \
--dbsnp ${dbSNP_vcf} \
-o ${base_name}.vcf.gz

GATK3示例:

这个例子拆开bam跑:

java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=HaplotypeCaller -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -nct 5 -mte --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o C1R1.bam.split.bam.1.gvcf -L chr_allocation.1.list  -I C1R1.bam.split.bam
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=HaplotypeCaller -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -nct 5 -mte --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o C1R1.bam.split.bam.2.gvcf -L chr_allocation.2.list  -I C1R1.bam.split.bam

然后vcf合并:

java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -V C1R1.bam.split.bam.1.gvcf -V C1R1.bam.split.bam.2.gvcf -V C1R1.bam.split.bam.3.gvcf -V C1R1.bam.split.bam.4.gvcf -V C1R1.bam.split.bam.5.gvcf -V C1R1.bam.split.bam.6.gvcf -V C1R1.bam.split.bam.7.gvcf -V C1R1.bam.split.bam.8.gvcf -V C1R1.bam.split.bam.9.gvcf -V C1R1.bam.split.bam.10.gvcf -V C1R1.bam.split.bam.11.gvcf -V C1R1.bam.split.bam.12.gvcf -V C1R1.bam.split.bam.13.gvcf -V C1R1.bam.split.bam.14.gvcf -V C1R1.bam.split.bam.15.gvcf -V C1R1.bam.split.bam.16.gvcf -V C1R1.bam.split.bam.17.gvcf -V C1R1.bam.split.bam.18.gvcf -V C1R1.bam.split.bam.19.gvcf -V C1R1.bam.split.bam.20.gvcf -V C1R1.bam.split.bam.21.gvcf -V C1R1.bam.split.bam.22.gvcf -V C1R1.bam.split.bam.23.gvcf -V C1R1.bam.split.bam.24.gvcf -V C1R1.bam.split.bam.25.gvcf -V C1R1.bam.split.bam.26.gvcf -V C1R1.bam.split.bam.27.gvcf -V C1R1.bam.split.bam.28.gvcf -V C1R1.bam.split.bam.29.gvcf -V C1R1.bam.split.bam.30.gvcf -o C1R1.bam.split.bam.gvcf
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -V C1R2.bam.split.bam.1.gvcf -V C1R2.bam.split.bam.2.gvcf -V C1R2.bam.split.bam.3.gvcf -V C1R2.bam.split.bam.4.gvcf -V C1R2.bam.split.bam.5.gvcf -V C1R2.bam.split.bam.6.gvcf -V C1R2.bam.split.bam.7.gvcf -V C1R2.bam.split.bam.8.gvcf -V C1R2.bam.split.bam.9.gvcf -V C1R2.bam.split.bam.10.gvcf -V C1R2.bam.split.bam.11.gvcf -V C1R2.bam.split.bam.12.gvcf -V C1R2.bam.split.bam.13.gvcf -V C1R2.bam.split.bam.14.gvcf -V C1R2.bam.split.bam.15.gvcf -V C1R2.bam.split.bam.16.gvcf -V C1R2.bam.split.bam.17.gvcf -V C1R2.bam.split.bam.18.gvcf -V C1R2.bam.split.bam.19.gvcf -V C1R2.bam.split.bam.20.gvcf -V C1R2.bam.split.bam.21.gvcf -V C1R2.bam.split.bam.22.gvcf -V C1R2.bam.split.bam.23.gvcf -V C1R2.bam.split.bam.24.gvcf -V C1R2.bam.split.bam.25.gvcf -V C1R2.bam.split.bam.26.gvcf -V C1R2.bam.split.bam.27.gvcf -V C1R2.bam.split.bam.28.gvcf -V C1R2.bam.split.bam.29.gvcf -V C1R2.bam.split.bam.30.gvcf -o C1R2.bam.split.bam.gvcf
11. HaplotypeCaller_GATK4
${gatk_path} --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-O ${base_name}.vcf.gz \
-dont-use-soft-clipped-bases \
--standard-min-confidence-threshold-for-calling ${default=20 stand_call_conf} \
--dbsnp ${dbSNP_vcf}

如果需要得到gvcf,需要加上参数:-ERC GVCF

12. VariantFiltration

功能:过滤

 ${gatk_path} \
VariantFiltration \
--R ${ref_fasta} \
--V ${input_vcf} \
--window 35 \
--cluster 3 \
--filter-name "FS" \
--filter "FS > 30.0" \
--filter-name "QD" \
--filter "QD < 2.0" \
-O ${base_name}
13. MergeVCFs

功能:vcf合并

${gatk_path} --java-options "-Xms2000m"  \
MergeVcfs \
--INPUT ${sep=' --INPUT=' input_vcfs} \
--OUTPUT ${output_vcf_name}

这是一个GATK3合并不同样品gvcf的示例:

java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -o final.vcf.rawgvcf -V C1R1.bam.split.bam.gvcf -V C1R2.bam.split.bam.gvcf -V C2R1.bam.split.bam.gvcf -V C2R2.bam.split.bam.gvcf -V C3R1.bam.split.bam.gvcf -V C3R2.bam.split.bam.gvcf 

GATK对合并不同样品vcf并不友好,支持合并gvcf。
这一步忽略,用bcftools代替。
如果需要用GATK合并,应该在上面HaplotypeCaller这一步补充参数 -ERC GVCF。这样得到的vcf是gvcf格式,可以用GATK合并。

14. ScatterIntervalList
15. ScatterIntervalList_GATK4
16. RevertSam
${gatk_path} \
RevertSam \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam \
--VALIDATION_STRINGENCY SILENT \
--ATTRIBUTE_TO_CLEAR FT \
--ATTRIBUTE_TO_CLEAR CO \
--SORT_ORDER ${sort_order}
三、执行命令

以上,转录组SNP/INDEL(RNAseq SNPs + Indels)这一步就学习完了,下面具体执行一下(用到了以上四个 task:5/7/11/12)

运行命令:

time gatk MarkDuplicates --INPUT sample1.sorted.bam --OUTPUT sample1.rmdup.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE sample1.duplicate_metrics
time gatk SplitNCigarReads -R ucsc.hg19.fasta -I sample1.rmdup.bam -O sample1.SplitNC.bam
time gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" HaplotypeCaller -R ucsc.hg19.fasta -I sample1.SplitNC.bam -L bed -O sample1.vcf.gz -dont-use-soft-clipped-bases --dbsnp dbsnp_138.hg19.vcf
time gatk VariantFiltration --R ucsc.hg19.fasta --V sample1.vcf.gz --window 35 --cluster 3 --filter-name FS --filter "FS > 30.0" --filter-name QD --filter "QD < 2.0" -O sample1.pass.vcf.gz

得到:

sample1.rmdup.bam
sample1.rmdup.bai
sample1.duplicate_metrics
sample1.SplitNC.bam
sample1.SplitNC.bai
sample1.vcf.gz
sample1.vcf.gz.tbi
sample1.pass.vcf.gz
sample1.pass.vcf.gz.idx

四步的运行时间:

real    3m8.185s
real    15m51.060s
real    37m44.360s
real    1m16.533s

以上四步完成GATK对RNA数据的call 变异。如果有多个样品,分别运行即可。

四、补充

下面再补充一点使用bcftools对不同样本vcf进行合并。

bcftools merge sample1.pass.vcf.vcf.gz sample2.pass.vcf.vcf.gz sample3.pass.vcf.vcf.gz -o all.vcf 

得到:

$ grep -v "##" all.vcf | head -5
#CHROM  POS ID  REF ALT QUALFILTER  INFOFORMAT  sample1   sample2   sample3
chrM  1985  .   TAG T   13.91   QD  BaseQRankSum=-0.362;ExcessHet=3.0103;FS=0;MQ=50;MQRankSum=0;QD=1.07;ReadPosRankSum=-0.538;SOR=0.551;DP=13;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1GT:AD:DP:GQ:PL  0/1:11,2:13:51:51,0,449 ./.:.:.:.:. ./.:.:.:.:.
chrM  2354  .   C   T   19329.8 PASSExcessHet=3.0103;FS=0;MQ=50;QD=29.87;SOR=0.82;DP=1347;AF=1;MLEAC=2;MLEAF=1;AN=6;AC=6GT:AD:DP:GQ:PL  1/1:0,398:398:99:11918,1197,0   1/1:0,310:310:99:9006,929,0 1/1:0,638:638:99:19358,1915,0
chrM  2485  .   C   T   38116.8 PASSExcessHet=3.0103;FS=0;MQ=50;QD=31.12;SOR=0.861;DP=2418;AF=1;MLEAC=2;MLEAF=1;AN=6;AC=6   GT:AD:DP:GQ:PL  1/1:0,757:757:99:23583,2276,0   1/1:0,539:539:99:17035,1623,0   1/1:0,1121:1121:99:38145,3381,0
chrM  2619  .   A   T   23769.8 PASSBaseQRankSum=0.596;ExcessHet=3.0103;FS=0;MQ=50;MQRankSum=0;QD=17.96;ReadPosRankSum=-0.735;SOR=0.713;DP=3734;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3  GT:AD:DP:GQ:PL  0/1:249,652:901:99:16215,0,4464 0/1:247,562:809:99:13205,0,4175 0/1:710,1079:1789:99:23798,0,14296

注意:输入vcf文件需要是bgzip的,如果是未经过压缩的,需要先进行压缩并建index。(bcftools功能还很强大,可以统计(stat)、画图(plot-vcfstats)等,这里就不介绍了。)

bgzip -c -f  vcf >  vcf.gz
bcftools index -t vcf.gz

这一步的学习基本完成。
由于胚系突变CNV(Germline CNVs)官网还没有明确发布,这一个step暂时搁置。
下一个step是对体细胞突变CNV(Somatic CNVs)的学习

相关文章

网友评论

    本文标题:GATK Best Practices — step4 转录组S

    本文链接:https://www.haomeiwen.com/subject/eywvzqtx.html