一、转录组SNP/INDEL(RNAseq SNPs + Indels)的介绍
GATK官网介绍:RNAseq short variant discovery (SNPs + Indels)
这一步主要是对单个样品分别进行分析。
官方给出了一套流程,并指出这个流程并不完善,说明文档等还有待开发:
RNAseq short variant per-sample calling
git clone https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels.git
二、WDL中各个task的介绍
总共包含了16个task,这些task中有一些重复,展示了GATK3和GATK4对同一步骤的使用:
1. SamToFastq
功能:ubam2fq
${gatk_path} \
SamToFastq \
--INPUT ${unmapped_bam} \
--VALIDATION_STRINGENCY SILENT \
--FASTQ ${base_name}.1.fastq.gz \
--SECOND_END_FASTQ ${base_name}.2.fastq.gz
忽略
2、3两步是比对, 用到了STAR这个软件。STAR是GATK官方推荐转录组比对软件。
比对也可以替换成其他软件如bowite2,后面的具体执行步骤会省略STAR这一步,直接用bowite2得到的bam进行call变异。
2. StarGenerateReferences
功能:建立Index
STAR \
--runMode genomeGenerate \
--genomeDir STAR2_5 \
--genomeFastaFiles ${ref_fasta} \
--sjdbGTFfile ${annotations_gtf} \
${"--sjdbOverhang "+(read_length-1)} \
--runThreadN ${threads}
示例:
STAR --runMode genomeGenerate --genomeDir Genome --genomeFastaFiles ref.fa --runThreadN 6 --genomeChrBinNbits 9 --genomeSAindexNbases 7
3. StarAlign
功能:比对
STAR \
--genomeDir STAR2_5 \
--runThreadN ${threads} \
--readFilesIn ${fastq1} ${fastq2} \
--readFilesCommand "gunzip -c" \
${"--sjdbOverhang "+(read_length-1)} \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--limitBAMsortRAM ${star_mem+"000000000"} \
--limitOutSJcollapsed ${default=1000000 star_limitOutSJcollapsed} \
--outFileNamePrefix ${base_name}.
示例:
cd Alignment/C1R1/ && STAR --genomeDir Genome --readFilesIn C1R1_1.fq C1R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
cd Alignment/C1R2/ && STAR --genomeDir Genome --readFilesIn C1R2_1.fq C1R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
cd Alignment/C2R1/ && STAR --genomeDir Genome --readFilesIn C2R1_1.fq C2R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
cd Alignment/C2R2/ && STAR --genomeDir Genome --readFilesIn C2R2_1.fq C2R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
cd Alignment/C3R1/ && STAR --genomeDir Genome --readFilesIn C3R1_1.fq C3R1_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
cd Alignment/C3R2/ && STAR --genomeDir Genome --readFilesIn C3R2_1.fq C3R2_2.fq --runThreadN 6 --twopass1readsN 10000000000 --sjdbOverhang 125 --outSAMstrandField intronMotif --alignSoftClipAtReferenceEnds No --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 60000000000 --twopassMode Basic
4. MergeBamAlignment
功能:合并bam
${gatk_path} \
MergeBamAlignment \
--REFERENCE_SEQUENCE ${ref_fasta} \
--UNMAPPED_BAM ${unaligned_bam} \
--ALIGNED_BAM ${star_bam} \
--OUTPUT ${base_name}.bam \
--INCLUDE_SECONDARY_ALIGNMENTS false \
--PAIRED_RUN False \
--VALIDATION_STRINGENCY SILENT
5. MarkDuplicates
功能:dup的标记
${gatk_path} \
MarkDuplicates \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam \
--CREATE_INDEX true \
--VALIDATION_STRINGENCY SILENT \
--METRICS_FILE ${base_name}.metrics
6、7步骤只是GATK3/4版本不同
6. SplitNCigarReads
功能:它会将落在内含子区间的 reads 片段直接切除,并对 MAPQ 进行调整。
参考:https://www.jianshu.com/p/b400dc7c5eea
官网介绍:Calling variants in RNAseq
java -jar /usr/gitc/GATK35.jar \
-T SplitNCigarReads \
-R ${ref_fasta} \
-I ${input_bam} \
-o ${base_name}.bam \
-rf ReassignOneMappingQuality \
-RMQF 255 \
-RMQT 60 \
-U ALLOW_N_CIGAR_READS
GATK3示例:
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C1R1.bam.dedupped.bam -o C1R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C1R2.bam.dedupped.bam -o C1R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C2R1.bam.dedupped.bam -o C2R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C2R2.bam.dedupped.bam -o C2R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C3R1.bam.dedupped.bam -o C3R1.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=SplitNTrim -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fa -I C3R2.bam.dedupped.bam -o C3R2.bam.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
7. SplitNCigarReads_GATK4
${gatk_path} \
SplitNCigarReads \
-R ${ref_fasta} \
-I ${input_bam} \
-O ${base_name}.bam
BQSR对应8、9两步:对于碱基质量较好的忽略此步骤。
如果学习可以看之前的介绍:GATK Best Practices — step1 数据预处理(Data Pre-processing)
8. BaseRecalibrator
${gatk_path} --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-Xloggc:gc_log.log -Xms4000m" \
BaseRecalibrator \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${recal_output_file} \
-known-sites ${dbSNP_vcf} \
-known-sites ${sep=" --known-sites " known_indels_sites_VCFs}
9. ApplyBQSR
${gatk_path} \
--java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:gc_log.log \
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
ApplyBQSR \
--add-output-sam-program-record \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${base_name}.bam \
--bqsr-recal-file ${recalibration_report}
10. HaplotypeCaller
功能:变异检测
java -jar /usr/gitc/GATK35.jar \
-T HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-dontUseSoftClippedBases \
-stand_call_conf ${default=20 stand_call_conf} \
--dbsnp ${dbSNP_vcf} \
-o ${base_name}.vcf.gz
GATK3示例:
这个例子拆开bam跑:
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=HaplotypeCaller -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -nct 5 -mte --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o C1R1.bam.split.bam.1.gvcf -L chr_allocation.1.list -I C1R1.bam.split.bam
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=HaplotypeCaller -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -nct 5 -mte --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o C1R1.bam.split.bam.2.gvcf -L chr_allocation.2.list -I C1R1.bam.split.bam
然后vcf合并:
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -V C1R1.bam.split.bam.1.gvcf -V C1R1.bam.split.bam.2.gvcf -V C1R1.bam.split.bam.3.gvcf -V C1R1.bam.split.bam.4.gvcf -V C1R1.bam.split.bam.5.gvcf -V C1R1.bam.split.bam.6.gvcf -V C1R1.bam.split.bam.7.gvcf -V C1R1.bam.split.bam.8.gvcf -V C1R1.bam.split.bam.9.gvcf -V C1R1.bam.split.bam.10.gvcf -V C1R1.bam.split.bam.11.gvcf -V C1R1.bam.split.bam.12.gvcf -V C1R1.bam.split.bam.13.gvcf -V C1R1.bam.split.bam.14.gvcf -V C1R1.bam.split.bam.15.gvcf -V C1R1.bam.split.bam.16.gvcf -V C1R1.bam.split.bam.17.gvcf -V C1R1.bam.split.bam.18.gvcf -V C1R1.bam.split.bam.19.gvcf -V C1R1.bam.split.bam.20.gvcf -V C1R1.bam.split.bam.21.gvcf -V C1R1.bam.split.bam.22.gvcf -V C1R1.bam.split.bam.23.gvcf -V C1R1.bam.split.bam.24.gvcf -V C1R1.bam.split.bam.25.gvcf -V C1R1.bam.split.bam.26.gvcf -V C1R1.bam.split.bam.27.gvcf -V C1R1.bam.split.bam.28.gvcf -V C1R1.bam.split.bam.29.gvcf -V C1R1.bam.split.bam.30.gvcf -o C1R1.bam.split.bam.gvcf
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -V C1R2.bam.split.bam.1.gvcf -V C1R2.bam.split.bam.2.gvcf -V C1R2.bam.split.bam.3.gvcf -V C1R2.bam.split.bam.4.gvcf -V C1R2.bam.split.bam.5.gvcf -V C1R2.bam.split.bam.6.gvcf -V C1R2.bam.split.bam.7.gvcf -V C1R2.bam.split.bam.8.gvcf -V C1R2.bam.split.bam.9.gvcf -V C1R2.bam.split.bam.10.gvcf -V C1R2.bam.split.bam.11.gvcf -V C1R2.bam.split.bam.12.gvcf -V C1R2.bam.split.bam.13.gvcf -V C1R2.bam.split.bam.14.gvcf -V C1R2.bam.split.bam.15.gvcf -V C1R2.bam.split.bam.16.gvcf -V C1R2.bam.split.bam.17.gvcf -V C1R2.bam.split.bam.18.gvcf -V C1R2.bam.split.bam.19.gvcf -V C1R2.bam.split.bam.20.gvcf -V C1R2.bam.split.bam.21.gvcf -V C1R2.bam.split.bam.22.gvcf -V C1R2.bam.split.bam.23.gvcf -V C1R2.bam.split.bam.24.gvcf -V C1R2.bam.split.bam.25.gvcf -V C1R2.bam.split.bam.26.gvcf -V C1R2.bam.split.bam.27.gvcf -V C1R2.bam.split.bam.28.gvcf -V C1R2.bam.split.bam.29.gvcf -V C1R2.bam.split.bam.30.gvcf -o C1R2.bam.split.bam.gvcf
11. HaplotypeCaller_GATK4
${gatk_path} --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-O ${base_name}.vcf.gz \
-dont-use-soft-clipped-bases \
--standard-min-confidence-threshold-for-calling ${default=20 stand_call_conf} \
--dbsnp ${dbSNP_vcf}
如果需要得到gvcf,需要加上参数:-ERC GVCF
12. VariantFiltration
功能:过滤
${gatk_path} \
VariantFiltration \
--R ${ref_fasta} \
--V ${input_vcf} \
--window 35 \
--cluster 3 \
--filter-name "FS" \
--filter "FS > 30.0" \
--filter-name "QD" \
--filter "QD < 2.0" \
-O ${base_name}
13. MergeVCFs
功能:vcf合并
${gatk_path} --java-options "-Xms2000m" \
MergeVcfs \
--INPUT ${sep=' --INPUT=' input_vcfs} \
--OUTPUT ${output_vcf_name}
这是一个GATK3合并不同样品gvcf的示例:
java -XX:ParallelGCThreads=5 -Xmx30g -Djava.io.tmpdir=CombineGVCFs -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa -o final.vcf.rawgvcf -V C1R1.bam.split.bam.gvcf -V C1R2.bam.split.bam.gvcf -V C2R1.bam.split.bam.gvcf -V C2R2.bam.split.bam.gvcf -V C3R1.bam.split.bam.gvcf -V C3R2.bam.split.bam.gvcf
GATK对合并不同样品vcf并不友好,支持合并gvcf。
这一步忽略,用bcftools代替。
如果需要用GATK合并,应该在上面HaplotypeCaller这一步补充参数 -ERC GVCF。这样得到的vcf是gvcf格式,可以用GATK合并。
14. ScatterIntervalList
15. ScatterIntervalList_GATK4
16. RevertSam
${gatk_path} \
RevertSam \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam \
--VALIDATION_STRINGENCY SILENT \
--ATTRIBUTE_TO_CLEAR FT \
--ATTRIBUTE_TO_CLEAR CO \
--SORT_ORDER ${sort_order}
三、执行命令
以上,转录组SNP/INDEL(RNAseq SNPs + Indels)这一步就学习完了,下面具体执行一下(用到了以上四个 task:5/7/11/12)
运行命令:
time gatk MarkDuplicates --INPUT sample1.sorted.bam --OUTPUT sample1.rmdup.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE sample1.duplicate_metrics
time gatk SplitNCigarReads -R ucsc.hg19.fasta -I sample1.rmdup.bam -O sample1.SplitNC.bam
time gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" HaplotypeCaller -R ucsc.hg19.fasta -I sample1.SplitNC.bam -L bed -O sample1.vcf.gz -dont-use-soft-clipped-bases --dbsnp dbsnp_138.hg19.vcf
time gatk VariantFiltration --R ucsc.hg19.fasta --V sample1.vcf.gz --window 35 --cluster 3 --filter-name FS --filter "FS > 30.0" --filter-name QD --filter "QD < 2.0" -O sample1.pass.vcf.gz
得到:
sample1.rmdup.bam
sample1.rmdup.bai
sample1.duplicate_metrics
sample1.SplitNC.bam
sample1.SplitNC.bai
sample1.vcf.gz
sample1.vcf.gz.tbi
sample1.pass.vcf.gz
sample1.pass.vcf.gz.idx
四步的运行时间:
real 3m8.185s
real 15m51.060s
real 37m44.360s
real 1m16.533s
以上四步完成GATK对RNA数据的call 变异。如果有多个样品,分别运行即可。
四、补充
下面再补充一点使用bcftools对不同样本vcf进行合并。
bcftools merge sample1.pass.vcf.vcf.gz sample2.pass.vcf.vcf.gz sample3.pass.vcf.vcf.gz -o all.vcf
得到:
$ grep -v "##" all.vcf | head -5
#CHROM POS ID REF ALT QUALFILTER INFOFORMAT sample1 sample2 sample3
chrM 1985 . TAG T 13.91 QD BaseQRankSum=-0.362;ExcessHet=3.0103;FS=0;MQ=50;MQRankSum=0;QD=1.07;ReadPosRankSum=-0.538;SOR=0.551;DP=13;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1GT:AD:DP:GQ:PL 0/1:11,2:13:51:51,0,449 ./.:.:.:.:. ./.:.:.:.:.
chrM 2354 . C T 19329.8 PASSExcessHet=3.0103;FS=0;MQ=50;QD=29.87;SOR=0.82;DP=1347;AF=1;MLEAC=2;MLEAF=1;AN=6;AC=6GT:AD:DP:GQ:PL 1/1:0,398:398:99:11918,1197,0 1/1:0,310:310:99:9006,929,0 1/1:0,638:638:99:19358,1915,0
chrM 2485 . C T 38116.8 PASSExcessHet=3.0103;FS=0;MQ=50;QD=31.12;SOR=0.861;DP=2418;AF=1;MLEAC=2;MLEAF=1;AN=6;AC=6 GT:AD:DP:GQ:PL 1/1:0,757:757:99:23583,2276,0 1/1:0,539:539:99:17035,1623,0 1/1:0,1121:1121:99:38145,3381,0
chrM 2619 . A T 23769.8 PASSBaseQRankSum=0.596;ExcessHet=3.0103;FS=0;MQ=50;MQRankSum=0;QD=17.96;ReadPosRankSum=-0.735;SOR=0.713;DP=3734;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3 GT:AD:DP:GQ:PL 0/1:249,652:901:99:16215,0,4464 0/1:247,562:809:99:13205,0,4175 0/1:710,1079:1789:99:23798,0,14296
注意:输入vcf文件需要是bgzip的,如果是未经过压缩的,需要先进行压缩并建index。(bcftools功能还很强大,可以统计(stat)、画图(plot-vcfstats)等,这里就不介绍了。)
bgzip -c -f vcf > vcf.gz
bcftools index -t vcf.gz
这一步的学习基本完成。
由于胚系突变CNV(Germline CNVs)官网还没有明确发布,这一个step暂时搁置。
下一个step是对体细胞突变CNV(Somatic CNVs)的学习
网友评论