美文网首页生信分析流程
GATK Best Practices — step3 体细胞突

GATK Best Practices — step3 体细胞突

作者: 小灰灰不会飞那又怎样 | 来源:发表于2019-05-16 10:17 被阅读0次
    一、体细胞突变 SNV/INDEL(Somatic SNVs + Indels)的介绍

    GATK官网介绍:Somatic short variant discovery (SNVs + Indels)

    单个样本/成对样本都可以进行这一步

    官方给出了两套流程(都在一个GIT下面):
    Somatic short variants tumor-normal pair
    Somatic short variants PON creation

    git clone https://github.com/gatk-workflows/gatk4-somatic-snvs-indels
    

    共有四套wdl:
    mutect2-normal-normal.wdl
    mutect2.wdl (single tumor-normal pair or on a single tumor sample用于一对N/T样品或者单个tumour情况)
    mutect2_nio.wdl
    mutect2_pon.wdl(Create a Mutect2 panel of normals 用于创建多个normal的模型)

    二、WDL中各个task的介绍

    这次学习采用的是mutect2.wdl流程。总共包含了14个task,一一来学习一下:
    GATK官网:(How to) Call somatic mutations using GATK4 Mutect2

    总共包含了14个task:

    1. CramToBam

    忽略

    2. SplitIntervals

    功能:将Bed进行拆分,由于我采用的bed并不大,这一步忽略。

    gatk --java-options "-Xmx5000m" SplitIntervals \
    -R ${ref_fasta} \
    ${"-L " + intervals} \
    -scatter 50 \
    -O interval-files \
    ${split_intervals_extra_args}
    

    -scatter这个参数表示拆分成多少个区域。

    Mutect2这一步主要的参考是3、9、10几步,主要看一下。
    GATK官网介绍: (How to) Call somatic mutations using GATK4 Mutect2

    3. M2

    这一步可以根据成对/单样品来进行分析,下面实例是成对,包含了以下3个步骤:

    1)GetSampleName

    GetSampleName is a BETA tool and is not yet ready for use in production
    这一步忽略

    2)Mutect2

    gatk --java-options "-Xmx5000m" Mutect2 \
    -R ${ref_fasta} \
    -I ${tumor_bam} -tumor tumor_NAME \
    -I ${normal_bam} -normal normal_NAME \
    --germline-resource   af-only-gnomad.raw.sites.hg19.vcf.gz \
    ${"-pon " + pon} \
    ${"-L " + intervals} \
    ${"--alleles " + gga_vcf} \
    -O "\${output_vcf}" \
    ${true='--bam-output bamout.bam' false='' make_bamout} \
    ${true='--f1r2-tar-gz f1r2.tar.gz' false='' run_ob_filter} \
    ${m2_extra_args}
    

    --germline-resource:Population vcf of germline sequencing containing allele fractions
    GATK官网问答介绍:MuTect2 beta --germline_resource for build b37vcf下载地址
    只有b37和hg38版本,没有hg19版本,hg19版本下载(但是来源并不保证),该链接来自于官网的问答

    -pon是多个N组成的PON文件,只有一对样品的情况可以忽略

    -L 区域文件,可以是一个Bed

    --alleles:FeatureInput The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES 忽略

    --bam-output,-bamout:String File to which assembled haplotypes should be written
    输出bam

    3)GetPileupSummaries

    Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production
    

    GATK官网有关GetPileupSummaries的介绍:GetPileupSummaries

    功能:统计,得到一个六列的表格,计算的是给定的VCF中的位点的count数

    These must be created, even if they remain empty, as cromwell doesn't support optional output
    这一步的结果是必须存在的,或者可以直接touch两个文件

    touch tumor-pileups.table
    touch normal-pileups.table
    

    tumor:

    gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries  \
    -R ${ref_fasta}  \
    -I ${tumor_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
    -V ${variants_for_contamination} -L ${variants_for_contamination}  \
    -O tumor-pileups.table
    

    normal:

    gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries  \
    -R ${ref_fasta}  \
    -I ${normal_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
    -V ${variants_for_contamination} -L ${variants_for_contamination}  \
    -O normal-pileups.table
    

    成对样品分别做这一步,如果没有N样品,忽略掉第二步;

    -V必须参数 输入的是vcf格式 GATK官网下载
    The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
    small_exac_common_3_hg19.vcf 示例如下:

    $ grep -v "#" small_exac_common_3_hg19.vcf | head
    chr1    17365   .       C       G       826621  InbreedingCoeff_Filter  AF=0.136;REMAP_ALIGN=FP
    chr1    17385   .       G       A       592354  InbreedingCoeff_Filter  AF=0.122;REMAP_ALIGN=FP
    chr1    30548   .       T       G       1923.65 VQSRTrancheSNP99.60to99.80      AF=0.081;REMAP_ALIGN=FP
    chr1    69761   .       A       T       2041400 PASS    AF=0.113;REMAP_ALIGN=FP
    chr1    139213  .       A       G       1634390 PASS    AF=0.233;REMAP_ALIGN=FP
    chr1    139233  .       C       A       1489200 PASS    AF=0.231;REMAP_ALIGN=FP
    chr1    567783  rs142895724     T       C       120608  PASS    AF=0.120;REMAP_ALIGN=FP
    chr1    567825  .       C       T       96811.60        PASS    AF=0.066;REMAP_ALIGN=FP
    chr1    721757  rs189147642     T       A       349942  PASS    AF=0.051;REMAP_ALIGN=FP
    chr1    739142  rs2340527       T       A       53624.30        PASS    AF=0.073;REMAP_ALIGN=FP
    

    -L支持多个参数输入 Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. 如果-V和-L区域一样,可以省略-L。

    GetPileupSummaries这一步得到:

    $ head samplename_T.pileups.table
    contig  position        ref_count       alt_count       other_alt_count allele_frequency
    chr1    8064578 1140    1179    1       0.144
    chr1    65311214        2290    0       0       0.061
    chr1    65311262        2299    1       1       0.151
    chr1    97981395        1101    987     1       0.192
    chr1    98165091        1956    0       2       0.086
    chr2    29416615        2375    0       0       0.091
    chr2    29416794        1634    0       2       0.129
    chr2    47693788        2203    1       0       0.088
    chr2    47703500        898     859     2       0.114
    
    4. MergeVCFs

    合并vcf 忽略

    5. MergeBamOuts

    忽略

    6. MergeStats

    忽略

    7. MergePileupSummaries

    忽略

    8. LearnReadOrientationModel

    忽略

    9. CalculateContamination
    gatk --java-options "-Xmx${command_mem}m"  \
    CalculateContamination -I ${tumor_pileups} \
    -O contamination.table --tumor-segmentation segments.table  \
    ${"-matched " + normal_pileups}
    

    --tumor-segmentation,-segments:File The output table containing segmentation of the tumor by minor allele fraction
    -matched 如果没有N样品,不需要这个参数

    The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.(来源:https://software.broadinstitute.org/gatk/documentation/article?id=11136

    得到:

    segments.table
    contamination.table
    
    10. Filter
    gatk --java-options "-Xmx${command_mem}m" FilterMutectCalls -V ${unfiltered_vcf} \
    -R ${ref_fasta} \
    -O ${output_vcf} \
    ${"--contamination-table " + contamination_table} \
    
    11. FilterAlignmentArtifacts
    gatk --java-options "-Xmx${command_mem}m" FilterAlignmentArtifacts \
    -V ${input_vcf} \
    -I ${bam} \
    --bwa-mem-index-image ${realignment_index_bundle} \
    ${realignment_extra_args} \
    -O ${output_vcf}
    
    12. oncotate_m2
    13. SumFloats

    功能:Calculates sum of a list of floats

    14. Funcotate
    三、执行命令
    1. 小panel的测试:
    time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L Covered.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
    time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_T.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf  -O samplename_T.pileups.table
    time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_N.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf  -O samplename_N.pileups.table
    time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table  --tumor-segmentation segments.table
    time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R  ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table
    

    得到:

    Mutect2.vcf.gz
    Mutect2.vcf.gz.tbi
    Mutect2.bam
    Mutect2.bai
    samplename_T.pileups.table
    samplename_N.pileups.table
    segments.table
    contamination.table
    Mutect2FilteringStats.tsv
    Mutect2.filter.vcf
    Mutect2.filter.vcf.idx
    

    运行时间:

    $ grep real Mutect2.sh.e
    real    28m7.830s (Mutect2这一步运行时间较长)
    real    2m36.866s
    real    3m6.991s
    real    1m7.443s
    real    1m8.615s
    

    最后得到:

    $ grep -v "#" Mutect2.filter.vcf | head -3
    chr1    8064650 .       AAAAC   A       .       artifact_in_normal;str_contraction      DP=4156;ECNT=1;NLOD=652.40;N_ART_LOD=12.45;POP_AF=8.590e-03;P_CONTAM=0.00;P_GERMLINE=-1.004e+03;RPA=4,3;RU=AAAC;STR;TLOD=15.08      GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2301,9:0.068:2310:1186,3:1115,6:30,30:294,302:60:62:0   0/1:1249,9:7.890e-03:1258:669,5:580,4:30,30:190,316:60:23:0:0.010,0.010,7.154e-03:1.469e-03,6.913e-04,0.998
    chr1    162749855       .       TTC     T       .       artifact_in_normal;str_contraction      DP=3631;ECNT=1;NLOD=535.21;N_ART_LOD=34.95;POP_AF=1.000e-06;P_CONTAM=0.00;P_GERMLINE=-8.309e+02;RPA=8,7;RU=TC;STR;TLOD=33.52        GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2120,35:0.060:2155:1027,19:1093,16:30,30:293,308:60:40:0        0/1:1151,29:0.022:1180:404,7:747,22:30,30:195,213:60:44:0:0.020,0.020,0.025:9.423e-03,7.577e-04,0.990
    chr1    201754410       .       CTTTTTT C,CT,CTT,CTTT,CTTTT,CTTTTT,CTTTTTTT     .       artifact_in_normal;germline_risk;multiallelic   DP=3984;ECNT=1;NLOD=472.90,322.80,52.53,-5.898e+02,-1.771e+03,-2.565e+03,391.62;N_ART_LOD=5.75,17.32,37.19,173.03,780.25,1811.50,28.21;POP_AF=7.803e-04,2.467e-03,0.010,0.058,0.545,1.000e-06,1.000e-06;P_CONTAM=0.00;P_GERMLINE=-4.905e+02,-3.158e+02,-2.000e+01,0.00,0.00,0.00,-3.794e+02;RPA=17,11,12,13,14,15,16,18;RU=T;STR;TLOD=8.13,49.36,85.34,219.24,582.57,881.43,28.23       GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB     0/0:287,9,18,34,134,500,898,50:0.017,0.023,0.032,0.077,0.237,0.406,0.042:1930:146,4,8,13,54,250,442,32:141,5,10,21,80,250,456,18:30,30,30,30,30,30,30,30:287,316,299,297,286,297,291,265:60,60,60,60,60,60,60:28,38,37,46,45,42,40:0     0/1/2/3/4/5/6/7:134,10,32,63,143,358,467,40:6.001e-03,0.024,0.045,0.107,0.287,0.387,0.027:1247:62,5,13,32,55,134,201,21:72,5,19,31,88,224,266,19:30,30,30,30,30,30,30,30:181,195,179,207,197,193,190,190:60,60,60,60,60,60,60:24,36,35,46,37,41,36:0:0.768,0.758,0.777:8.021e-03,0.061,0.931
    

    其表头有对FILTER和FORMAT列的解释,如:

    ##FILTER=<ID=artifact_in_normal,Description="artifact_in_normal">
    ##FILTER=<ID=germline_risk,Description="Evidence indicates this site is germline, not somatic">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
    ... ...
    
    2. WES的测试:
    time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L all_exon.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
    time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I samplename_T.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_T.pileups.table
    time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_N.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_N.pileups.table
    time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table  --tumor-segmentation segments.table
    time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R  ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table
    

    所用时间:

    $ grep real wes_mutect.sh.e
    real    139m0.688s
    real    19m26.328s
    real    14m53.911s
    real    1m8.589s
    real    1m9.738s
    
    3. 小panel拆分Bed的测试:

    从上面时间上来看,这一步所用时间还是很长,这次拆分bed来测试一下,会比上面增加三步(拆分/mutect2 call vcf/合并vcf)

    1)拆分

    time gatk --java-options "-Xmx5000m" SplitIntervals -R ucsc.hg19.fasta -L  Covered.bed -scatter 30 -O interval-files
    

    real 1m7.952s

    2)mutect2 call vcf

    time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0000-scattered.intervals -O interval-files/0000-scattered.intervals.vcf --bam-output interval-files/0000-scattered.intervals.Mutect2.bam
    time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0001-scattered.intervals -O interval-files/0001-scattered.intervals.vcf --bam-output interval-files/0001-scattered.intervals.Mutect2.bam
    ... ...
    ... ...
    
    $ grep real *
    Mutect2_00002.sh.e86950:real    2m18.126s
    Mutect2_00003.sh.e86951:real    3m15.701s
    Mutect2_00004.sh.e86952:real    2m53.857s
    ... ...
    ... ...
    

    这一步是并行运行,按照最长时间计算是:3m15.701s

    3)合并vcf

    GATK3中采用:CombineVariants

    java -XX:ParallelGCThreads=20 -Xmx40g -Djava.io.tmpdir=swp -jar /share/nas2/genome/biosoft/GATK/GenomeAnalysisTK.jar -T CombineVariants -R bwa_Ref/ucsc.hg19.fasta -V Chr_Result/chr1.swp.vcf -V Chr_Result/chr10.swp.vcf -V Chr_Result/chr11.swp.vcf -V Chr_Result/chr12.swp.vcf -V Chr_Result/chr13.swp.vcf -V Chr_Result/chr14.swp.vcf -V Chr_Result/chr15.swp.vcf -V Chr_Result/chr16.swp.vcf -V Chr_Result/chr17.swp.vcf -V Chr_Result/chr18.swp.vcf -V Chr_Result/chr19.swp.vcf -V Chr_Result/chr2.swp.vcf -V Chr_Result/chr20.swp.vcf -V Chr_Result/chr21.swp.vcf -V Chr_Result/chr22.swp.vcf -V Chr_Result/chr3.swp.vcf -V Chr_Result/chr4.swp.vcf -V Chr_Result/chr5.swp.vcf -V Chr_Result/chr6.swp.vcf -V Chr_Result/chr7.swp.vcf -V Chr_Result/chr8.swp.vcf -V Chr_Result/chr9.swp.vcf -V Chr_Result/chrX.swp.vcf  --disable_auto_index_creation_and_locking_when_reading_rods -o samplename_T.swp.all --genotypemergeoption UNSORTED
    

    GATK4中CombineVariants已经没有,有三个合并vcf的功能:
    MergeVcfs (Picard) Combines multiple variant files into a single variant file
    GatherVcfs (Picard) Gathers multiple VCF files from a scatter operation into a single VCF file
    对于合并不同染色体的vcf,我们采用GatherVcfs,CatVariants, MergeVcfs or GatherVcfs
    GatherVcfsCloud (BETA Tool) Gathers multiple VCF files from a scatter operation into a single VCF file

    命令:

    time gatk  --java-options "-Xmx5000m" GatherVcfs -I interval-files/0000-scattered.intervals.vcf -I interval-files/0001-scattered.intervals.vcf -I interval-files/0002-scattered.intervals.vcf -I interval-files/0003-scattered.intervals.vcf -I interval-files/0004-scattered.intervals.vcf -I interval-files/0005-scattered.intervals.vcf -I interval-files/0006-scattered.intervals.vcf -I interval-files/0007-scattered.intervals.vcf -I interval-files/0008-scattered.intervals.vcf -I interval-files/0009-scattered.intervals.vcf -I interval-files/0010-scattered.intervals.vcf -I interval-files/0011-scattered.intervals.vcf -I interval-files/0012-scattered.intervals.vcf -I interval-files/0013-scattered.intervals.vcf -I interval-files/0014-scattered.intervals.vcf -I interval-files/0015-scattered.intervals.vcf -I interval-files/0016-scattered.intervals.vcf -I interval-files/0017-scattered.intervals.vcf -I interval-files/0018-scattered.intervals.vcf -I interval-files/0019-scattered.intervals.vcf -I interval-files/0020-scattered.intervals.vcf -I interval-files/0021-scattered.intervals.vcf -I interval-files/0022-scattered.intervals.vcf -I interval-files/0023-scattered.intervals.vcf -I interval-files/0024-scattered.intervals.vcf -I interval-files/0025-scattered.intervals.vcf -I interval-files/0026-scattered.intervals.vcf -I interval-files/0027-scattered.intervals.vcf -I interval-files/0028-scattered.intervals.vcf -I interval-files/0029-scattered.intervals.vcf -O gather.raw.vcf
    

    real 0m15.819s

    bed不拆分和拆分得到的vcf基本一致,少量位点不同。
    时间上,不经过bed分割的mutect2需要28m,经过bed分割的只需要4m 提速了6倍。
    -L参数 Intervals and interval lists的介绍

    之后的步骤和bed不拆分流程相同。

    这一步的学习基本完成。
    下一个step是对转录组SNP/INDEL(RNAseq SNPs + Indels)的学习

    相关文章

      网友评论

        本文标题:GATK Best Practices — step3 体细胞突

        本文链接:https://www.haomeiwen.com/subject/gkxvaqtx.html