一、体细胞突变CNV(Somatic CNVs)的介绍
GATK官网对这一步的介绍:Somatic copy number variant discovery (CNVs)
官网给出的Somatic CNVs和Germline CNVs的区别,指出两个流程不能通用:
Copy number alterations/aberrations (CNAs) are changes in copy number that have arisen in somatic tissue (for example, just in a tumor), copy number variations (CNVs) originated from changes in copy number in germline cells (and are thus in all cells of the organism). The algorithms to detect the two are different.
官网给出了两套流程(下载链接相同):
Somatic CNV case sample Case BAM to CNV universal yes b37
Somatic CNV PON creation Normal BAMs to PON universal yes b37
git clone https://github.com/gatk-workflows/gatk4-somatic-cnvs
得到:
-rw-rw-r-- 1 test test 6019 5月 23 22:48 README.md
-rw-rw-r-- 1 test test 1514 5月 23 22:48 LICENSE
-rw-rw-r-- 1 test test 6915 5月 23 22:48 cnv_somatic_pair_workflow.b37.inputs.json
-rw-rw-r-- 1 test test 2947 5月 23 22:48 cnv_somatic_oncotator_workflow.wdl
-rw-rw-r-- 1 test test 17188 5月 23 22:48 cnv_common_tasks.wdl ##cnv前期步骤
-rw-rw-r-- 1 test test 9949 5月 23 22:48 cnv_somatic_panel_workflow.wdl ##cnv_pon合并步骤
-rw-rw-r-- 1 test test 3020 5月 23 22:48 cnv_somatic_panel_workflow.b37.inputs.json
-rw-rw-r-- 1 test test 41298 5月 23 22:48 cnv_somatic_pair_workflow.wdl ##cnv_pair步骤
二、WDL中各个task的介绍
这一部分主要学习的是cnv_common_tasks.wdl、cnv_somatic_panel_workflow.wdl、cnv_somatic_pair_workflow.wdl这3个workflow。
对各个task的介绍:Tool Documentation Index
cnv_common_tasks.wdl 介绍了CNV的一些前期必须步骤,包含了7个task:
1. PreprocessIntervals
官网介绍:PreprocessIntervals
功能:对bed文件的处理 Prepares bins for coverage collection.
The input intervals are first checked for overlapping intervals, which are merged. The resulting intervals are then padded. The padded intervals are then split into bins. Finally, bins that contain only Ns are filtered out.
命令:
gatk --java-options "-Xmx${command_mem_mb}m" PreprocessIntervals \
${"-L " + intervals} \
${"-XL " + blacklist_intervals} \
--sequence-dictionary ${ref_fasta_dict} \
--reference ${ref_fasta} \
--padding ${default="250" padding} \
--bin-length ${default="1000" bin_length} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${base_filename}.preprocessed.interval_list
示例:
time gatk --java-options "-Xmx5000m" PreprocessIntervals -L Covered.bed --reference ucsc.hg19.fasta --output samplename.preprocessed.interval_list --padding 250 --bin-length 0 --interval-merging-rule OVERLAPPING_ONLY
得到:
$ tail -5 samplename.preprocessed.interval_list
chrX 6694137066942109 + .
chrX 6694237866943117 + .
chrX 6694323666943975 + .
chrX 100610778 100611517 + .
chrX 100629847 100630586 + .
2. AnnotateIntervals
功能:Annotates intervals with GC content, and optionally, mappability and segmental-duplication content. The output may optionally be used as input to CreateReadCountPanelOfNormals, DenoiseReadCounts, and GermlineCNVCaller.
官网介绍:AnnotateIntervals
命令:
gatk --java-options "-Xmx${command_mem_mb}m" AnnotateIntervals \
-L ${intervals} \
--reference ${ref_fasta} \
${"--mappability-track " + mappability_track_bed} \
${"--segmental-duplication-track " + segmental_duplication_track_bed} \
--feature-query-lookahead ${default=1000000 feature_query_lookahead} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${base_filename}.annotated.tsv
示例:
time gatk --java-options "-Xmx5000m" AnnotateIntervals -L Covered.bed --reference ucsc.hg19.fasta --output samplename.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY
得到:
$ grep -v "@" samplename.annotated.tsv | head
CONTIG START END GC_CONTENT
chr1 8064510 8064689 0.461111
chr1 8073233 8074492 0.471429
chr1 8075275 8075736 0.426407
chr1 11169297111694760.394444
chr1 11174332111745710.541667
chr1 11181990111822290.554167
chr1 11184503111847420.483333
chr1 11187014111872530.479167
chr1 11187652111878910.512500
3. FilterIntervals
功能:对指定区域的过滤
Given specified intervals, annotated intervals output by AnnotateIntervals, and/or counts output by CollectReadCounts, outputs a filtered Picard interval list. The set intersection of intervals from the specified intervals, the annotated intervals, and the first count file will be taken as the initial set of intervals on which to perform filtering. Parameters for filtering based on the annotations and counts can be adjusted. Annotation-based filters will be applied first, followed by count-based filters. The result may be passed via -L to other tools (e.g., DetermineGermlineContigPloidy and GermlineCNVCaller) to mask intervals from analysis.
官网介绍:FilterIntervals
命令:
gatk --java-options "-Xmx${command_mem_mb}m" FilterIntervals \
-L ${intervals} \
${"-XL " + blacklist_intervals} \
${"--annotated-intervals " + annotated_intervals} \
${if defined(read_count_files) then "--input " else ""} ${sep=" --input " read_count_files} \
--minimum-gc-content ${default="0.1" minimum_gc_content} \
--maximum-gc-content ${default="0.9" maximum_gc_content} \
--minimum-mappability ${default="0.9" minimum_mappability} \
--maximum-mappability ${default="1.0" maximum_mappability} \
--minimum-segmental-duplication-content ${default="0.0" minimum_segmental_duplication_content} \
--maximum-segmental-duplication-content ${default="0.5" maximum_segmental_duplication_content} \
--low-count-filter-count-threshold ${default="5" low_count_filter_count_threshold} \
--low-count-filter-percentage-of-samples ${default="90.0" low_count_filter_percentage_of_samples} \
--extreme-count-filter-minimum-percentile ${default="1.0" extreme_count_filter_minimum_percentile} \
--extreme-count-filter-maximum-percentile ${default="99.0" extreme_count_filter_maximum_percentile} \
--extreme-count-filter-percentage-of-samples ${default="90.0" extreme_count_filter_percentage_of_samples} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${base_filename}.filtered.interval_list
示例(报错):
time gatk FilterIntervals -L samplename.preprocessed.interval_list --annotated-intervals samplename.annotated.tsv -O filtered_intervals.interval_list
不过我在做这一步的过程中,总是遇到报错,这一步对后面影响不大,先把这一步忽略吧,同时也不用做上面那一步AnnotateIntervals。
4. CollectCounts
官网介绍:CollectCounts
功能:Collects read counts at specified intervals. The count for each interval is calculated by counting the number of read starts that lie in the interval.
命令:
gatk --java-options "-Xmx${command_mem_mb}m" CollectReadCounts \
-L ${intervals} \
--input ${bam} \
--reference ${ref_fasta} \
--format ${default="HDF5" format} \
--interval-merging-rule OVERLAPPING_ONLY \
--output ${counts_filename}
示例:
time gatk --java-options "-Xmx5000m" CollectReadCounts -L samplename.preprocessed.interval_list --reference ucsc.hg19.fasta --input samplename_T.recal.bam --format HDF5 --interval-merging-rule OVERLAPPING_ONLY --output samplename_T.count.HDF5
time gatk --java-options "-Xmx5000m" CollectReadCounts -L samplename.preprocessed.interval_list --reference ucsc.hg19.fasta --input samplename_N.recal.bam --format HDF5 --interval-merging-rule OVERLAPPING_ONLY --output samplename_N.count.HDF5
得到两个二进制文件
5. CollectAllelicCounts
功能:
Collects reference and alternate allele counts at specified sites. The alt count is defined as the total count minus the ref count, and the alt nucleotide is defined as the non-ref base with the highest count, with ties broken by the order of the bases in AllelicCountCollector#BASES. Only reads that pass the specified read filters and bases that exceed the specified minimum-base-quality will be counted.
官网介绍:CollectAllelicCounts
这一步可选,后面ModelSegments步骤中可以选择是否将这一步的输出作为输入。这一步中各个样品需要分别进行这一步。
命令:
gatk --java-options "-Xmx${command_mem_mb}m" CollectAllelicCounts \
-L ${common_sites} \
--input ${bam} \
--reference ${ref_fasta} \
--minimum-base-quality ${default="20" minimum_base_quality} \
--output ${allelic_counts_filename}
示例:
time gatk --java-options "-Xmx5000m" CollectAllelicCounts -L samplename.preprocessed.interval_list --input samplename_T.recal.bam --reference ucsc.hg19.fasta --minimum-base-quality 20 --output samplename_T.allelic_counts
time gatk --java-options "-Xmx5000m" CollectAllelicCounts -L samplename.preprocessed.interval_list --input samplename_N.recal.bam --reference ucsc.hg19.fasta --minimum-base-quality 20 --output samplename_N.allelic_counts
得到:
$ grep -v "@" samplename_T.allelic_counts | head
CONTIG POSITIONREF_COUNT ALT_COUNT REF_NUCLEOTIDE ALT_NUCLEOTIDE
chr1 8064260 0 0 G N
chr1 8064261 0 0 C N
chr1 8064262 0 0 A N
chr1 8064263 0 0 T N
chr1 8064264 0 0 G N
chr1 8064265 0 0 G N
chr1 8064266 0 0 G N
chr1 8064267 0 0 C N
chr1 8064268 0 0 C N
6. ScatterIntervals
官网介绍:ScatterIntervals
功能:
This tool offers multiple interval list file manipulation capabilities, including: sorting, merging, subtracting, padding, and other set-theoretic operations. The default action is to merge and sort the intervals provided in the INPUTs. Other options, e.g. interval subtraction, are controlled by the arguments.
Both IntervalList and VCF files are accepted as input. IntervalList should be denoted with the extension .interval_list, while a VCF must have one of .vcf, .vcf.gz, .bcf When VCF file is used as input, each variant is translated into an using its reference allele or the END INFO annotation (if present) to determine the extent of the interval. IntervalListTools can also "scatter" the resulting interval-list into many interval-files. This can be useful for creating multiple interval lists for scattering an analysis over.
用于Bed区域的合并 排序等操作 没有用到 忽略这一步
7. PostprocessGermlineCNVCalls
官网介绍:PostprocessGermlineCNVCalls
用于GermlineCNV后续处理,本章介绍的是体细胞突变,忽略这一步。
cnv_somatic_panel_workflow.wdl 包含了一个task,介绍了多个N样品合成一个pon:
1. CreateReadCountPanelOfNormals
命令:
gatk --java-options "-Xmx${command_mem_mb}m" CreateReadCountPanelOfNormals \
--input ${sep=" --input " read_count_files} \
--minimum-interval-median-percentile ${default="10.0" minimum_interval_median_percentile} \
--maximum-zeros-in-sample-percentage ${default="5.0" maximum_zeros_in_sample_percentage} \
--maximum-zeros-in-interval-percentage ${default="5.0" maximum_zeros_in_interval_percentage} \
--extreme-sample-median-percentile ${default="2.5" extreme_sample_median_percentile} \
--do-impute-zeros ${default="true" do_impute_zeros} \
--extreme-outlier-truncation-percentile ${default="0.1" extreme_outlier_truncation_percentile} \
--number-of-eigensamples ${default="20" number_of_eigensamples} \
--maximum-chunk-size ${default="16777216" maximum_chunk_size} \
${"--annotated-intervals " + annotated_intervals} \
--output ${pon_entity_id}.pon.hdf5
示例:
time gatk --java-options "-Xmx5000m" CreateReadCountPanelOfNormals --input samplename_N.count.HDF5 --minimum-interval-median-percentile 5.0 --output samplename_N.cnvpon.hdf5
cnv_somatic_pair_workflow.wdl 包含了5个task,介绍了成对样品call CNV的步骤:
1. DenoiseReadCounts
官网介绍:DenoiseReadCounts
命令:
gatk --java-options "-Xmx${command_mem_mb}m" DenoiseReadCounts \
--input ${read_counts} \
--count-panel-of-normals ${read_count_pon} \
${"--number-of-eigensamples " + number_of_eigensamples} \
--standardized-copy-ratios ${entity_id}.standardizedCR.tsv \
--denoised-copy-ratios ${entity_id}.denoisedCR.tsv
示例:
time gatk --java-options "-Xmx5000m" DenoiseReadCounts --input samplename_T.count.HDF5 --count-panel-of-normals samplename_N.cnvpon.hdf5 --standardized-copy-ratios samplename.standardizedCR.tsv --denoised-copy-ratios samplename.denoisedCR.tsv
得到:
$ grep -v "@" samplename.standardizedCR.tsv | head
CONTIG START END LOG2_COPY_RATIO
chr1 8064260 8064939 0.175184
chr1 8072983 8074742 -0.059839
chr1 8075025 8075986 -0.182930
chr1 11169047111697260.007813
chr1 11174082111748210.152895
chr1 11181740111824790.182681
chr1 1118425311184992-0.049350
chr1 11186764111874520.181151
chr1 11187453111879460.067477
2. ModelSegments
官网介绍:ModelSegments
命令:
gatk --java-options "-Xmx${command_mem_mb}m" ModelSegments \
--denoised-copy-ratios ${denoised_copy_ratios} \
--allelic-counts ${allelic_counts} \
${"--normal-allelic-counts " + normal_allelic_counts} \
--minimum-total-allele-count-case ${min_total_allele_count_} \
--minimum-total-allele-count-normal ${default="30" min_total_allele_count_normal} \
--genotyping-homozygous-log-ratio-threshold ${default="-10.0" genotyping_homozygous_log_ratio_threshold} \
--genotyping-base-error-rate ${default="0.05" genotyping_base_error_rate} \
--maximum-number-of-segments-per-chromosome ${default="1000" max_num_segments_per_chromosome} \
--kernel-variance-copy-ratio ${default="0.0" kernel_variance_copy_ratio} \
--kernel-variance-allele-fraction ${default="0.025" kernel_variance_allele_fraction} \
--kernel-scaling-allele-fraction ${default="1.0" kernel_scaling_allele_fraction} \
--kernel-approximation-dimension ${default="100" kernel_approximation_dimension} \
--window-size ${sep=" --window-size " window_sizes} \
--number-of-changepoints-penalty-factor ${default="1.0" num_changepoints_penalty_factor} \
--minor-allele-fraction-prior-alpha ${default="25.0" minor_allele_fraction_prior_alpha} \
--number-of-samples-copy-ratio ${default="100" num_samples_copy_ratio} \
--number-of-burn-in-samples-copy-ratio ${default="50" num_burn_in_copy_ratio} \
--number-of-samples-allele-fraction ${default="100" num_samples_allele_fraction} \
--number-of-burn-in-samples-allele-fraction ${default="50" num_burn_in_allele_fraction} \
--smoothing-credible-interval-threshold-copy-ratio ${default="2.0" smoothing_threshold_copy_ratio} \
--smoothing-credible-interval-threshold-allele-fraction ${default="2.0" smoothing_threshold_allele_fraction} \
--maximum-number-of-smoothing-iterations ${default="10" max_num_smoothing_iterations} \
--number-of-smoothing-iterations-per-fit ${default="0" num_smoothing_iterations_per_fit} \
--output ${output_dir_} \
--output-prefix ${entity_id}
示例:
time gatk --java-options "-Xmx5000m" ModelSegments --denoised-copy-ratios samplename.denoisedCR.tsv --allelic-counts samplename_T.allelic_counts --normal-allelic-counts samplename_N.allelic_counts --output segments --output-prefix samplename ###segments 需要存在,否则报错 A USER ERROR has occurred: Output directory segments does not exist.
得到:
$ ll segments
总用量 68
-rw-rw-r-- 1 test test 10935 5月 28 10:17 samplename.hets.normal.tsv
-rw-rw-r-- 1 test test 10374 5月 28 10:17 samplename.hets.tsv
-rw-rw-r-- 1 test test 5131 5月 28 10:17 samplename.modelBegin.seg
-rw-rw-r-- 1 test test 370 5月 28 10:17 samplename.modelBegin.cr.param
-rw-rw-r-- 1 test test 466 5月 28 10:17 samplename.modelBegin.af.param
-rw-rw-r-- 1 test test 5131 5月 28 10:17 samplename.modelFinal.seg
-rw-rw-r-- 1 test test 370 5月 28 10:17 samplename.modelFinal.cr.param
-rw-rw-r-- 1 test test 466 5月 28 10:17 samplename.modelFinal.af.param
-rw-rw-r-- 1 test test 3838 5月 28 10:17 samplename.cr.seg
-rw-rw-r-- 1 test test 1124 5月 28 10:17 samplename.cr.igv.seg
-rw-rw-r-- 1 test test 1107 5月 28 10:17 samplename.af.igv.seg
3. CallCopyRatioSegments
官网介绍:CallCopyRatioSegments
命令:
gatk --java-options "-Xmx${command_mem_mb}m" CallCopyRatioSegments \
--input ${copy_ratio_segments} \
--neutral-segment-copy-ratio-lower-bound ${default="0.9" neutral_segment_copy_ratio_lower_bound} \
--neutral-segment-copy-ratio-upper-bound ${default="1.1" neutral_segment_copy_ratio_upper_bound} \
--outlier-neutral-segment-copy-ratio-z-score-threshold ${default="2.0" outlier_neutral_segment_copy_ratio_z_score_threshold} \
--calling-copy-ratio-z-score-threshold ${default="2.0" calling_copy_ratio_z_score_threshold} \
--output ${entity_id}.called.seg
示例:
time gatk --java-options "-Xmx5000m" CallCopyRatioSegments -I segments/samplename.cr.seg -O segments/samplename.called.seg
得到samplename.called.seg最终的CNV结果:
$ grep -v "@" segments/samplename.called.seg | head
CONTIG START END NUM_POINTS_COPY_RATIO MEAN_LOG2_COPY_RATIO CALL
chr1 8064260 237048839 34 0.1262210
chr2 29415829234669483 73 -0.065964 0
chr3 10187889178952417 45 -0.044041 0
chr4 1803287 153333226 37 -0.001163 0
chr5 7870633 172422109 37 0.0655670
chr6 43738732152415981 30 -0.044388 0
chr7 1787190 148509104 92 0.0473070
chr8 38270864128753472 25 0.186192+
chr9 5073402 135804576 51 0.0870290
4. PlotDenoisedCopyRatios
功能:画图
需要先建立output目录,sequence-dictionary是必须参数(samtools dict建立即可),并且需要装有optparse包
gatk --java-options "-Xmx${command_mem_mb}m" PlotDenoisedCopyRatios \
--standardized-copy-ratios ${standardized_copy_ratios} \
--denoised-copy-ratios ${denoised_copy_ratios} \
--sequence-dictionary ${ref_fasta_dict} \
--minimum-contig-length ${default="1000000" minimum_contig_length} \
--output ${output_dir_} \
--output-prefix ${entity_id}
5. PlotModeledSegments
画图
gatk --java-options "-Xmx${command_mem_mb}m" PlotModeledSegments \
--denoised-copy-ratios ${denoised_copy_ratios} \
--allelic-counts ${het_allelic_counts} \
--segments ${modeled_segments} \
--sequence-dictionary ${ref_fasta_dict} \
--minimum-contig-length ${default="1000000" minimum_contig_length} \
--output ${output_dir_} \
--output-prefix ${entity_id}
三、执行命令
对上面task学习完成后,整理得到最后的体细胞call CNV的命令。
用到了cnv_common_tasks.wdl task 1、4(N/T分别分析)、5(N/T分别分析)
cnv_somatic_panel_workflow.wdl task 1
cnv_somatic_pair_workflow.wdl task 1、2、3
参考:https://mp.weixin.qq.com/s?__biz=MzAxMDkxODM1Ng==&mid=2247486362&idx=1&sn=0e8ca112641b4d81608b598095255a6f&chksm=9b484b21ac3fc237a1fecf40858a161f33f545f954049bf5142adbc5191615f62006a6c42509&scene=21#wechat_redirect
官网给出的pipeline:https://gatkforums.broadinstitute.org/gatk/discussion/9143/how-to-call-somatic-copy-number-variants-
执行命令:
time gatk --java-options "-Xmx5000m" PreprocessIntervals -L Covered.bed --reference ucsc.hg19.fasta --output samplename.preprocessed.interval_list --padding 250 --bin-length 0 --interval-merging-rule OVERLAPPING_ONLY
time gatk --java-options "-Xmx5000m" CollectReadCounts -L samplename.preprocessed.interval_list --reference ucsc.hg19.fasta --input samplename_T.recal.bam --format HDF5 --interval-merging-rule OVERLAPPING_ONLY --output samplename_T.count.HDF5
time gatk --java-options "-Xmx5000m" CollectReadCounts -L samplename.preprocessed.interval_list --reference ucsc.hg19.fasta --input samplename_N.recal.bam --format HDF5 --interval-merging-rule OVERLAPPING_ONLY --output samplename_N.count.HDF5
time gatk --java-options "-Xmx5000m" CollectAllelicCounts -L samplename.preprocessed.interval_list --input samplename_T.recal.bam --reference ucsc.hg19.fasta --minimum-base-quality 20 --output samplename_T.allelic_counts
time gatk --java-options "-Xmx5000m" CollectAllelicCounts -L samplename.preprocessed.interval_list --input samplename_N.recal.bam --reference ucsc.hg19.fasta --minimum-base-quality 20 --output samplename_N.allelic_counts
time gatk --java-options "-Xmx5000m" CreateReadCountPanelOfNormals --input samplename_N.count.HDF5 --minimum-interval-median-percentile 5.0 --output samplename_N.cnvpon.hdf5
time gatk --java-options "-Xmx5000m" DenoiseReadCounts --input samplename_T.count.HDF5 --count-panel-of-normals samplename_N.cnvpon.hdf5 --standardized-copy-ratios samplename.standardizedCR.tsv --denoised-copy-ratios samplename.denoisedCR.tsv
mkdir segments
time gatk --java-options "-Xmx5000m" ModelSegments --denoised-copy-ratios samplename.denoisedCR.tsv --allelic-counts samplename_T.allelic_counts --normal-allelic-counts samplename_N.allelic_counts --output segments --output-prefix samplename
time gatk --java-options "-Xmx5000m" CallCopyRatioSegments -I segments/samplename.cr.seg -O segments/samplename.called.seg
得到:
-rw-rw-r-- 1 test test 43606 5月 28 13:34 samplename.preprocessed.interval_list
-rw-rw-r-- 1 test test 45936 5月 28 13:34 samplename_T.count.HDF5
-rw-rw-r-- 1 test test 45936 5月 28 13:35 samplename_N.count.HDF5
-rw-rw-r-- 1 test test 17358810 5月 28 13:37 samplename_T.allelic_counts
-rw-rw-r-- 1 test test 17450878 5月 28 13:41 samplename_N.allelic_counts
-rw-rw-r-- 1 test test 85976 5月 28 13:41 samplename_N.cnvpon.hdf5
-rw-rw-r-- 1 test test 33992 5月 28 13:41 samplename.standardizedCR.tsv
-rw-rw-r-- 1 test test 33992 5月 28 13:41 samplename.denoisedCR.tsv
drwxrwxr-x 2 test test 4096 5月 28 13:41 segments
时间:
real 0m10.336s
real 0m43.469s
real 0m33.091s
real 2m21.804s
real 3m12.449s
real 0m16.881s
real 0m10.828s
real 0m12.202s
real 0m13.767s
这一专辑 GATK Best Practices 从step0-step5学习完成!
网友评论