Structural Variation Pipeline

Author: 期待未来 | Published 2021-08-11 23:50

    Pipeline Overview

    (Reposted from GitHub: https://github.com/broadinstitute/gatk-sv)

    The pipeline consists of a series of modules that perform the following:

    • Module 00a: SV evidence collection, including calls from a configurable set of algorithms (Delly, Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
    • Module 00b: Dosage bias scoring and ploidy estimation
    • Module 00c: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
    • Module 01: Variant clustering
    • Module 02: Variant filtering metric generation
    • Module 03: Variant filtering; outlier exclusion
    • Module 04: Genotyping
    • Module 05/06: Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup
    • Module 07: Downstream filtering, including minGQ filtering, batch effect checks, outlier sample removal, and final recalibration;
    • Module 08: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets;
    • Module 09: Visualization, including scripts that generate IGV screenshots and RD plots.
    • Additional modules to be added: de novo and mosaic scripts

    gCNV Training

    Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend clustering them using the dosage score, and training a separate model for each cluster.
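To make the clustering recommendation concrete, below is a minimal sketch of one way to group samples by their one-dimensional dosage (WGD) score before training per-cluster gCNV models. The function name, the gap threshold, and the greedy gap-splitting strategy are all illustrative assumptions, not part of the pipeline itself.

```python
def cluster_by_wgd(scores, gap=0.1):
    """Group samples into clusters by 1-D dosage (WGD) score.

    Sorts samples by score and starts a new cluster wherever two
    consecutive scores are separated by more than `gap`. Illustrative
    only: real batching may use more sophisticated clustering.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    clusters = [[ordered[0][0]]]
    prev = ordered[0][1]
    for sample, score in ordered[1:]:
        if score - prev > gap:
            clusters.append([])  # large gap -> new data-source cluster
        clusters[-1].append(sample)
        prev = score
    return clusters

# Hypothetical scores: two PCR- -like samples near 0, two PCR+ -like above it
scores = {"s1": -0.05, "s2": -0.02, "s3": 0.31, "s4": 0.28}
clusters = cluster_by_wgd(scores)
```

Each resulting cluster would then get its own trained gCNV model.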

    Module Descriptions

    The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from GATKSVPipelineBatch.wdl, and example input files for each module can be found in /test.

    Module 00a

    Runs raw evidence collection on each sample.

    Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

    Inputs:

    • Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

    Outputs:

    • Caller VCFs (Delly, Manta, MELT, and/or Wham)
    • Binned read counts file
    • Split reads (SR) file
    • Discordant read pairs (PE) file
    • B-allele fraction (BAF) file

    Module 00b

    Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.

    For large cohorts, we recommend dividing samples into smaller batches (~500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.

    We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file.
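One simple way to realize the "~500 samples, ~1:1 male:female" guidance is to interleave equal slices of the male and female sample lists into each batch. This sketch is an assumption about one reasonable strategy, not the pipeline's actual batching code; see the Batching section for authoritative guidance.

```python
def make_batches(males, females, batch_size=500):
    """Build batches of up to `batch_size` samples with a male:female
    ratio near 1:1, by taking equal-sized slices from each list.
    Illustrative only."""
    half = batch_size // 2
    n_batches = max(
        -(-len(males) // half),    # ceiling division
        -(-len(females) // half),
    )
    batches = []
    for i in range(n_batches):
        batch = males[i * half:(i + 1) * half] + females[i * half:(i + 1) * half]
        batches.append(batch)
    return batches

# Hypothetical cohort: 6 males and 4 females, tiny batch_size for illustration
batches = make_batches([f"m{i}" for i in range(6)],
                       [f"f{i}" for i in range(4)],
                       batch_size=4)
```

When the sexes are imbalanced, trailing batches degrade gracefully toward whichever sex remains.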

    Prerequisites:

    Inputs:

    Outputs:

    • Per-sample dosage scores with plots
    • Ploidy estimates, sex assignments, with plots
    • (Optional) Outlier samples detected by call counts

    Preliminary Sample QC

    The purpose of sample filtering at this stage, after Module 00b, is to prevent very poor quality samples from interfering with results for the rest of the callset. In general, borderline samples can be left in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be further opportunities for filtering (as part of Module 03) before the joint genotyping stage if necessary. Here are a few basic QC checks that we recommend:

    • Look at the X and Y ploidy plots, and check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
    • Look at the dosage score (WGD) distribution and check that it is centered around 0 (the distribution of WGD for PCR- samples is expected to be slightly lower than 0, and the distribution of WGD for PCR+ samples is expected to be slightly greater than 0. Refer to the gnomAD-SV paper for more information on WGD score). Optionally filter outliers.
    • Look at the low outliers for each SV caller (samples with much lower than typical numbers of SV calls per contig for each caller). An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
    • Look at the high outliers for each SV caller and optionally filter outliers; samples with many more SV calls than average may be poor quality.
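The low/high outlier checks above amount to flagging samples whose per-caller SV call counts sit far from the cohort median. Below is a minimal sketch of one such rule using the median absolute deviation (MAD); the function name and the cutoff `k` are illustrative assumptions, and you should tune thresholds to your cohort.

```python
from statistics import median

def flag_outliers(call_counts, k=6.0):
    """Flag samples whose SV call count is far from the cohort median,
    using the median absolute deviation (MAD). Illustrative only:
    k is a hypothetical cutoff, not a pipeline default."""
    counts = list(call_counts.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts) or 1.0  # guard against zero MAD
    high = [s for s, c in call_counts.items() if c - med > k * mad]
    low = [s for s, c in call_counts.items() if med - c > k * mad]
    return low, high

# Hypothetical per-sample call counts: s4 is a high outlier, s5 had zero calls
counts = {"s1": 1000, "s2": 1050, "s3": 980, "s4": 9000, "s5": 0}
low, high = flag_outliers(counts)
```

A zero-call sample like `s5` surfaces in the low list, matching the "check that no samples had zero calls" advice.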

    gCNV Training

    Trains a gCNV model for use in Module 00c. The WDL can be found at /gcnv/trainGCNV.wdl.

    Prerequisites:

    Inputs:

    Outputs:

    • Contig ploidy model tarball
    • gCNV model tarballs

    Module 00c

    Runs CNV callers (cn.MOPS, GATK gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.

    Prerequisites:

    Inputs:

    • PED file (updated with Module 00b sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)
    • Per-sample GVCFs generated with HaplotypeCaller (gvcfs input), or a jointly-genotyped VCF (position-sharded, snp_vcfs input or snp_vcfs_shard_list input). The jointly-genotyped VCF may contain multi-allelic sites and indels, but only biallelic SNVs will be used by the pipeline. We recommend shards of 10 GB or less to lower compute time and resources.
    • Read count, BAF, PE, and SR files (Module 00a)
    • Caller VCFs (Module 00a)
    • Contig ploidy model and gCNV model files (gCNV training)
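For reference, a PED file is a six-column, whitespace-delimited table (family ID, sample ID, paternal ID, maternal ID, sex, phenotype), where sex is coded 1 = male, 2 = female, 0 = unknown. A minimal sketch with hypothetical IDs, showing sex = 0 for a sex-aneuploid sample as described above:

```text
#FAM_ID  SAMPLE_ID  FATHER_ID  MOTHER_ID  SEX  PHENO
fam1     sample_01  0          0          1    0
fam1     sample_02  0          0          2    0
fam2     sample_03  0          0          0    0
```

Here `sample_03` carries sex = 0, so no calls would be made for it on the sex chromosomes.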

    Outputs:

    • Combined read count matrix, SR, PE, and BAF files
    • Standardized call VCFs
    • Depth-only (DEL/DUP) calls
    • Per-sample median coverage estimates
    • (Optional) Evidence QC plots

    Module 01

    Clusters SV calls across a batch.

    Prerequisites:

    Inputs:

    Outputs:

    • Clustered SV VCFs
    • Clustered depth-only call VCF

    Module 02

    Generates variant metrics for filtering.

    Prerequisites:

    Inputs:

    Outputs:

    • Metrics file

    Module 03

    Filters poor quality variants and filters outlier samples.

    Prerequisites:

    Inputs:

    • Batch PED file
    • Metrics file (Module 02)
    • Clustered SV and depth-only call VCFs (Module 01)

    Outputs:

    • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
    • Filtered depth-only call VCF with outlier samples excluded
    • Random forest cutoffs file
    • PED file with outlier samples excluded

    Merge Cohort VCFs

    Combines filtered variants across batches. The WDL can be found at: /wdl/MergeCohortVcfs.wdl.

    Prerequisites:

    Inputs:

    Outputs:

    • Combined cohort PESR and depth VCFs
    • Cohort and clustered depth variant BED files

    Module 04

    Genotypes a batch of samples across unfiltered variants combined across all batches.

    Prerequisites:

    Inputs:

    • Batch PESR and depth VCFs (Module 03)
    • Cohort PESR and depth VCFs (Merge Cohort VCFs)
    • Batch read count, PE, and SR files (Module 00c)

    Outputs:

    • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
    • Filtered depth-only call VCF with outlier samples excluded
    • PED file with outlier samples excluded
    • List of SR pass variants
    • List of SR fail variants
    • (Optional) Depth re-genotyping intervals list

    Module 04b

    Re-genotypes probable mosaic variants across multiple batches.

    Prerequisites:

    Inputs:

    • Per-sample median coverage estimates (Module 00c)
    • Pre-genotyping depth VCFs (Module 03)
    • Batch PED files (Module 03)
    • Clustered depth variant BED file (Merge Cohort VCFs)
    • Cohort depth VCF (Merge Cohort VCFs)
    • Genotyped depth VCFs (Module 04)
    • Genotyped depth RD cutoffs file (Module 04)

    Outputs:

    • Re-genotyped depth VCFs

    Module 05/06

    Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.

    Prerequisites:

    Inputs:

    Outputs:

    • Finalized "cleaned" VCF and QC plots

    Module 07 (in development)

    Applies downstream filtering steps to the cleaned VCF to further control the false discovery rate. All steps are optional; users should decide which to apply based on the goals of their projects.

    Filtering methods include:

    • minGQ - remove variants based on genotype quality across populations. Note: trio families are required to build the minGQ filtering model in this step. For projects that lack family structures, we provide tables pre-trained on 1000 Genomes samples at different FDR thresholds; they can be found here:
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    
    
    • BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
    • FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs
    • FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
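To illustrate the idea behind the BatchEffect step, the sketch below computes a two-proportion z-test p-value for an allele-frequency difference between two batches. This is an assumed stand-in for "significant discrepancies in allele frequencies"; the pipeline's actual statistical test may differ.

```python
import math

def af_batch_effect_p(ac1, an1, ac2, an2):
    """Two-sided two-proportion z-test p-value for a difference in
    allele frequency (AC/AN) between two batches. Illustrative only."""
    p1, p2 = ac1 / an1, ac2 / an2
    pooled = (ac1 + ac2) / (an1 + an2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / an1 + 1 / an2))
    if se == 0:
        return 1.0  # monomorphic in both batches: no evidence of difference
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Identical frequencies across batches yield p = 1.0
p_same = af_batch_effect_p(10, 100, 10, 100)
# A large discrepancy (AF 0.5 vs 0.05) yields a very small p-value
p_diff = af_batch_effect_p(50, 100, 5, 100)
```

Variants with p-values below a chosen significance cutoff in many batch pairs would be candidates for removal.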

    Module 08 (in development)

    Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

    Annotation methods include:

    • Functional annotation - annotate SVs with inferred functional effects on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements;
    • Allele frequency annotation - annotate SVs with their allele frequencies across all samples, samples of a specific sex, and specific sub-populations;
    • Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of overlapping SVs in another callset, e.g. the gnomAD-SV callset.
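At its core, allele frequency annotation reduces to computing AC, AN, and AF from per-sample genotypes. The sketch below shows that arithmetic for a biallelic diploid site; the function name is hypothetical, though the AC/AN/AF field names follow standard VCF conventions.

```python
def allele_frequency(genotypes):
    """Compute allele count (AC), allele number (AN), and allele
    frequency (AF) from diploid genotype tuples, skipping no-calls.
    Illustrative only; assumes a biallelic site with alleles 0 and 1."""
    ac = an = 0
    for gt in genotypes:
        for allele in gt:
            if allele is None:   # no-call, i.e. "./."
                continue
            an += 1
            ac += allele         # allele 1 increments the alt count
    return {"AC": ac, "AN": an, "AF": ac / an if an else 0.0}

# Hypothetical genotypes 0/1, 1/1, 0/0, ./. across four samples
stats = allele_frequency([(0, 1), (1, 1), (0, 0), (None, None)])
```

Sub-population or per-sex AFs follow by restricting the genotype list to the relevant sample subset before calling the same function.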

    Module 09 (in development)

    Visualizes SVs with IGV screenshots and read depth plots.

    Visualization methods include:

    • RD visualization - generate RD plots across all samples, ideal for visualizing large CNVs;
    • IGV visualization - generate IGV plots of each SV for individual samples, ideal for visualizing small de novo SVs;
    • Module09.visualize.wdl - generate RD plots and IGV plots and combine them for easy review.

    Reference: A structural variation reference for medical and population genetics
