Structural Variation Pipeline

Author: 期待未来 | Published 2021-08-11 23:50

    Pipeline Overview

    (Reposted from GitHub: https://github.com/broadinstitute/gatk-sv)

    The pipeline consists of a series of modules that perform the following:

    • Module 00a: SV evidence collection, including calls from a configurable set of algorithms (Delly, Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
    • Module 00b: Dosage bias scoring and ploidy estimation
    • Module 00c: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
    • Module 01: Variant clustering
    • Module 02: Variant filtering metric generation
    • Module 03: Variant filtering; outlier exclusion
    • Module 04: Genotyping
    • Module 05/06: Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup
    • Module 07: Downstream filtering, including minGQ filtering, batch effect checks, outlier sample removal, and final recalibration;
    • Module 08: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets;
    • Module 09: Visualization, including scripts that generate IGV screenshots and RD plots.
    • Additional modules to be added: de novo and mosaic scripts

    gCNV Training

    Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend clustering them using the dosage score, and training a separate model for each cluster.
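To make the clustering recommendation concrete, below is a minimal sketch of one way to group samples by their one-dimensional dosage (WGD) score before training per-cluster gCNV models. The function name, the gap threshold, and the greedy gap-splitting strategy are all illustrative assumptions, not part of the pipeline itself.

```python
def cluster_by_wgd(scores, gap=0.1):
    """Group samples into clusters by 1-D dosage (WGD) score.

    Sorts samples by score and starts a new cluster wherever two
    consecutive scores are separated by more than `gap`. Illustrative
    only: real batching may use more sophisticated clustering.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    clusters = [[ordered[0][0]]]
    prev = ordered[0][1]
    for sample, score in ordered[1:]:
        if score - prev > gap:
            clusters.append([])  # large gap -> new data-source cluster
        clusters[-1].append(sample)
        prev = score
    return clusters

# Hypothetical scores: two PCR- -like samples near 0, two PCR+ -like above it
scores = {"s1": -0.05, "s2": -0.02, "s3": 0.31, "s4": 0.28}
clusters = cluster_by_wgd(scores)
```

Each resulting cluster would then get its own trained gCNV model.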

    Module Descriptions

    The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from GATKSVPipelineBatch.wdl, and example input files for each module can be found in /test.

    Module 00a

    Runs raw evidence collection on each sample.

    Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

    Inputs:

    • Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

    Outputs:

    • Caller VCFs (Delly, Manta, MELT, and/or Wham)
    • Binned read counts file
    • Split reads (SR) file
    • Discordant read pairs (PE) file
    • B-allele fraction (BAF) file

    Module 00b

    Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.

    For large cohorts, we recommend dividing samples into smaller batches (~500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.

    We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file.
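One simple way to realize the "~500 samples, ~1:1 male:female" guidance is to interleave equal slices of the male and female sample lists into each batch. This sketch is an assumption about one reasonable strategy, not the pipeline's actual batching code; see the Batching section for authoritative guidance.

```python
def make_batches(males, females, batch_size=500):
    """Build batches of up to `batch_size` samples with a male:female
    ratio near 1:1, by taking equal-sized slices from each list.
    Illustrative only."""
    half = batch_size // 2
    n_batches = max(
        -(-len(males) // half),    # ceiling division
        -(-len(females) // half),
    )
    batches = []
    for i in range(n_batches):
        batch = males[i * half:(i + 1) * half] + females[i * half:(i + 1) * half]
        batches.append(batch)
    return batches

# Hypothetical cohort: 6 males and 4 females, tiny batch_size for illustration
batches = make_batches([f"m{i}" for i in range(6)],
                       [f"f{i}" for i in range(4)],
                       batch_size=4)
```

When the sexes are imbalanced, trailing batches degrade gracefully toward whichever sex remains.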

    Prerequisites:

    Inputs:

    Outputs:

    • Per-sample dosage scores with plots
    • Ploidy estimates, sex assignments, with plots
    • (Optional) Outlier samples detected by call counts

    Preliminary Sample QC

    The purpose of sample filtering at this stage, after Module 00b, is to prevent very poor quality samples from interfering with results for the rest of the callset. In general, borderline samples can be left in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be further opportunities for filtering (as part of Module 03) before the joint genotyping stage if necessary. Here are a few basic QC checks that we recommend:

    • Look at the X and Y ploidy plots, and check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
    • Look at the dosage score (WGD) distribution and check that it is centered around 0 (the distribution of WGD for PCR- samples is expected to be slightly lower than 0, and the distribution of WGD for PCR+ samples is expected to be slightly greater than 0. Refer to the gnomAD-SV paper for more information on WGD score). Optionally filter outliers.
    • Look at the low outliers for each SV caller (samples with much lower than typical numbers of SV calls per contig for each caller). An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
    • Look at the high outliers for each SV caller and optionally filter outliers; samples with many more SV calls than average may be poor quality.
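The low/high outlier checks above amount to flagging samples whose per-caller SV call counts sit far from the cohort median. Below is a minimal sketch of one such rule using the median absolute deviation (MAD); the function name and the cutoff `k` are illustrative assumptions, and you should tune thresholds to your cohort.

```python
from statistics import median

def flag_outliers(call_counts, k=6.0):
    """Flag samples whose SV call count is far from the cohort median,
    using the median absolute deviation (MAD). Illustrative only:
    k is a hypothetical cutoff, not a pipeline default."""
    counts = list(call_counts.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts) or 1.0  # guard against zero MAD
    high = [s for s, c in call_counts.items() if c - med > k * mad]
    low = [s for s, c in call_counts.items() if med - c > k * mad]
    return low, high

# Hypothetical per-sample call counts: s4 is a high outlier, s5 had zero calls
counts = {"s1": 1000, "s2": 1050, "s3": 980, "s4": 9000, "s5": 0}
low, high = flag_outliers(counts)
```

A zero-call sample like `s5` surfaces in the low list, matching the "check that no samples had zero calls" advice.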

    gCNV Training

    Trains a gCNV model for use in Module 00c. The WDL can be found at /gcnv/trainGCNV.wdl.

    Prerequisites:

    Inputs:

    Outputs:

    • Contig ploidy model tarball
    • gCNV model tarballs

    Module 00c

    Runs CNV callers (cn.MOPS, GATK gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.

    Prerequisites:

    Inputs:

    • PED file (updated with Module 00b sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)
    • Per-sample GVCFs generated with HaplotypeCaller (gvcfs input), or a jointly-genotyped VCF (position-sharded, snp_vcfs input or snp_vcfs_shard_list input). The jointly-genotyped VCF may contain multi-allelic sites and indels, but only biallelic SNVs will be used by the pipeline. We recommend shards of 10 GB or less to lower compute time and resources.
    • Read count, BAF, PE, and SR files (Module 00a)
    • Caller VCFs (Module 00a)
    • Contig ploidy model and gCNV model files (gCNV training)
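For reference, a PED file is a six-column, whitespace-delimited table (family ID, sample ID, paternal ID, maternal ID, sex, phenotype), where sex is coded 1 = male, 2 = female, 0 = unknown. A minimal sketch with hypothetical IDs, showing sex = 0 for a sex-aneuploid sample as described above:

```text
#FAM_ID  SAMPLE_ID  FATHER_ID  MOTHER_ID  SEX  PHENO
fam1     sample_01  0          0          1    0
fam1     sample_02  0          0          2    0
fam2     sample_03  0          0          0    0
```

Here `sample_03` carries sex = 0, so no calls would be made for it on the sex chromosomes.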

    Outputs:

    • Combined read count matrix, SR, PE, and BAF files
    • Standardized call VCFs
    • Depth-only (DEL/DUP) calls
    • Per-sample median coverage estimates
    • (Optional) Evidence QC plots

    Module 01

    Clusters SV calls across a batch.

    Prerequisites:

    Inputs:

    Outputs:

    • Clustered SV VCFs
    • Clustered depth-only call VCF

    Module 02

    Generates variant metrics for filtering.

    Prerequisites:

    Inputs:

    Outputs:

    • Metrics file

    Module 03

    Filters poor quality variants and filters outlier samples.

    Prerequisites:

    Inputs:

    • Batch PED file
    • Metrics file (Module 02)
    • Clustered SV and depth-only call VCFs (Module 01)

    Outputs:

    • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
    • Filtered depth-only call VCF with outlier samples excluded
    • Random forest cutoffs file
    • PED file with outlier samples excluded

    Merge Cohort VCFs

    Combines filtered variants across batches. The WDL can be found at: /wdl/MergeCohortVcfs.wdl.

    Prerequisites:

    Inputs:

    Outputs:

    • Combined cohort PESR and depth VCFs
    • Cohort and clustered depth variant BED files

    Module 04

    Genotypes a batch of samples across unfiltered variants combined across all batches.

    Prerequisites:

    Inputs:

    • Batch PESR and depth VCFs (Module 03)
    • Cohort PESR and depth VCFs (Merge Cohort VCFs)
    • Batch read count, PE, and SR files (Module 00c)

    Outputs:

    • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
    • Filtered depth-only call VCF with outlier samples excluded
    • PED file with outlier samples excluded
    • List of SR pass variants
    • List of SR fail variants
    • (Optional) Depth re-genotyping intervals list

    Module 04b

    Re-genotypes probable mosaic variants across multiple batches.

    Prerequisites:

    Inputs:

    • Per-sample median coverage estimates (Module 00c)
    • Pre-genotyping depth VCFs (Module 03)
    • Batch PED files (Module 03)
    • Clustered depth variant BED file (Merge Cohort VCFs)
    • Cohort depth VCF (Merge Cohort VCFs)
    • Genotyped depth VCFs (Module 04)
    • Genotyped depth RD cutoffs file (Module 04)

    Outputs:

    • Re-genotyped depth VCFs

    Module 05/06

    Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.

    Prerequisites:

    Inputs:

    Outputs:

    • Finalized "cleaned" VCF and QC plots

    Module 07 (in development)

    Applies downstream filtering steps to the cleaned VCF to further control the false discovery rate. All steps are optional; users should decide which to apply based on the goals of their projects.

    Filtering methods include:

    • minGQ - remove variants based on genotype quality across populations. Note: trio families are required to build the minGQ filtering model in this step. For projects that lack family structures, we provide tables pre-trained on 1000 Genomes samples at different FDR thresholds; they can be found here:
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
    
    
    • BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
    • FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs
    • FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation
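To illustrate the idea behind the BatchEffect step, the sketch below computes a two-proportion z-test p-value for an allele-frequency difference between two batches. This is an assumed stand-in for "significant discrepancies in allele frequencies"; the pipeline's actual statistical test may differ.

```python
import math

def af_batch_effect_p(ac1, an1, ac2, an2):
    """Two-sided two-proportion z-test p-value for a difference in
    allele frequency (AC/AN) between two batches. Illustrative only."""
    p1, p2 = ac1 / an1, ac2 / an2
    pooled = (ac1 + ac2) / (an1 + an2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / an1 + 1 / an2))
    if se == 0:
        return 1.0  # monomorphic in both batches: no evidence of difference
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Identical frequencies across batches yield p = 1.0
p_same = af_batch_effect_p(10, 100, 10, 100)
# A large discrepancy (AF 0.5 vs 0.05) yields a very small p-value
p_diff = af_batch_effect_p(50, 100, 5, 100)
```

Variants with p-values below a chosen significance cutoff in many batch pairs would be candidates for removal.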

    Module 08 (in development)

    Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

    Annotation methods include:

    • Functional annotation - annotate SVs with inferred functional effects on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements;
    • Allele frequency annotation - annotate SVs with their allele frequencies across all samples, samples of a specific sex, and specific sub-populations;
    • Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of overlapping SVs in another callset, e.g. the gnomAD-SV callset.
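At its core, allele frequency annotation reduces to computing AC, AN, and AF from per-sample genotypes. The sketch below shows that arithmetic for a biallelic diploid site; the function name is hypothetical, though the AC/AN/AF field names follow standard VCF conventions.

```python
def allele_frequency(genotypes):
    """Compute allele count (AC), allele number (AN), and allele
    frequency (AF) from diploid genotype tuples, skipping no-calls.
    Illustrative only; assumes a biallelic site with alleles 0 and 1."""
    ac = an = 0
    for gt in genotypes:
        for allele in gt:
            if allele is None:   # no-call, i.e. "./."
                continue
            an += 1
            ac += allele         # allele 1 increments the alt count
    return {"AC": ac, "AN": an, "AF": ac / an if an else 0.0}

# Hypothetical genotypes 0/1, 1/1, 0/0, ./. across four samples
stats = allele_frequency([(0, 1), (1, 1), (0, 0), (None, None)])
```

Sub-population or per-sex AFs follow by restricting the genotype list to the relevant sample subset before calling the same function.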

    Module 09 (in development)

    Visualizes SVs with IGV screenshots and read depth plots.

    Visualization methods include:

    • RD visualization - generate RD plots across all samples, ideal for visualizing large CNVs;
    • IGV visualization - generate IGV plots of each SV for individual samples, ideal for visualizing small de novo SVs;
    • Module09.visualize.wdl - generate RD plots and IGV plots and combine them for easy review.

    Reference: A structural variation reference for medical and population genetics
