Dragon star Day 3 Pt.1: On structural variation, C

Author: 美式永不加糖 | Published: 2019-08-12 09:06

    Dragon star Day 3 Pt.1

    On structural variation, CNV calling, SNP genotyping, HMM, and NGS-based SV detection

    Dragonstar2019 by Kai Wang

    1. Detection of structural variants in human
    2. Annotation and phenotype-driven interpretation of genetic variants

    Part Ⅰ Detection of structural variants in human

    1 Human genetic variation

    Pollex et al, Circulation. 2007

    http://doc.goldenhelix.com/SVS/tutorials/cnv_univariate_analysis/overview.html

    2 Mechanisms underlying structural variant formation

    Ottaviani D, LeCain M, Sheer D. The role of microhomology in genomic structural variation[J]. Trends in Genetics, 2014, 30(3): 85-94.

    2.1 Recurrent structural variants

    • Share the same size and genomic content in unrelated individuals.

    • Often caused by NAHR (nonallelic homologous recombination): nonallelic pairing of paralogous sequences, followed by crossover, leads to deletions, duplications, and inversions.

      Non-allelic homologous recombination (NAHR) is a form of homologous recombination that occurs between two lengths of DNA that have high sequence similarity, but are not alleles.

      https://en.wikipedia.org/wiki/Non-allelic_homologous_recombination

    2.2 Nonrecurrent rearrangements

    • Typical mechanisms:

      • NHEJ: non-homologous end joining

        Nonhomologous end joining (NHEJ) is an error-prone mechanism that does not use any DNA sequence homology to repair the DSB (Weterings & Chen, 2008).

        From: Methods in Enzymology, 2017

      • MMEJ: microhomology-mediated end joining

        Microhomology-mediated end joining (MMEJ), also known as alternative nonhomologous end joining (Alt-NHEJ), is one of the pathways for repairing double-strand breaks in DNA.

        https://en.wikipedia.org/wiki/Microhomology-mediated_end_joining

      • FoSTeS/MMBIR: fork stalling and template switching (FoSTeS) / microhomology-mediated break-induced replication (MMBIR)

      • SRS: Smaller complex rearrangements caused by serial replication slippage

    3 Technologies for CNV Detection

    3.1 Karyotyping and cytogenetic analysis

    • Giemsa staining

      • Giemsa banding (G-banding)
      • Dark bands are AT-rich and contain fewer genes.
      • Light bands are GC-rich and more transcriptionally active.
      Huang et al, The Tohoku journal of experimental medicine. 2010
    • Fluorescence in situ hybridization (FISH)

      FISH uses fluorescent probes that bind to specific chromosomal regions where there is a high degree of sequence complementarity.

      Aleksic et al, Scientific reports. 2013
    • Comparative genomic hybridization (CGH)

      Comparative genomic hybridization (CGH) is a molecular cytogenetic method that, without culturing cells, analyzes copy number variation (CNV) relative to ploidy level in the DNA of a test sample compared with a reference sample. Its aim is to compare, quickly and efficiently, two DNA samples from two sources. The two samples are usually closely related, because either may carry gains or losses of whole chromosomes or of subchromosomal regions (parts of a chromosome). The technique was originally developed to evaluate differences between the chromosomal complements of solid tumors and normal tissue. Compared with the more traditional Giemsa banding and fluorescence in situ hybridization (FISH) techniques, which are limited by the resolution of the microscope used, it improves the resolution to 5-10 Mb.

      https://zh.wikipedia.org/wiki/比较基因组杂交

      http://www.cytogen.jp/index/pdf/13-b.pdf
    • Spectral karyotyping (SKY)

      Spectral karyotyping (SKY) is a FISH-based method that labels each chromosome with a different color, allowing the identification of the chromosomal origin of all elements of the examined karyotype.

      From: Molecular Diagnostics, 2010

    4 SNP genotyping arrays

    A SNP genotyping array is a type of DNA microarray used to detect SNPs.

    • Affymetrix arrays
      • In the Affymetrix assay, there are 25-mer probes for both alleles.
      • Assume there are two alleles (e.g., the A allele and the B allele) at a particular site:
        • The DNA can bind to both probes.
        • But it will have much higher affinity for the perfectly matched probe.
    • Illumina arrays
      • In the Illumina array, attached to each Illumina bead is a 50-mer sequence complementary to the sequence adjacent to the SNP site.
      • The single-base extension (T or G) that is complementary to the allele carried by the DNA (A or C, respectively) then binds and results in the appropriately colored signal (red or green, respectively).
    Schematic view of SNP array analysis by Affymetrix (right) and Illumina (left).

    Iacobucci I, Lonetti A, Papayannidis C, et al. Use of single nucleotide polymorphism array technology to improve the identification of chromosomal lesions in leukemia[J]. Current cancer drug targets, 2013, 13(7): 791-810.

    5 CNV Detection

    There is a need to develop a high-resolution CNV detection algorithm using high-density SNP genotyping data:

    • Identify location of the CNVs
    • Estimate the copy numbers
    • Model family relationships
    • Incorporate de novo events

    5.1 Log R Ratio (LRR) and B Allele Frequency (BAF)

    For both platforms, the computational algorithms convert the raw signals into Log R Ratio (LRR) and B Allele Frequency (BAF).

    • LRR is a measure of normalized total signal intensity.
    • BAF is a measure of normalized allelic intensity ratio.

    BAF = Y / (X + Y)
    LRR = log_2( (X + Y)_{sample of interest} / (X + Y)_{baseline sample} )

    https://www.biostars.org/p/199025/
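
    The two simplified formulas above can be tried on a few idealized intensity pairs. This is a minimal sketch: the function names `baf` and `lrr` and all intensity values are made up for illustration, and real array data are noisier and usually less extreme than these numbers.

```python
import math

def baf(x, y):
    """Simplified B allele frequency: share of the B-allele signal (Y) in the total."""
    return y / (x + y)

def lrr(x, y, x_ref, y_ref):
    """Log R Ratio: log2 of the sample's total intensity over the baseline total."""
    return math.log2((x + y) / (x_ref + y_ref))

# Hypothetical normalized intensities: X is the A-allele signal, Y the B-allele signal.
baseline = (1.0, 1.0)  # assumed baseline: heterozygous marker at copy number 2
examples = {
    "normal AB": (1.0, 1.0),
    "deletion, A only": (1.0, 0.0),
    "duplication AAB": (2.0, 1.0),
}

results = {name: (lrr(x, y, *baseline), baf(x, y)) for name, (x, y) in examples.items()}
for name, (l, b) in results.items():
    print(f"{name}: LRR={l:+.2f}, BAF={b:.2f}")
```

    In this toy setting a deletion lowers LRR and pushes BAF to 0 or 1, while a duplication raises LRR and shifts BAF away from 0.5.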

    LRR and BAF can be used together to determine different copy numbers and to differentiate copy-neutral LOH regions from normal copy regions.

    Loss of heterozygosity (LOH) is a cross chromosomal event that results in loss of the entire gene and the surrounding chromosomal region.

    https://en.wikipedia.org/wiki/Loss_of_heterozygosity

    5.2 Detection of CNVs from SNP arrays using PennCNV

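    • A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with hidden states.

      A hidden Markov model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters. The difficulty lies in determining the hidden parameters of the process from the observable ones. These parameters are then used for further analysis, such as pattern recognition.

      In an ordinary Markov model, the state is directly visible to the observer, so the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but some variables influenced by the state are. Each state has a probability distribution over the possible output symbols, so the sequence of output symbols reveals some information about the sequence of states.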

      https://zh.wikipedia.org/wiki/隐马尔可夫模型

      Further reading: 漫谈 Hidden Markov Model

    • What we know are: LRR and BAF

    • What we want to know is: copy number

    5.2.1 PennCNV Flowchart
    5.2.2 SNP Signal Intensities

    R = X_{A} + X_{B}, θ = (2/π)·arctan(X_{B}/X_{A}), LRR = log_2(R_{subject}/R_{expected})

    X_{A} and X_{B}: normalized signal intensities for alleles A and B

    R_{expected}: calculated based on a reference dataset assuming copy number = 2

    Infinium II is a two-channel assay and data consist of two intensity values (X, Y) for each SNP, with one intensity channel for each of the fluorescent dyes associated with the two alleles of the SNP.

    Normalized allele intensities are transformed to a combined SNP intensity, R (R = X + Y), and an allelic intensity ratio, theta (θ = 2/π*arctan(Y/X)).

    Staaf J, Vallon-Christersson J, Lindgren D, et al. Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios[J]. BMC bioinformatics, 2008, 9(1): 409.
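
    The transformation from (X, Y) to (R, θ) described above can be sketched as follows. The function names and intensity values are hypothetical. `math.atan2(y, x)` is used in place of arctan(Y/X) so that X = 0 does not cause a division by zero; the two agree for non-negative intensities.

```python
import math

def polar_transform(x, y):
    """Map normalized intensities X (allele A) and Y (allele B) to (R, theta)."""
    r = x + y
    theta = (2.0 / math.pi) * math.atan2(y, x)
    return r, theta

def log_r_ratio(r_subject, r_expected):
    """LRR = log2(R_subject / R_expected)."""
    return math.log2(r_subject / r_expected)

# Hypothetical, noise-free markers for the three genotypes at copy number 2.
genotypes = {"AA": (2.0, 0.0), "AB": (1.0, 1.0), "BB": (0.0, 2.0)}
thetas = {g: polar_transform(x, y)[1] for g, (x, y) in genotypes.items()}

for g, t in thetas.items():
    print(f"{g}: theta={t:.2f}")
print("LRR when R is half the expected value:", log_r_ratio(1.0, 2.0))
```

    With these idealized inputs θ is close to 0 for AA, 0.5 for AB, and 1 for BB, which is why θ serves as the starting point for the B Allele Frequency.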

    5.2.3 Visualization of CNVs
    5.2.4 Hidden Markov Model in PennCNV

    Transition probability matrix a_{ij}: a_{ij} = P[q_{t+1} = j | q_{t} = i]

    Emission probabilities e_{i}(a): the probability that state i emits character a

    http://www.cs.cmu.edu/~durand/03-711/2009/Lectures/hmm09-1.pdf

    https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/download/lectures/PCB_Lect06_HMM.pdf

    • Emission Probability of LRR

      Given a copy number state, LRR is normally distributed

    • Emission Probability of BAF
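
    As a rough illustration of a normally distributed LRR emission, the sketch below evaluates a Gaussian density per copy number and picks the copy number with the highest density for a single marker. The means, the shared standard deviation, and the function names are assumptions chosen for the example, not PennCNV's fitted parameters.

```python
import math

# Hypothetical LRR mean per copy number and a shared standard deviation.
LRR_MEAN = {0: -3.5, 1: -0.66, 2: 0.0, 3: 0.4, 4: 0.68}
LRR_SD = 0.2

def lrr_emission(lrr, copy_number):
    """Gaussian density of an LRR value given a copy number."""
    z = (lrr - LRR_MEAN[copy_number]) / LRR_SD
    return math.exp(-0.5 * z * z) / (LRR_SD * math.sqrt(2.0 * math.pi))

def most_likely_copy_number(lrr):
    """Copy number whose emission density is highest for this single marker."""
    return max(LRR_MEAN, key=lambda cn: lrr_emission(lrr, cn))

for value in (-0.7, 0.05, 0.45):
    print(value, "->", most_likely_copy_number(value))
```

    A per-marker decision like this ignores neighboring markers, which is the information the HMM adds through its transition probabilities.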

    5.2.5 Copy Number States

    6 states:

    • State 1: CN = 0 (deletion of both copies)
    • State 2: CN = 1 (deletion of one copy)
    • State 3: CN = 2 (normal)
    • State 4: CN = 2 (copy-neutral with LOH)
    • State 5: CN = 3 (single-copy duplication)
    • State 6: CN = 4 (double-copy duplication)

    Each state shows a distinct pattern in the plot.

    5.2.6 Hidden states, copy numbers, CNV genotypes, and their descriptions

    6 CNV Calling

    6.1 Viterbi algorithm for calling

    • Calculate the most likely path through the HMM (one state from 1-6 for each SNP marker)
    • Report every stretch of non-normal states along that path as a CNV call
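
    The two steps above can be sketched with a deliberately reduced model. This is not PennCNV's implementation: it assumes only three states instead of six, uses LRR alone, and all means, the standard deviation, and the transition probability are hypothetical values.

```python
import math

STATES = ["del", "normal", "dup"]                    # simplified state set
LRR_MEAN = {"del": -0.6, "normal": 0.0, "dup": 0.4}  # hypothetical means
LRR_SD = 0.2                                         # hypothetical standard deviation
STAY = 0.98                                          # hypothetical self-transition probability

def log_emission(state, lrr):
    z = (lrr - LRR_MEAN[state]) / LRR_SD
    return -0.5 * z * z - math.log(LRR_SD * math.sqrt(2.0 * math.pi))

def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))

def viterbi(lrrs):
    """Return the most likely state path for a sequence of LRR values."""
    # score[s]: best log-probability of any path ending in state s at the current marker
    score = {s: math.log(1.0 / len(STATES)) + log_emission(s, lrrs[0]) for s in STATES}
    backptrs = []
    for x in lrrs[1:]:
        new_score, ptr = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            ptr[cur] = best_prev
            new_score[cur] = score[best_prev] + log_transition(best_prev, cur) + log_emission(cur, x)
        score = new_score
        backptrs.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):  # trace back from the last marker to the first
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def cnv_calls(path):
    """Collapse runs of non-normal states into (first_index, last_index, state)."""
    calls, start = [], None
    for i, s in enumerate(path + ["normal"]):  # sentinel closes a trailing run
        if start is not None and s != path[start]:
            calls.append((start, i - 1, path[start]))
            start = None
        if start is None and s != "normal":
            start = i
    return calls

lrrs = [0.02, -0.05, 0.01, -0.62, -0.55, -0.70, -0.58, 0.03, -0.01, 0.04]
path = viterbi(lrrs)
calls = cnv_calls(path)
print(path)
print(calls)
```

    Working in log space avoids numerical underflow on long marker sequences, and the high self-transition probability discourages calls supported by a single noisy marker.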

    6.2 Other Types of Signal Data

    PennCNV can be applied to data from other technical platforms:

    • Transformation of signal data to LRR/BAF:
      • Affymetrix whole-genome SNP genotyping array
      • Perlegen whole-genome SNP genotyping array
    • Use information from LRR only:
      • BAC clone based array-CGH
      • Oligonucleotide arrays
      • Non-polymorphic markers in recent SNP genotyping arrays

    6.3 PennCNV-Affy Pipeline

    SNP → CNV → genotype call, but SNP arrays are relatively inefficient

    6.4 Joint Modeling on Family Data

    • Most CNVs demonstrate Mendelian inheritance

    • Incorporating family relationships can potentially improve the sensitivity of CNV calling

    • Example of Inherited CNV

    • Example of de novo CNV

    6.5 Joint modeling of the CNVs in a trio

    6.6 Likelihood of Signal Intensities

    By treating the trio as a unit, this calling algorithm can avoid generating calls that are Mendelian inconsistent but preserve the ability to allow de novo events.
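
    To make "Mendelian inconsistent" concrete, here is a small rule-based check on total copy numbers in a trio. It is only an illustration, not the joint likelihood model described here. It rests on a strong, hypothetical assumption about how each total copy number splits across the two haplotypes (for example, CN = 2 is always 1 + 1), which is encoded in the `TRANSMISSIBLE` table.

```python
# Assumed haplotype split per total copy number (a simplification for illustration):
# CN 0 = 0+0, CN 1 = 0+1, CN 2 = 1+1, CN 3 = 1+2, CN 4 = 2+2.
# The value is the set of copy numbers a parent can pass on through one haplotype.
TRANSMISSIBLE = {0: {0}, 1: {0, 1}, 2: {1}, 3: {1, 2}, 4: {2}}

def mendelian_consistent(father_cn, mother_cn, child_cn):
    """True if the child's copy number can be formed from one haplotype of each parent."""
    return any(
        f + m == child_cn
        for f in TRANSMISSIBLE[father_cn]
        for m in TRANSMISSIBLE[mother_cn]
    )

def describe_trio(father_cn, mother_cn, child_cn):
    if mendelian_consistent(father_cn, mother_cn, child_cn):
        return "consistent with inheritance"
    return "candidate de novo event"

print(describe_trio(1, 2, 1))  # father carries a one-copy deletion
print(describe_trio(2, 2, 1))  # neither parent carries the deletion
```

    A hard rule like this would discard every inconsistent call or accept it blindly. Scoring the trio jointly lets a de novo event survive when the signal evidence is strong enough.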

    Likelihood of an observation sequence given a state sequence, or likelihood of an observation sequence along a single path: given an observation sequence X = {x_{1}, x_{2}, ..., x_{T}} and a state sequence Q = {q_{1}, ..., q_{T}} (of the same length) determined from an HMM with parameters Θ, the likelihood of X along the path Q is equal to:

    p(X|Q, Θ) = \prod_{t=1}^{T} p(x_{t}|q_{t}, Θ) = b_{q_{1}}(x_{1}) · b_{q_{2}}(x_{2}) · · · b_{q_{T}}(x_{T})

    https://www.cs.ubc.ca/~murphyk/Software/HMM/labman2.pdf
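
    A product of many small probabilities underflows quickly, so the path likelihood is usually evaluated as a sum of logarithms. As a worked form, write b_{q_t}(x_t) = p(x_t | q_t, Θ) and assume Gaussian emissions with a state-specific mean \mu_{q} and a shared standard deviation \sigma (both symbols are assumptions introduced here, not taken from the source):

```latex
\log p(X \mid Q, \Theta)
  = \sum_{t=1}^{T} \log b_{q_t}(x_t)
  = -\frac{T}{2}\,\log\!\left(2\pi\sigma^{2}\right)
    - \sum_{t=1}^{T} \frac{\left(x_t - \mu_{q_t}\right)^{2}}{2\sigma^{2}}
```

    Under these assumptions the first term is the same for every path, so comparing two paths for the same observations reduces to comparing their summed squared deviations.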

    7 NGS-based SV detection

    SV: Structural Variants

    Escaramís G, et al. Briefings in Functional Genomics, 2015

    Particularly complex SVs require de novo assembly, with the resulting contigs aligned to the reference.

    (A) Read depth. Reads are aligned to the reference genome; compared with diploid regions, a deleted region shows a reduced number of reads and a duplicated region shows higher read depth.

    (B) Paired reads. Pairs of sequence reads are mapped to the reference genome (from left to right): (1) no SV, pairs are aligned in the correct order and orientation, and span the distance expected from the library's insert size; (2) deletion, the aligned pairs span farther apart than expected from the library insert size; (3) tandem duplication, read pairs are aligned in an unexpected order, where the expected order means that the leftmost read aligns to the forward strand and the rightmost read to the reverse strand; (4) novel sequence insertion, the pairs are aligned closer together than expected from the library insert size; (5) inversion, read pairs are aligned in the wrong orientation, with both reads on either the forward or the reverse strand; and (6) read pairs mapped to different chromosomes.

    (C) Split reads. Sequenced reads pointing to the same breakpoint are split at the nucleotide where the breakpoint occurs. The corresponding paired read is properly aligned to the reference genome.

    (D) De novo assembly. Sample reads from novel sequence insertions are assembled without a reference genome sequence.

    • The read-pair (RP) method evaluates how likely the observed insert sizes are under the expected insert-size distribution; significant deviations indicate deletions or insertions.
    • Read-depth-based algorithms report the number of copies of a sequence in the genome.

    Ye K, Hall G, Ning Z. Structural variation detection from next generation sequencing[J]. Next Generat Sequenc & Applic, 2016, 1(007).

    7.1 Read count-based methods for SV detection

    • Detect the change of read count/sequencing coverage in a certain region
    • Examples of software tools: CNVnator, BIC-SEQ2, PennCNV-Seq
    • Limitations:
      • Only detects unbalanced events (copy number variation)
      • Cannot resolve breakpoints at base pair resolution
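
    The read-count idea can be sketched with fixed-size windows. This is a toy illustration and does not reproduce the algorithm of any of the tools listed above; the window counts, the pseudocount, and the loss/gain thresholds are hypothetical, and the median window is assumed to represent the diploid baseline.

```python
import math
from statistics import median

def window_counts(read_starts, region_length, window):
    """Count reads per fixed-size window from 0-based read start positions."""
    counts = [0] * math.ceil(region_length / window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def log2_ratios(counts):
    """log2 of each window's count over the median count (assumed diploid baseline)."""
    base = median(counts)
    return [math.log2((c + 0.5) / (base + 0.5)) for c in counts]  # 0.5 avoids log2(0)

def call_windows(ratios, loss=-0.5, gain=0.4):
    """Label each window using hypothetical log2-ratio thresholds."""
    return ["loss" if r < loss else "gain" if r > gain else "neutral" for r in ratios]

counts = [100, 98, 103, 52, 49, 101, 150, 99]  # made-up per-window read counts
labels = call_windows(log2_ratios(counts))
print(labels)
```

    The breakpoints can only be placed at window boundaries, which illustrates why read-count methods cannot reach base-pair resolution.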

    7.2 Detection of SVs from discordant read pairs

    • Widely used software tools: Delly, Lumpy

    • Pattern of deletions: large gaps between read pairs

      Rausch T, et al. (2012) Bioinformatics
    • Pattern of inversions: same orientation between read pairs

      Rausch T, et al. (2012) Bioinformatics
    • Pattern of tandem duplications: the first and second reads swap their relative order

      Rausch T, et al. (2012) Bioinformatics
    • Pattern of translocations: paired-ends mapping to different chromosomes
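
    The four patterns above can be written as a small classifier over read pairs. This is a simplified sketch, not how Delly or Lumpy work internally: the `ReadPair` type, the insert-size mean and standard deviation, and the cutoff of four standard deviations are all hypothetical, and the distance between read start positions stands in for the insert size.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom_a: str
    pos_a: int
    strand_a: str  # "+" for forward, "-" for reverse
    chrom_b: str
    pos_b: int
    strand_b: str

def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=4):
    """Label a read pair by the discordance pattern it shows, if any."""
    if pair.chrom_a != pair.chrom_b:
        return "translocation"
    # Order the two reads by mapped position: (left, right).
    (pos_l, strand_l), (pos_r, strand_r) = sorted(
        [(pair.pos_a, pair.strand_a), (pair.pos_b, pair.strand_b)]
    )
    if strand_l == strand_r:
        return "inversion"
    if strand_l == "-" and strand_r == "+":
        return "tandem_duplication"  # reads swapped their relative order
    if pos_r - pos_l > mean_insert + n_sd * sd_insert:
        return "deletion"            # pair spans much more than the library allows
    return "concordant"

pairs = [
    ReadPair("chr1", 1000, "+", "chr1", 1400, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 5000, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1400, "+"),
    ReadPair("chr1", 1000, "-", "chr1", 1400, "+"),
    ReadPair("chr1", 1000, "+", "chr2", 1400, "-"),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```

    A single discordant pair is weak evidence. In practice several pairs supporting the same event would be clustered before a call is made.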

    7.5 Detection of SVs using assembly-based methods

    Most short-read, assembly-based methods for SV detection use a reference-assisted approach. After alignment to the reference, unmapped reads and reads whose mate is missing are collected, and a local assembly is performed to generate a contig that represents the actual local structural variation.

    https://www.1010genome.com/sv-detection/

    • De novo sequence assembly (AS) enables the fine-scale discovery of SVs, including novel (non-reference) sequence insertions.
    • Either global or local assembly may be used to discover SVs.
    • Example tools:
      • SvABA (genome-wide detection of structural variants and indels by local assembly)
      • novoBreak (local assembly for breakpoint detection in cancer genomes)
      • TIGRA (a targeted iterative graph routing assembler for breakpoint assembly)

    7.6 SV detection from long-read sequencing

    PacBio and Oxford Nanopore platforms offer a different view of structural variation in a genome with the help of their average read lengths of >10 kb. Low coverage of 10x for these long reads can help detect a high percentage of structural variations (>80%) in a complex genome.

    https://www.1010genome.com/sv-detection/

    • Multiple alignment tools have been developed to map long reads to the reference genome

      • Minimap2: an ultra-fast long-read alignment tool
      • NGMLR: an aligner developed specifically for SV discovery
      • BLASR: an aligner developed for PacBio reads
      • BWA-MEM: an early aligner for long reads, which could be replaced by Minimap2
    • Several tools have been developed to detect SVs from long read sequencing

      http://schatz-lab.org/presentations/2018/2018.PAG.G10kTalk.pdf

      • SV callers for PacBio reads:
        • PBSV
        • SMRT-SV
        • PBHoney
        • Sniffles
      • SV callers for Nanopore reads:
        • NanoSV
        • Sniffles
        • Picky
    • Short/long reads on SV detection

      The vast majority of SVs can be found with short-read sequencing

    7.7 Bionano optical mapping for SV detection

    Optical mapping techniques such as Bionano Genomics further enhance the ability of NGS-based approaches to detect large and complex SVs. Optical mapping generates images of megabase-size DNA molecules, which in turn produce genome maps.

    https://www.1010genome.com/sv-detection/

    Bionano uses a nanochannel array to detect a characteristic 6- or 7-nucleotide sequence motif along very long genomic segments.

    SV detection from Single-molecule optical mapping

    • To identify a structural variation, a de novo genome map assembly can be aligned to a reference genome.
    • By observing changes in label spacing and comparisons of order, position, and orientation of label patterns, SVs can be detected.
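
    The label-spacing comparison can be sketched as below. It assumes the hard part is already done: the labels of the sample map have been matched one-to-one with the reference labels. The function name, the label positions, and the 500 bp tolerance are hypothetical, and changes in label order or orientation are not handled.

```python
def spacing_svs(ref_labels, sample_labels, min_diff=500):
    """Compare distances between consecutive matched labels (positions in bp).

    Returns (ref_start, ref_end, sv_type, size_difference) for each interval whose
    length differs from the reference by at least `min_diff`.
    """
    if len(ref_labels) != len(sample_labels):
        raise ValueError("labels must be matched one-to-one")
    calls = []
    for i in range(len(ref_labels) - 1):
        ref_gap = ref_labels[i + 1] - ref_labels[i]
        sample_gap = sample_labels[i + 1] - sample_labels[i]
        diff = sample_gap - ref_gap
        if diff >= min_diff:
            calls.append((ref_labels[i], ref_labels[i + 1], "insertion", diff))
        elif diff <= -min_diff:
            calls.append((ref_labels[i], ref_labels[i + 1], "deletion", -diff))
    return calls

ref = [0, 10000, 25000, 40000, 52000]     # made-up reference label positions
sample = [0, 10050, 25040, 45100, 57080]  # made-up sample map label positions
calls = spacing_svs(ref, sample)
print(calls)
```

    The tolerance absorbs small measurement differences between maps, so only intervals that grow or shrink substantially are reported.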
