Today I'm sharing Chapter 5 of the collection Clinical Bioinformatics (a laboratory guide to clinical bioinformatics): Bioinformatics Challenges in Genome-Wide Association Studies (GWAS).
De R., Bush W.S., Moore J.H. (2014) Bioinformatics Challenges in Genome-Wide Association Studies (GWAS). In: Trent R. (eds) Clinical Bioinformatics. Methods in Molecular Biology (Methods and Protocols), vol 1168. Humana Press, New York, NY
[Figure: mind map summarizing the chapter]
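One of the authors, Professor Jason H. Moore, works at the Geisel School of Medicine at Dartmouth. His research covers biostatistics, epidemiology and genomics. He developed the SPARCoC software and also wrote a book, Computational Methods for Genetics of Complex Traits (2010), which I will track down once I can afford it...
It really is expensive. Anyway, back to the chapter.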
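Abstract: This chapter reviews the basic concepts of GWAS, the technologies used to capture genetic variation, the missing heritability problem, efficient study design, how to reduce the bias introduced into a dataset, and how to make use of new resources such as electronic medical records.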
Key words:Data imputation, Epistasis, Electronic medical records, Filtering, Gene–gene interactions, GWAS, Meta-analysis, Missing heritability, Replication
I. Introduction
GWAS is built on the Common Disease-Common Variant (CD-CV) hypothesis: common diseases (such as type 2 diabetes, rheumatoid arthritis or essential hypertension) are caused in part by genetic variations that are also common in the population.
How SNP effect sizes relate to disease heritability: If common variants have a small effect size but common diseases show a strong inheritance in families (high heritability), then almost by definition the disease must be influenced by multiple genetic factors.
The missing heritability problem: GWAS has had limited success in detecting genetic variants that account for a large portion of the heritability of any common disease trait. As an example, the authors note that two loci found in breast cancer studies explain only 5.9% of the familial risk of breast cancer.
* One cause is epistasis (epistatic interactions). Biological epistasis refers to the physical interactions between biomolecules that are influenced by multiple genetic variants. Statistical epistasis is the term for the nonadditive interactions between multiple genes, each of which affects disease susceptibility, and the environment.
* Proposed remedies: 1) Designing our studies to search for nonlinear interactions amongst SNPs. 2) Using methods such as meta-analysis and data imputation to increase our statistical power. 3) Establishing strict criteria for defining phenotypes.
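II. Materials
This section introduces the Illumina and Affymetrix genotyping platforms and the use of Electronic Medical Records; I skip it here.
III. Methods
[Figure: Overview of the GWAS process]
1 Basic concepts: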
SNP: single base pair changes in the DNA sequence; SNPs have now become the modern unit of genetic variation
MAF: the frequency of the less common allele is referred to as the minor allele frequency
LD: linkage disequilibrium is a measure of correlation between SNP alleles at one site and the specific alleles carried at variant sites nearby. It is measured with D′ or r².
Haplotype: a particular combination of alleles along a chromosome
tag SNPs: SNPs in strong LD with the variants surrounding them; these are the SNPs that are ultimately selected
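To make the two LD measures concrete, here is a minimal sketch of how D, D′ and r² can be computed for a pair of biallelic SNPs. It assumes the haplotype frequency is already known (in real data it has to be estimated by phasing), and the function name `ld_stats` and its arguments are made up for illustration; they are not from the chapter.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    (All three inputs are assumed known; estimating p_ab requires phased data.)
    """
    # D is the deviation of the observed haplotype frequency from independence
    d = p_ab - p_a * p_b

    # D' rescales D by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r2 is the squared correlation between the two allele indicators
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Allele A never occurs without allele B, but B is much more common than A
    d, d_prime, r2 = ld_stats(p_ab=0.3, p_a=0.3, p_b=0.6)
    print(f"D={d:.3f}  D'={d_prime:.3f}  r2={r2:.3f}")
```

In the example, D′ is close to 1 while r² is much lower, which is one reason tag-SNP selection is usually described in terms of r²: a high D′ alone does not mean one SNP predicts the other well.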
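2 Study design:
(1) Case-Control vs. Quantitative
Case-control studies usually have a binary outcome, such as case/control or affected/unaffected. If a SNP allele is more frequent in cases than in controls, the SNP is associated with increased disease risk. Quantitative studies assess quantitative or continuous traits measured as numeric values (such as HDL or LDL) and ask whether a SNP or allele is associated with the trait value.
(2) Standardizing Phenotype Criteria
A standardized definition of the phenotype is very important, especially in multi-institution collaborations. Misclassifying a patient from case to control in a case-control study can sometimes do far more damage than recording a wrong value in a quantitative study.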
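(3) Testing for an Association (key section)
1) Preparation
Choosing an appropriate method: an association test can relate either alleles (allelic) or genotypes (genotypic) to the phenotype, and a dominant, recessive or additive model should be chosen to fit the situation
Adjusting the dataset: use regression to adjust for covariates so that false positive results are avoided
Population substructure: an important covariate, since ethnic-specific SNPs may show up to be associated with a trait due to population stratification; it can be analyzed with STRUCTURE or EIGENSTRAT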
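The difference between the dominant, recessive and additive models comes down to how a genotype is turned into a number before testing. The sketch below shows one common coding, assuming genotypes are stored as the count of minor alleles (0, 1 or 2) and that the minor allele is the one being tested; the function name `encode_genotype` is hypothetical.

```python
def encode_genotype(minor_allele_count, model="additive"):
    """Code a genotype (0, 1 or 2 copies of the minor allele) under a genetic model.

    additive:  each extra copy adds the same effect      -> 0, 1, 2
    dominant:  one copy is enough to show the effect     -> 0, 1, 1
    recessive: two copies are needed to show the effect  -> 0, 0, 1
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 minor alleles")
    if model == "additive":
        return minor_allele_count
    if model == "dominant":
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        return 1 if minor_allele_count == 2 else 0
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    genotypes = [0, 1, 2]
    for model in ("additive", "dominant", "recessive"):
        print(model, [encode_genotype(g, model) for g in genotypes])
```

The coded value is what then enters the regression or contingency table, so the choice of model is made before any test statistic is computed.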
2) Single-locus vs. multi-locus
For binary traits in case-control studies, a contingency table method or logistic regression is commonly used.
* A contingency table summarizes the number of individuals within each genotypic group for a single biallelic SNP. The test looks for a deviation from the null hypothesis that there is no association between the phenotype and genotype, e.g. with the chi-square test or Fisher's exact test, run in SAS, SPSS, Stata, or Microsoft Excel.
* Logistic regression is an extension of linear regression where the phenotypic outcome studied is transformed using a logistic function. This method predicts the probability of an individual having a case status, given their genotype class. It is more widely used because it allows adjustment for covariates.
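As an illustration of the contingency table approach, the sketch below runs a 2x2 allelic chi-square test (alleles rather than genotypes are counted, which is an assumption made here to keep the table small). It uses only the standard library: for one degree of freedom the chi-square p-value equals erfc(sqrt(chi2 / 2)). The function name and the counts in the example are invented.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 allele table.

    Returns (chi2, p_value, odds_ratio).
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")

    # Shortcut formula for the Pearson statistic of a 2x2 table
    chi2 = n * (a * d - b * c) ** 2 / denom

    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Odds of carrying the minor allele in cases relative to controls
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Made-up counts: 100 cases and 100 controls, i.e. 200 alleles per group
    chi2, p, odds = allelic_chi_square(60, 140, 40, 160)
    print(f"chi2={chi2:.3f}  p={p:.4f}  OR={odds:.3f}")
```

A p-value around 0.02 would look significant for a single test, but it is nowhere near a genome-wide threshold, which is why the multiple-testing step later in the chapter matters.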
For quantitative traits, Analysis of Variance (ANOVA) is commonly used. It assumes that 1) the trait is normally distributed, 2) the variance of the trait is the same within each group, and 3) the groups are independent. For single-SNP analysis, the null hypothesis of ANOVA is that the trait mean does not differ between the genotype groups.
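For a single SNP, ANOVA compares the trait values of the three genotype groups. The sketch below computes only the F statistic and its degrees of freedom with the standard library; turning F into a p-value needs the F distribution (for example `scipy.stats.f_oneway`), which is left out here. The function name and the toy trait values are invented.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of trait values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(x for g in groups for x in g) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Toy trait values grouped by genotype: AA, Aa, aa
    trait_by_genotype = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
    f_stat, df_b, df_w = one_way_anova_f(trait_by_genotype)
    print(f"F={f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

A large F means the genotype group means differ more than the within-group scatter would suggest, which is evidence against the null hypothesis of equal means.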
PLINK is a commonly used program for GWAS analysis. It is powerful and easy to use, and it can test association under allelic or inheritance models, or with the Cochran-Armitage test (a contingency table method).
Because analyzing one SNP at a time in a linear modeling framework leads to the missing heritability problem mentioned earlier, multi-locus analysis is needed: more holistic approaches that recognize the complex landscape of the genotype–phenotype relationship and examine nonlinear interactions between genetic variants throughout the genome. The biggest challenge here is that processing 500,000 SNPs consumes a large amount of computing resources, so specific filtering methods are needed to reduce the computational burden.
A typical single-SNP GWAS analysis first filters on MAF and LD (which still leaves about 300,000 SNPs), and then applies a significance threshold to pick out main-effect markers (single SNPs strongly associated with the disease).
Another filtering approach checks whether markers interact within a given pathway or protein family: the dataset can also be filtered so that only those multi-marker interactions will be examined that fit within a certain biological context such as a biological pathway, protein family, and group of genes or proteins involved in a certain molecular function.
For example, the Biofilter algorithm combines biomedical knowledge from multiple public repositories with statistical methods such as logistic regression or the multifactor dimensionality reduction (MDR) method to analyze SNP–SNP combinations.
3) Post-analysis correction
The p-value is defined as the probability of observing a test statistic that is equal to or greater than the observed test statistic, if the null hypothesis is true. (Related reading: the problems with p-values.)
Multiple-testing correction methods commonly used in GWAS include:
* The Bonferroni correction
* Adjusting the False Discovery Rate (FDR)
* Using permutation testing to adjust the significance threshold, with PLINK, PRESTO, or PERMORY
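The first two corrections in the list are simple enough to sketch in a few lines. The code below assumes the Benjamini-Hochberg procedure as the FDR method (the chapter summary above does not say which FDR procedure is meant), and the function names and example p-values are invented. Permutation testing is not shown.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * q
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    cutoff = bonferroni_threshold(0.05, len(p_values))
    print("Bonferroni:", [p <= cutoff for p in p_values])
    print("BH (FDR)  :", benjamini_hochberg(p_values, q=0.05))
    # With 500,000 SNPs the Bonferroni threshold becomes 0.05 / 500000 = 1e-7
    print("GWAS-scale threshold:", bonferroni_threshold(0.05, 500_000))
```

On this toy list Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps the two smallest, which shows the usual trade-off: FDR control is less strict and so rejects more.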
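(4) Replicating results
The sole purpose of replication is to evaluate the initial positive findings of a GWAS and confirm their validity and credibility.
1) Statistical Replication
Statistical replication requires the following conditions:
* A sufficiently large sample size. This is critical because of the winner's curse (the effect estimated in the GWAS discovery sample is inflated, i.e. larger than the true effect in the population).
* Replication must be carried out in an independent dataset drawn from the same population, and should use the same criteria to define the phenotype in question.
* Because GWAS markers are chosen on the basis of LD patterns, replication should target a genomic region, not necessarily the specific SNP reported in the original study.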
2)Meta-analysis
Meta-analysis is a statistical method for combining several different studies to provide one summary result; it aims to examine the effect of the same allele across all studies (provided that all studies are based on the same hypothesis). Heterogeneity can be measured with Cochran's Q or the I² statistic.
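A minimal sketch of how a summary effect and the two heterogeneity measures can be computed from per-study results. It assumes a fixed-effect, inverse-variance weighted model (one common choice; the text above does not commit to a specific model), that every study reports an effect estimate and standard error for the same allele, and the function name and numbers are invented.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance fixed-effect meta-analysis.

    Returns (pooled_beta, pooled_se, cochran_q, i_squared_percent).
    """
    if len(betas) != len(standard_errors) or len(betas) < 2:
        raise ValueError("need matching effect sizes and standard errors from >= 2 studies")

    # More precise studies (smaller standard error) get more weight
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations of each study from the pooled effect
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1

    # I2: share of the variation across studies that exceeds what chance would give
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled_beta, pooled_se, q, i_squared


if __name__ == "__main__":
    # Made-up per-study effect estimates for the same allele
    beta, se, q, i2 = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
    print(f"pooled beta={beta:.3f}  se={se:.3f}  Q={q:.2f}  I2={i2:.0f}%")
```

When I² is high, a fixed-effect summary is usually considered questionable and a random-effects model or a closer look at the individual studies is the typical next step.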
3)Data Imputation
The imputation procedure makes use of the known LD and haplotype patterns in reference panels to estimate genotypes for SNPs that were not directly genotyped within a study. Commonly used programs include BimBam, IMPUTE, MaCH, and Beagle (all based on haplotype phasing algorithms, which estimate the contiguous set of alleles that lie on a specific chromosome).
IV. Outlook
Although, as the content of genotyping chips, cohort sizes, and biobanks grow even larger, the challenges of data manipulation, quality control, strong study design, and strict phenotypic definitions grow more complex. Hence, moving forward human geneticists will have to develop bioinformatics infrastructure and expertise to overcome such challenges. Most importantly, scientists will have to combine their bioinformatics efforts with genetics, biochemistry and cell biology to confirm the functional consequence and biological relevance of the genotype–phenotype associations that are identified.
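This chapter gives a concise overview of the basic concepts and principles of GWAS analysis in clinical medicine and of how to choose and use association models. In particular, it points out the shortcomings of current GWAS and how to avoid errors in practice. I suggest that anyone learning GWAS read this introduction first, then look up unfamiliar terms and the usage of common software according to their own level. You are also welcome to read another Jianshu article on the basics of GWAS analysis.
It has been more than ten years since GWAS was proposed. It has played an important role, but it also has many problems (see Further reading) and much room for improvement. As the authors say at the end of Future Directions: 'Ultimately, the translation of GWAS findings into clinical practice will rely upon correct assumptions regarding the genetic architecture of complex traits especially in the context of gene–gene and gene–environment interactions.'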
References:
See the original chapter.
Further reading:
Reader comments
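“Thanks to Jimmy for the hand-picked recommendation! I come from a computer science background and my biology is weak, so I have spent a lot of time filling in gaps. I have worked with GWAS for a while, but I still only half understand many issues, so please correct me if anything here is wrong.
The chapter introduced above covers a lot of ground: basic GWAS concepts, study design and methods, validation of results, and more. Based on my own understanding, let me add a few things you may run into in practice~
1. Quality control
The main QC metrics are Qual, MAF, HWE, --geno and --mind.
Qual refers to the quality information in the VCF; sites that fail should be removed at the VCF stage. The other four are QC parameters supported by PLINK. As noted above, GWAS rests on the CD-CV hypothesis, so sites with a very low MAF tend to introduce errors and should be removed. --geno and --mind remove markers and individuals, respectively, whose missing rate is too high. I have not used LD or tag-SNP filtering much. My understanding is that, apart from dropping some sites to reduce computation, these two filters do not improve the accuracy of the results, and poorly chosen parameters may even make the final mapped interval less precise. When a run finishes within half a day I generally leave them alone.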
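To show what the per-marker filters are measuring, here is a small Python sketch that computes the missing rate, MAF and an HWE p-value for one SNP. It is not PLINK: genotypes are assumed to be coded as 0/1/2 copies of one allele with None for a missing call, the HWE test is a simple 1-df chi-square approximation, and the thresholds in `passes_qc` are placeholder values chosen only for the example.

```python
import math


def snp_qc_metrics(genotypes):
    """Return (missing_rate, maf, hwe_p) for one SNP.

    genotypes: list with 0, 1 or 2 (copies of one allele) or None (missing call).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 1.0, 0.0, 1.0  # nothing was called for this SNP

    missing_rate = 1.0 - n / len(genotypes)

    n_hom_ref = called.count(0)
    n_het = called.count(1)
    n_hom_alt = called.count(2)

    # Frequency of the counted allele; the MAF is whichever allele is rarer
    p = (2 * n_hom_alt + n_het) / (2 * n)
    maf = min(p, 1.0 - p)

    # Genotype counts expected under Hardy-Weinberg equilibrium
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return missing_rate, maf, hwe_p


def passes_qc(genotypes, max_missing=0.1, min_maf=0.05, min_hwe_p=1e-6):
    """Apply illustrative thresholds (assumed values, not recommendations)."""
    missing_rate, maf, hwe_p = snp_qc_metrics(genotypes)
    return missing_rate <= max_missing and maf >= min_maf and hwe_p >= min_hwe_p


if __name__ == "__main__":
    good_snp = [0] * 25 + [1] * 50 + [2] * 25
    sparse_snp = [0] * 20 + [1] * 40 + [2] * 20 + [None] * 20
    all_het_snp = [1] * 100  # far too many heterozygotes for HWE
    for name, snp in [("good", good_snp), ("sparse", sparse_snp), ("all_het", all_het_snp)]:
        print(name, snp_qc_metrics(snp), passes_qc(snp))
```

The per-individual missing rate works the same way with rows and columns swapped: count the missing calls across all SNPs for one sample instead of across all samples for one SNP.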
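2. Imputation
After QC some sites still have missing genotypes, and these need to be imputed; otherwise the association software will drop those sites or simply throw an error. I usually use Beagle for this step. I have not used the other tools mentioned above and do not know them well.
3. Preprocessing
This part mainly computes Kinship and PCA, which are added to the model as corrections. There are two assumptions behind this:
1. We assume the individuals in a GWAS come from one large population, but in practice they do not, so Q or PCs can be used to correct for population stratification.
2. We assume the individuals in a GWAS are independent of each other, but in practice they are not, so K can be used to correct for relatedness between individuals.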
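As a rough idea of what the K matrix contains, the sketch below builds a genomic relationship matrix in plain Python. It assumes the widely used VanRaden-style estimator (center each SNP by twice its allele frequency, then scale the cross-products); the comment above does not say which estimator is used, so this is only one possible choice, and the function name and toy genotypes are invented. PCA is not shown.

```python
def vanraden_kinship(genotypes):
    """Genomic relationship matrix from a complete 0/1/2 genotype matrix.

    genotypes: list of rows, one row per individual, one column per SNP.
    Returns an n x n list of lists (n = number of individuals).
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Allele frequency of the counted allele at each SNP
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]

    # Scaling factor: total expected genotype variance across SNPs
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all SNPs are monomorphic; kinship is undefined")

    # Center each genotype by its expected value 2p
    centered = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]

    # K = Z Z^T / scale
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale for k in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three individuals, two SNPs; individuals 0 and 2 have identical genotypes
    toy = [[0, 2], [2, 0], [0, 2]]
    for row in vanraden_kinship(toy):
        print([round(x, 3) for x in row])
```

In the toy output, individuals 0 and 2 share the same row, and their mutual entry equals their own diagonal entry, which is what identical genotypes should give; with real data the matrix is computed over many thousands of SNPs.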
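4. Association analysis
4.1 Model selection
Human datasets have large samples and the individuals are essentially independent, which satisfies the second assumption in the previous point, so GLM+PCs can be used and correcting for population stratification is enough. For animals and plants, Kinship must be added to correct for relatedness. Besides PLINK there is plenty of other software. Our lab develops MVP (an R package), which includes three models: GLM, MLM and FarmCPU (Github). Gemma (Official Site), written in C++, provides binary executables that can be downloaded and used directly, and it is very simple to use.
4.2 Interpreting results
Correction method: Bonferroni is the usual choice. Although Bonferroni is fairly strict, a site that passes the threshold is still not guaranteed to be correct. Our lab is currently experimenting with permutation-based correction methods.
Multiple models: if a site passes the corrected threshold in several models, that does not make it more likely to be correct. A false site that can pass the threshold in one model will easily pass it in other models too, so running several models is mostly reassurance... In the end you have to rely on experimental validation.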
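4.3 Fine mapping
A SNP is only a marker. GWAS results are statistically meaningful but not necessarily biologically meaningful, so once some sites pass the corrected threshold you need to see which regions they fall in, extract the genes in those regions, and filter them to determine candidate genes. There are two main ways to define the region: 1. a fixed interval upstream and downstream; 2. the LD block the site sits in. Filtering genes means checking whether things like functional annotation are relevant to your trait and discarding genes that are not; integrating multi-omics data is another option.
I wrote this in a hurry and rather briefly; anyone interested is welcome to discuss~”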