Background notes on GWAS

Author: 期待未来 | Published: 2021-08-10 13:48

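    Hardy-Weinberg equilibrium (HWE)

    The Hardy-Weinberg principle states that, under ideal conditions, allele frequencies stay constant from one generation to the next, i.e. the population stays in genetic equilibrium. It is used in biology, ecology, and genetics. Conditions: (1) the population is large enough; (2) individuals mate at random; (3) no mutation; (4) no selection; (5) no migration; (6) no genetic drift.

    • These are commonly summarized as five conditions, because the absence of genetic drift follows from a large population:
    1. No selection
    2. No mutation
    3. No migration
    4. Large population
    5. Random mating
    • Cases may deviate from HWE at loci that are truly associated with the disease, so in a GWAS the HWE check is run on the control samples only.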

    • English explanation: The Hardy-Weinberg equilibrium is a principle stating that the genetic variation in a population will remain constant from one generation to the next in the absence of disturbing factors. When mating is random in a large population with no disruptive circumstances, the law predicts that both genotype and allele frequencies will remain constant because they are in equilibrium. The Hardy-Weinberg equilibrium can be disturbed by a number of forces, including mutations, natural selection, nonrandom mating, genetic drift, and gene flow. For instance, mutations disrupt the equilibrium of allele frequencies by introducing new alleles into a population. Similarly, natural selection and nonrandom mating disrupt the Hardy-Weinberg equilibrium because they result in changes in gene frequencies. This occurs because certain alleles help or harm the reproductive success of the organisms that carry them. Another factor that can upset this equilibrium is genetic drift, which occurs when allele frequencies grow higher or lower by chance and typically takes place in small populations. Gene flow, which occurs when breeding between two populations transfers new alleles into a population, can also alter the Hardy-Weinberg equilibrium. Because all of these disruptive forces commonly occur in nature, the Hardy-Weinberg equilibrium rarely applies in reality. Therefore, the Hardy-Weinberg equilibrium describes an idealized state, and genetic variations in nature can be measured as changes from this equilibrium state.

    • Removing SNPs out of Hardy-Weinberg equilibrium (p-value below a threshold of 10^-6 to 10^-4)
      Population genetic theory suggests that under 'normal' conditions, there is a predictable relationship between allele frequencies and genotype frequencies. In cases where the genotype distribution is different from what one would expect based on the allele frequencies, one potential explanation for this is genotyping error. Natural selection is another explanation. For this reason, we typically check for deviation from Hardy-Weinberg equilibrium in the controls for a case-control study. For a quantitative trait, PLINK just uses everyone. PLINK's --hardy option generates p-values for deviation from HWE for each SNP. Low p-values indicate that a SNP is out of HWE.
      Video (in English): https://www.youtube.com/watch?v=7S4WMwesMts&t=106s


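    What is the difference between HWE filtering and MAF filtering?

    I used to confuse these two concepts. One filters on allele frequency, the other on genotype frequencies. For a site with genotypes "AA AT TT", the frequency of A is an allele frequency, and the frequency of AA is a genotype frequency. MAF filtering acts directly on the allele frequency. The HWE test instead derives the expected distribution of (AA, AT, TT) from the allele frequencies, compares it with the observed genotype counts in a goodness-of-fit test, and filters on the resulting p-value. The smaller the p-value, the further the site is from Hardy-Weinberg equilibrium.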
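The distinction can be made concrete with a small sketch. It assumes a biallelic site with alleles A and T and uses a chi-square goodness-of-fit test with 1 degree of freedom. The function name `hwe_chi_square` and the genotype counts are made up for illustration, and tools such as PLINK typically rely on an exact test rather than this approximation.

```python
import math


def hwe_chi_square(n_aa, n_at, n_tt):
    """Chi-square goodness-of-fit test for HWE at a biallelic site.

    Returns (frequency of allele A, chi-square statistic, p-value with 1 df).
    """
    n = n_aa + n_at + n_tt
    p = (2 * n_aa + n_at) / (2 * n)    # allele frequency of A
    q = 1.0 - p                        # allele frequency of T
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_at, n_tt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


# Two hypothetical sites with identical allele frequencies
for counts in [(25, 50, 25), (50, 0, 50)]:
    freq_a, chi2, p_value = hwe_chi_square(*counts)
    maf = min(freq_a, 1.0 - freq_a)
    print(counts, "MAF =", maf, "chi2 =", round(chi2, 3), "HWE p =", p_value)
```

Both sites have a MAF of 0.5 and would pass any MAF filter, but the second one has no heterozygotes at all and fails the HWE test, which is exactly the information that a MAF filter cannot see.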

    Mendelian randomization:

    • Mendelian randomization: methods for using genetic variants in causal estimation.
    • Two sample Mendelian randomisation (2SMR) is a method to estimate the causal effect of an exposure on an outcome using only summary statistics from genome wide association studies (GWAS). Though conceptually straightforward, there are a number of steps that are required to perform the analysis properly, and they can be cumbersome. The TwoSampleMR package aims to make this easy by combining three important components

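    Linkage disequilibrium score regression (LDSC):

    1. Uses:
      • Estimating heritability
      • Estimating genetic correlation
      • Quantifying the share of test-statistic inflation that is due to confounding
    2. English explanation:
      • LDSC is a command-line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores. In statistical genetics, LD score regression is a technique that aims to quantify the separate contributions of polygenic effects and various confounding factors, such as population stratification, based on summary statistics from genome-wide association studies (GWASs).
      • We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium, in order to distinguish between inflation from a true polygenic signal and inflation from bias.
      • Confounding biases include cryptic relatedness and population stratification.
    3. Further explanation:
      • A GWAS identifies SNPs associated with a phenotype. Strictly speaking, however, the result does not necessarily describe the effect of genetic factors on the phenotype faithfully, because it is produced jointly by two components: polygenic effects, and confounding factors such as population stratification and relatedness. Covariates can correct for population stratification and similar factors in a GWAS, but confounding cannot be removed completely. To make sure the results are accurate, we need to assess the share of each component in the GWAS results. Only when the share due to confounding is low can the results be considered reliable, and LDSC is the tool for estimating that share.
      • LDSC is essentially a linear regression. Its input is the GWAS result. The independent variable is the LD score of each SNP. The dependent variable, which is the core of the method, is a test statistic that follows a chi-square distribution. Fitting the relationship between LD score and the chi-square statistic shows whether the GWAS results are affected by confounding.
      • The intercept of the LD score regression indicates whether confounding is present. An intercept close to 1 indicates no confounding; an intercept clearly above 1 indicates confounding. Because the formula also involves heritability, LDSC can estimate heritability as well. For a GWAS of a single phenotype, LDSC can detect confounding and estimate heritability; for several phenotypes, it can estimate the genetic correlation between them from the corresponding test statistics.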
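      E[χ²_j] = N·h²·ℓ_j / M + 1
    • Here ℓ_j is the LD score of SNP j, N is the total sample size, M is the number of SNPs, and h² is the heritability. N, M, and h² are constants, so the formula shows that the chi-square statistic is a linear function of the LD score with intercept 1. The formula is derived under genetic effects only; if confounding is present, the intercept is no longer 1. The following paper describes LDSC in detail:
      • The paper argues that the intercept from LD score regression is more informative than λGC, because λGC changes with sample size.
      • Its conclusion: "In conclusion, we have developed LD Score regression, a method to distinguish between inflated test statistics from confounding bias and polygenicity."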
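A minimal sketch of the regression idea, assuming the expected chi-square statistic of SNP j is N·h²·ℓ_j/M plus an intercept of 1 (plus a constant when confounding is present). It uses plain unweighted least squares on simulated data; the real ldsc tool uses weighted regression and a block jackknife, so this only illustrates how the intercept and the slope are read. All function names and numbers here are hypothetical.

```python
import random


def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ld_score_fit(ld_scores, chi2, n_samples, n_snps):
    """Return (intercept, heritability estimate) from chi-square vs LD score."""
    intercept, slope = ols(ld_scores, chi2)
    # slope = N * h2 / M, so h2 = slope * M / N
    return intercept, slope * n_snps / n_samples


def simulate(n_snps, n_samples, h2, confounding, seed=0):
    """Simulate LD scores and noisy chi-square statistics (toy model)."""
    rng = random.Random(seed)
    ld = [rng.uniform(1.0, 100.0) for _ in range(n_snps)]
    chi2 = []
    for score in ld:
        mean = n_samples * h2 * score / n_snps + 1.0 + confounding
        # A scaled 1-df chi-square draw whose expectation equals `mean`
        chi2.append(mean * rng.gauss(0.0, 1.0) ** 2)
    return ld, chi2


for confounding in (0.0, 0.5):
    ld, chi2 = simulate(n_snps=20000, n_samples=50000, h2=0.3,
                        confounding=confounding)
    intercept, h2_hat = ld_score_fit(ld, chi2, n_samples=50000, n_snps=20000)
    print("confounding =", confounding,
          "intercept =", round(intercept, 2), "h2 =", round(h2_hat, 2))
```

In this toy model, adding confounding shifts the fitted intercept upward while the slope, and therefore the heritability estimate, stays roughly the same.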

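    Principal component analysis (PCA)
    Explanation (translated from Chinese):
    PCA is a multivariate statistical method that uses a linear transformation to reduce many variables to a smaller number of important ones. In practice, to analyze a problem thoroughly, researchers often propose many related variables (or factors), because each of them reflects some information about the problem to some degree. But when such a multivariable problem is studied with statistical methods, too many variables make it more complex. Naturally, one wants fewer variables that carry more information. In many cases the variables are correlated, and when two variables are correlated, the information they carry about the problem overlaps to some extent. Starting from all the original variables, PCA builds as few new variables as possible such that the new variables are pairwise uncorrelated and keep as much of the original information as possible. PCA was first introduced by K. Pearson for non-random variables and later extended by H. Hotelling to random vectors. The amount of information is usually measured by the sum of squared deviations or by the variance.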

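    • The basic idea of PCA is to transform a set of mutually correlated variables into a new set of mutually uncorrelated variables, and to use a few of the new variables to summarize the main information contained in the original ones.
    • What a principal component is: in short, the principal components are m new variables formed as linear combinations of the original variables X1 to Xm. They are mutually uncorrelated and, taken together, lose no information. They are also called composite variables. PCA on multiple indicators is often used to find a composite index for judging some object or phenomenon, and to interpret the information that index carries, so that the underlying regularities can be revealed more clearly.
      English explanation:
    • Its idea is simple: reduce the dimensionality of a dataset, while preserving as much 'variability' (i.e. statistical information) as possible.
    • Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without extraneous variables to process.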
    • So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while preserving as much information as possible.

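    The PCA algorithm
    To summarize the steps of PCA, suppose there are m data points with n dimensions each.
    1) Arrange the raw data by column into a matrix X with n rows and m columns.
    2) Zero-center each row of X (each row is one attribute) by subtracting the mean of that row.
    3) Compute the covariance matrix C = (1/m) X Xᵀ.
    4) Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors.
    5) Stack the eigenvectors as rows, ordered from top to bottom by decreasing eigenvalue, and take the first k rows to form the matrix P.
    6) Y = PX is the data reduced to k dimensions.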
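The six steps can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not a reference one: the eigenvectors are found by power iteration with deflation (which can fail for repeated eigenvalues or an unlucky start vector), whereas real code would normally call a linear-algebra library. The function names and the five 2-D points are made up for the example.

```python
import math


def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def pca(samples, k, iters=200):
    """Return (P, Y): P is k x n (principal axes), Y = P X is k x m."""
    m, n = len(samples), len(samples[0])
    # Step 1: X has n rows (attributes) and m columns (data points)
    X = [[s[i] for s in samples] for i in range(n)]
    # Step 2: subtract the mean of each row
    centred = []
    for row in X:
        mean = sum(row) / m
        centred.append([v - mean for v in row])
    X = centred
    # Step 3: covariance matrix C = (1/m) X X^T
    C = [[sum(a * b for a, b in zip(X[i], X[j])) / m for j in range(n)]
         for i in range(n)]
    # Steps 4-5: top-k eigenvectors, largest eigenvalue first
    P = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]          # arbitrary start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = matvec(C, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(C, v)))
        P.append(v)
        # Deflation: remove the component that was just found
        C = [[C[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    # Step 6: project the centred data, Y = P X
    Y = [[sum(p[i] * X[i][c] for i in range(n)) for c in range(m)] for p in P]
    return P, Y


points = [(-1, -2), (-1, 0), (0, 0), (2, 1), (0, 1)]
P, Y = pca(points, k=1)
print("first principal axis:", [round(x, 4) for x in P[0]])
print("1-D projection:", [round(y, 4) for y in Y[0]])
```

For these five points the covariance matrix has 6/5 on the diagonal and 4/5 off the diagonal, so the first principal axis points along the diagonal direction (1, 1)/√2.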

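    From the mathematical principles above, we can see some of the strengths and limits of PCA. In essence, PCA takes the directions of largest variance as the main features and decorrelates the data along orthogonal directions, so that the data have no correlation across those directions.
    PCA therefore has limits. It removes linear correlation well, but it cannot handle higher-order dependence. For data with higher-order dependence, Kernel PCA is an option: a kernel function turns the nonlinear dependence into a linear one. That topic is not covered here. PCA also assumes that the main features of the data lie along orthogonal directions; if several directions of large variance are not orthogonal, PCA performs much worse.
    Finally, PCA is a parameter-free technique. Given the same data, and setting data cleaning aside, everyone gets the same result, with no subjective parameters involved. This makes PCA easy to implement generically, but it also means PCA cannot be tuned to a specific problem.
    I hope this article helps readers understand the mathematical basis and implementation of PCA, and with it the situations where PCA applies and where it does not, so that the algorithm can be used better.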

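    Manhattan plot

    A Manhattan plot draws the p-values of all SNPs from a GWAS across the whole genome, from left to right. To make the result easier to read, each p-value is usually converted to -log10(p-value). The height of a site on the Y axis then reflects the strength of its association with the phenotype or disease: the stronger the association (i.e. the lower the p-value), the higher the point. In general, because of linkage disequilibrium (LD), SNPs around a strongly associated site show similar signal strength, which falls off on both sides. For this reason a Manhattan plot shows distinct, regular signal peaks (drawn in red in the figure below). The positions of these peaks are usually what the study really cares about. In GWAS, the p-value threshold is usually 10^-6 or even 10^-8, although it sometimes depends on how the actual data behave.

    [Figure: example Manhattan plot]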

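    Q-Q plot

    Although the QQ plot uses the same data as the Manhattan plot above, it conveys much richer information. Of the two plots, the QQ plot says more about the quality of a GWAS result; it is the more important quality-control plot in GWAS. It is also the main topic of this article.
    The QQ plot has long been a common plot in statistics. A 1968 paper by M. B. Wilk (doi:10.1093/biomet/55.1.1) described how to draw such a plot and what it is used for. QQ plot is short for quantile-quantile plot: a graphical method that compares two probability distributions by comparing their quantiles. It works because, if two distributions are the same, their quantiles are also the same and fall on a single straight line.
    In a GWAS, when the Manhattan plot shows that some SNPs have a strong association signal with the phenotype or disease (for example, p-value < 10^-6 or even 10^-8), we still cannot conclude directly that these sites are significantly associated with the phenotype. This is because variation at sites in the genome usually has two sources: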

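    Interpreting the inflation factor lambda:

    The genomic inflation factor λ is the ratio of the median of the empirically observed distribution of the test statistic to the expected median, and it quantifies how far bulk inflation raises the false-positive rate. In other words, λ is the median of the observed chi-square test statistics divided by the expected median of the chi-square distribution. The expected value of λ is 1. The further the observed λ rises above 1, the more severe the population stratification is likely to be and the more prone the results are to false positives; in that case the correction for population stratification needs to be redone.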
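The definition can be sketched in a few lines, assuming the association tests have 1 degree of freedom and the p-values are two-sided and strictly between 0 and 1. The function name `genomic_inflation` and the example p-values are hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Median observed chi-square (1 df) divided by the expected median."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to chi2 = z**2 with z = inv_cdf(p / 2)
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution, about 0.4549
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median


# Evenly spaced p-values mimic a well-behaved null distribution
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print("lambda, null-like p-values:", round(genomic_inflation(null_like), 3))

# Squaring the p-values makes them systematically too small (inflation)
inflated = [p ** 2 for p in null_like]
print("lambda, inflated p-values:", round(genomic_inflation(inflated), 3))
```

Because λ is based on the median, a handful of truly associated SNPs barely moves it; it reacts to inflation that affects the bulk of the distribution.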

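    Sequencing depth / reads

    With a sequencing depth of 30X and a human genome of about 3 billion bases, I get 90 billion bases. Bases are written as the characters A, T, C, and G, and each base has a quality value that is also written as a character (look up Phred quality scores), so I get 180 billion characters. Because my sequencing strategy is PE150, I get 90 billion / 150 = 600 million reads.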
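The same arithmetic as a short sketch, using the round numbers from the paragraph above (the variable names are just for illustration):

```python
genome_size = 3_000_000_000   # approximate number of bases in the human genome
depth = 30                    # 30X sequencing depth
read_length = 150             # PE150: each read is 150 bases long

total_bases = genome_size * depth         # bases sequenced in total
total_characters = total_bases * 2        # one base call + one quality character
n_reads = total_bases // read_length      # individual reads
n_read_pairs = n_reads // 2               # paired-end: two reads per fragment

print(f"bases: {total_bases:,}")
print(f"characters in FASTQ sequence + quality lines: {total_characters:,}")
print(f"reads: {n_reads:,} ({n_read_pairs:,} pairs)")
```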

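    Why filter on MAF:

    How is the minor allele frequency computed? Suppose the genotypes at a site are AA, AT, or TT. We can compute the allele frequency of A and of T, with qA + qT = 1. The smaller of the two is the minor allele frequency. For example, if qA = 0.3 and qT = 0.7, the MAF of this site is 0.3. The reason for this filter is that a very small MAF, say below 0.02, means that most individuals carry the same genotype at the site. Such sites contribute very little information and increase false positives. In the extreme case MAF is 0: all individuals carry a single genotype, the site contributes no information, and keeping it only adds computation. So sites should be filtered on MAF.


    As can be seen, many sites have an allele frequency of 0, which means they show no variation in the genotyped samples; these sites need to be removed.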

    MAF is the Minor Allele Frequency. It can be used to exclude SNPs which are not informative because they show little variation in the sample set being analyzed. For instance, if a SNP shows variation in only 1 of the 89 individuals, it is not useful statistically and should be removed.
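A minimal sketch of the MAF calculation and filter, assuming biallelic sites with genotypes given as two-letter strings such as "AT". The function names, the example genotypes, and the 0.02 threshold taken from the text above are illustrative only.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF of one biallelic site, e.g. genotypes = ["AA", "AT", "TT"]."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0                      # monomorphic: only one allele observed
    return min(counts.values()) / sum(counts.values())


def passes_maf_filter(genotypes, threshold=0.02):
    """Keep the site only if its MAF reaches the threshold."""
    return minor_allele_frequency(genotypes) >= threshold


# qA = 0.3, qT = 0.7, as in the worked example
common_site = ["AA"] * 2 + ["AT"] * 2 + ["TT"] * 6
# Variation in only 1 of 89 individuals
rare_site = ["AA"] * 88 + ["AT"]

for name, site in [("common", common_site), ("rare", rare_site)]:
    print(name, round(minor_allele_frequency(site), 4), passes_maf_filter(site))
```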

    Epistasis

    In classical genetics, if genes A and B are mutated, and each mutation by itself produces a unique phenotype but the two mutations together show the same phenotype as the gene A mutation, then gene A is epistatic and gene B is hypostatic. For example, the gene for total baldness is epistatic to the gene for brown hair. In this sense, epistasis can be contrasted with genetic dominance, which is an interaction between alleles at the same gene locus. As the study of genetics developed, and with the advent of molecular biology, epistasis started to be studied in relation to quantitative trait loci (QTL) and polygenic inheritance.

    Heritability

    • Heritability is a statistic used in the fields of breeding and genetics that estimates the degree of variation in a phenotypic trait in a population that is due to genetic variation between individuals in that population.[1] It measures how much of the variation of a trait can be attributed to variation of genetic factors, as opposed to variation of environmental factors. The concept of heritability can be expressed in the form of the following question: "What is the proportion of the variation in a given trait within a population that is not explained by the environment or random chance?"
    • Any particular phenotype can be modeled as the sum of genetic and environmental effects:
      Phenotype (P) = Genotype (G) + Environment (E).
    • Heritability measures the fraction of phenotype variability that can be attributed to genetic variation. This is not the same as saying that this fraction of an individual phenotype is caused by genetics. For example, it is incorrect to say that since the heritability of personality traits is about .6, that means that 60% of your personality is inherited from your parents and 40% comes from the environment. In addition, heritability can change without any genetic change occurring, such as when the environment starts contributing to more variation.
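The statement "fraction of phenotype variability" can be written as a variance ratio. This is a sketch that assumes genetic and environmental effects are independent, which the text does not state explicitly; H² here denotes broad-sense heritability.

```latex
\operatorname{Var}(P) = \operatorname{Var}(G) + \operatorname{Var}(E),
\qquad
H^2 = \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)}
```

For example, with Var(G) = 6 and Var(E) = 4 in some population, H² = 6 / 10 = 0.6. If the environment becomes more variable, say Var(E) = 14, then H² drops to 6 / 20 = 0.3 without any genetic change, which is the point made in the last bullet.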

    What does liability mean in GWAS

    • When you use linear mixed models to estimate heritability you assume that the underlying trait is normally distributed, which is called a disease liability scale. For continuous traits this is not a problem, but for binary traits this becomes an issue because you have a 0/1 value for a phenotype and usually there is a higher proportion of cases in the study sample than the general population prevalence of disease, which leads to an ascertainment bias.
    • The population prevalence also varies between populations, for example the prevalence of malaria in one continent is vastly different from another.
    • So to make the heritability estimates comparable and also to correct for the ascertainment bias, the observed scale heritabilities for dichotomous traits are transformed to the liability scale taking population prevalence and the case proportion into account.
    • Some good papers for reading: Falconer D.S. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 1965;29:51–71
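The transformation in the third bullet can be sketched as follows. The formula used here is the commonly cited one, h²_liability = h²_observed · K²(1−K)² / (z² · P(1−P)), where K is the population prevalence, P is the proportion of cases in the sample, and z is the height of the standard normal density at the liability threshold. It is not given in the text above, so treat it as an assumption; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Convert observed-scale heritability of a binary trait to the liability scale."""
    nd = NormalDist()
    threshold = nd.inv_cdf(1.0 - prevalence)   # liability threshold for disease
    z = nd.pdf(threshold)                      # normal density at the threshold
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical study: 1% population prevalence, 50% cases in the sample
print(round(observed_to_liability(0.20, prevalence=0.01, case_fraction=0.5), 3))
# Same observed estimate, no ascertainment (sample matches the population)
print(round(observed_to_liability(0.20, prevalence=0.01, case_fraction=0.01), 3))
```

The two calls use the same observed-scale estimate but different sampling schemes, which shows why prevalence and case proportion both have to enter the conversion.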

    unbiased estimator:

    An unbiased estimator is an accurate statistic that's used to approximate a population parameter. “Accurate” in this sense means that it's neither an overestimate nor an underestimate. If an overestimate or underestimate does happen, the mean of the difference is called a “bias.”
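A small worked example, using the sample variance as the estimator. Instead of simulating, the sketch enumerates every possible sample of size 2 (drawn with replacement) from a tiny made-up population, so the averages below are exact expectations rather than approximations.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3]
true_variance = pvariance(population)        # variance of the whole population

# Every possible sample of size 2 drawn with replacement, all equally likely
samples = list(product(population, repeat=2))

# Average of each estimator over all possible samples = its expected value
expected_unbiased = mean(variance(s) for s in samples)    # divides by n - 1
expected_biased = mean(pvariance(s) for s in samples)     # divides by n

print("true variance:             ", true_variance)
print("E[estimator with n - 1]:   ", expected_unbiased)
print("E[estimator with n]:       ", expected_biased)
print("bias of the n version:     ", expected_biased - true_variance)
```

The version that divides by n − 1 averages out to the true variance, so it is unbiased; the version that divides by n systematically underestimates it.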

    quantitative genetics

    • Quantitative genetics, or the genetics of complex traits, is the study of those characters which are not affected by the action of just a few major genes. Its basis is in statistical models and methodology, albeit based on many strong assumptions. While these are formally unrealistic, methods work.

    What are confounders

    Confounding variables (a.k.a. confounders or confounding factors) are a type of extraneous variable that are related to a study’s independent and dependent variables. A variable must meet two conditions to be a confounder:


    • It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.
    • It must be causally related to the dependent variable.
    • Even if you correctly identify a cause-and-effect relationship, confounding variables can result in over- or underestimating the impact of your independent variable on your dependent variable.

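    How to avoid confounders

    • Restriction: only collect samples that share the same values of the potential confounders.
    • Matching: match samples as closely as possible on the confounders to remove their influence (such as age, sex, level of education).
      Characteristics of matching:
      • allows more samples than restriction
      • but matching on every confounder is still difficult
      • other, unmatched confounders cannot be ruled out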

    statistical control

    If you have already collected the data, you can include the possible confounders as control variables in your regression models. In this way, you control for the impact of the confounding variable.
    Characteristics of statistical control:

    • easy to implement
    • you can do it after data collection
    • but you can only control for variables that you observe directly; other confounding variables that you have not accounted for might remain.
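A toy sketch of what "including the confounder as a control variable" does. The data are constructed by hand so that the outcome depends only on the confounder, while the exposure is merely correlated with it. The adjusted slope is obtained by residualizing both variables on the confounder, which gives the same coefficient as a regression that includes both predictors. All names and numbers are made up.

```python
def ols(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def residuals(x, y):
    """Part of y that is not explained by a linear fit on x."""
    intercept, slope = ols(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def adjusted_slope(exposure, outcome, confounder):
    """Slope of outcome on exposure after controlling for one confounder."""
    rx = residuals(confounder, exposure)
    ry = residuals(confounder, outcome)
    return ols(rx, ry)[1]


confounder = [float(i) for i in range(10)]
wiggle = [1.0, -1.0] * 5
exposure = [c + w for c, w in zip(confounder, wiggle)]   # correlated with confounder
outcome = [2.0 * c for c in confounder]                  # caused by confounder only

print("naive slope:   ", round(ols(exposure, outcome)[1], 3))
print("adjusted slope:", round(adjusted_slope(exposure, outcome, confounder), 3))
```

The naive regression reports a strong effect of the exposure; once the confounder is controlled for, the effect disappears. The same logic is behind adding covariates such as principal components, age, or sex to a GWAS model.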

    summary statistics

    • Fine-mapping: conditional analysis of SNPs
    • causal inference
    • colocalization analysis
    • Multi-trait analysis for "pleiotropy"
    • heritability & genetic correlation
    • multi-omics

    Experimental Artifacts.

    Definition: An experimental artifact is an aspect of the experiment itself that biases measurements. Example: an early experiment finds that the heart rate of aquatic birds is higher when they are above water than when they are submerged.
    Although often used interchangeably, confounds and artifacts refer to two different kinds of threats to the validity of social psychological research.
    Within a given social-psychological experiment, researchers are attempting to establish a relationship between a treatment (also known as an independent variable or a predictor) and an outcome (also known as a dependent variable or a criterion). Usually, but not always, they are trying to prove that the treatment causes the outcome and that differential levels of the treatment lead to differential levels.

    Confounds

    Confounds are threats to internal validity.[1] Confounds refer to variables that should have been held constant within a specific study but were accidentally allowed to vary (and covary with the independent/predictor variable). A confound exists when the treatment influences the outcome, but not for the theoretical reason proposed by the researchers. Confounds may be related to the "reactivity" of the study (e.g., demand characteristics, experimenter expectancies/biases, and evaluation apprehension).
    Suggestions for minimizing confounds include telling participants a believable and coherent cover story (to reduce demand characteristics or to attempt to keep them constant across conditions) and keeping researchers, research assistants, and others who have contact with participants "blind" to the experimental condition to which participants are assigned (to minimize experimenter expectancies/biases).

    Artifacts

    Artifacts, on the other hand, refer to variables that should have been systematically varied, either within or across studies, but that were accidentally held constant. Artifacts are thus threats to external validity. Artifacts are factors that covary with the treatment and the outcome. Campbell and Stanley[2] identify several artifacts. The major threats to internal validity are history, maturation, testing, instrumentation, statistical regression, selection, experimental mortality, and selection-history interactions.
    One way to minimize the influence of artifacts is to use a pretest-posttest control group design. Within this design, "groups of people who are initially equivalent (at the pretest phase) are randomly assigned to receive the experimental treatment or a control condition and then assessed again after this differential experience (posttest phase)".[3] Thus, any effects of artifacts are (ideally) equally distributed in participants in both the treatment and control conditions.
    Principal component analysis (PCA) is an effective means of extracting key information from phenotypically complex traits that are highly correlated while retaining the original information (7, 8). PCA can transform a set of correlated variables into a substantially smaller set of uncorrelated variables as principal components (PCs), which can capture most information from the original data (9). In this study, PCA was performed for rice architecture, and a genome-wide association study (GWAS) using PC scores was utilized to identify genetic factors regulating plant architecture. This approach was validated as effective in identifying causal genes associated with plant architecture.

    Pleiotropy:

    Mechanism. Pleiotropy describes the genetic effect of a single gene on multiple phenotypic traits. The underlying mechanism is genes that code for a product that is either used by various cells or has a cascade-like signaling function that affects various targets.

    linear mixed models:

    A mixed model is a good choice here: it will allow us to use all the data we have (higher sample size) and account for the correlations between data coming from the sites and mountain ranges. We will also estimate fewer parameters and avoid problems with multiple comparisons that we would encounter while using separate regressions.

    Lasso regression:

    Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters).
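How shrinkage produces sparsity can be seen in the simplest special case. The sketch below assumes an orthonormal design matrix and the objective ½·‖y − Xβ‖² + λ·‖β‖₁; under those assumptions the lasso solution is the soft-thresholded least-squares estimate. The function names and the coefficients are hypothetical, and general designs need an iterative solver instead.

```python
def soft_threshold(value, lam):
    """Shrink a coefficient towards zero by lam; set it to zero if |value| <= lam."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_orthonormal(ols_coefficients, lam):
    """Lasso solution for an orthonormal design, given the OLS coefficients."""
    return [soft_threshold(b, lam) for b in ols_coefficients]


ols_coefficients = [3.0, -0.5, 1.2, -2.0]
for lam in (0.0, 1.0, 2.5):
    print("lambda =", lam, "->", lasso_orthonormal(ols_coefficients, lam))
```

As λ grows, every coefficient moves towards zero and the small ones become exactly zero, which is why the fitted model ends up with fewer parameters.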

    General linear model (GLM)

    • Fitted by least squares:
      Y = Xβ + ε

    Multiple linear regression:

    • In essence, multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one explanatory variable.
    • Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
      Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable. MLR is used extensively in econometrics and financial inference.

    Linear mixed model (LMM)

    • Fitted by maximum likelihood.
    • Terms: fixed effects; random effects.
      Y = Xβ + Zu + ε
    • The right-hand side has two parts, called the fixed part (Xβ) and the random part (Zu + ε). Compared with the ordinary linear model, the linear mixed model mainly breaks the original random error down in finer detail.

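    How to understand the linear mixed model?

    Earlier we described how to read analysis of variance (ANOVA) as a model, i.e. the ANOVA model. Take one-way ANOVA as an example: suppose the single factor is occupation and the dependent variable is wage income. The one-way ANOVA model can then be written as:
    y_ij = μ + a_i + ε_ij
    μ is the mean monthly income of all respondents
    a_i is the effect of occupation i on mean monthly income
    ε_ij is the random error of respondent j relative to the mean monthly income of occupation i
    y_ij is the income of respondent j in occupation i

    Reading ANOVA as a model is therefore the more precise approach. For a recap, see the article "SPSS analysis techniques: interpreting one-way ANOVA results as a model".

    The earlier ANOVA posts introduced many types of ANOVA step by step: one-way ANOVA, multi-way ANOVA, and ANOVA with random factors and covariates. What if all of these appear in one analysis? Today we introduce the most basic mixed-effects model, the linear mixed model, which is one of the basic models for this situation.
    Video: https://www.youtube.com/watch?v=zM4VZR0px8E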

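    The linear mixed model

    The linear mixed model is more complex than the ANOVA models above, so we explain it with an example. Suppose we have data on 5,000 students from 100 schools, with the following variables:
    student ID, school name, school type, seat number, sex, entrance score, final exam (zhongkao) score
    Suppose the goal is to build a regression equation for the final exam score with the entrance score as the independent variable. Following the standard ANOVA-model approach, the entrance score (interval data) is a covariate, while school (100 schools), school type (boys' school, girls' school, military-style school), and sex (male, female) are factors. Some of these factors are fixed and some are random.
    Suppose we consider only the school factor (school) and the entrance score (Rscores) when building the regression model for the final exam score. If school is treated as a fixed factor (100 schools), the model is:
    y_ij = μ + Rscores + school_j + ε_ij
    y_ij is the final exam score of a student
    Rscores is the effect of the student's entrance score (the student's baseline) on the final exam score
    school_j is the effect of the school on the student's final exam score
    ε_ij is the random error between students

    Rewritten as a regression model:
    y_ij = a + β_1·Rscores_ij + Σ_j β_j·school_j + e_ij
    β_1 is the effect of the entrance score (regression coefficient)
    β_j is the effect of school j on the final exam score
    e_ij is the random error of student i in school j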

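    The regression equation above looks fine, but from another angle it ignores a lot of deeper information. Consider the two plots below:

    [Figure: two scatter plots of final exam score against entrance score. Left: one school. Right: four schools, each with its own trend line.]

    The left scatter plot contains data from a single school; the right one contains data from 4 schools. The trend lines show that the differences in final exam score (the dependent variable) caused by the school factor include differences in intercept as well as differences in slope.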

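    If we consider a single school only, the regression model can be written as:

    y_i = α + β_1·Rscores_i + e_i
    where the subscript i denotes student i. For a single school this model is adequate, but problems appear once several schools are considered together. The right-hand plot shows that teaching quality differs between schools. In other words, the scores of students in the same school are not independent: students in good schools tend to score higher, and students in weak schools tend to score lower.

    The right-hand plot contains data from four schools, and the four regression lines have different intercepts. This difference reflects the difference in teaching quality between schools: students with the same entrance score may have different expected final exam scores after studying in different schools. To allow for variation in the intercept, the model is extended to:

    y_ij = (a_0 + u_0j) + β_1·Rscores_ij + e_ij
    y_ij is the final exam score of student i in school j
    a_0 is the overall average level across schools
    u_0j is the variation in final exam score caused by differences between schools
    Rscores_ij is the entrance score, i.e. the student's baseline
    β_1 is the effect of the student's baseline on the final exam score
    e_ij is the random error between students

    The right-hand plot also shows that the slopes of the regression lines differ, not only the intercepts. Clustering of scores within schools shows up not only as different average levels but also as different spread of scores between schools, i.e. in how strongly the entrance score affects the final exam score. In schools with a steep slope the effect is strong; in schools with a shallow slope it is weak. The model therefore needs a further extension:
    y_ij = (a_0 + u_0j) + (β_1 + u_1j)·Rscores_ij + e_ij
    u_1j is the school-specific deviation in the effect of the entrance score on the final exam score
    Rearranging the expression gives:
    y_ij = (a_0 + β_1·Rscores_ij) + (u_0j + u_1j·Rscores_ij + e_ij)
    The right-hand side has two parts, called the fixed part and the random part. Compared with the ordinary linear model, the linear mixed model mainly breaks the original random error down in finer detail.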
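The rearrangement into a fixed and a random part can be checked numerically. The sketch below plugs made-up values into both forms of the model; the parameter names mirror the symbols above (a_0, β_1, u_0j, u_1j, e_ij), and the schools and numbers are hypothetical.

```python
def exam_score(a0, beta1, u0j, u1j, rscore, e_ij):
    """Extended model: school-specific intercept (a0 + u0j) and slope (beta1 + u1j)."""
    return (a0 + u0j) + (beta1 + u1j) * rscore + e_ij


def fixed_and_random(a0, beta1, u0j, u1j, rscore, e_ij):
    """The same score split into its fixed part and its random part."""
    fixed_part = a0 + beta1 * rscore
    random_part = u0j + u1j * rscore + e_ij
    return fixed_part, random_part


a0, beta1 = 40.0, 0.5                       # shared by all schools (fixed effects)
schools = {                                 # school-specific deviations (random effects)
    "school_1": (5.0, 0.25),
    "school_2": (-3.0, -0.125),
}

for name, (u0j, u1j) in schools.items():
    y = exam_score(a0, beta1, u0j, u1j, rscore=80.0, e_ij=1.5)
    fixed_part, random_part = fixed_and_random(a0, beta1, u0j, u1j, 80.0, 1.5)
    print(name, "score:", y, "fixed:", fixed_part, "random:", random_part)
```

The fixed part is identical for two students with the same entrance score, whichever school they attend; everything that differs between schools and between students sits in the random part.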

    • Introduction to GSA

    Gene Set Analysis (GSA) in GWAS performs association analysis at the level of genes or pathways. It builds on SNP-level GWAS results and digs deeper at a higher level to find more useful information. MAGMA is one tool for performing GSA.

    About MAGMA:

    MAGMA is a tool for gene analysis and generalized gene-set analysis of GWAS data. It can be used to analyze both raw genotype data and summary SNP p-values from a previous GWAS or meta-analysis.

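    A MAGMA analysis has three steps:

    1. annotation: map SNPs to genes by chromosomal position.
      • SNP location file, in one of two forms:
        • a PLINK file with the .bim suffix, used directly;
        • plain text, whose first three columns must be SNP ID, chromosome, and position.
      • Gene location file, corresponding to gene-loc. For human, the official site provides this file for three genome builds. Its columns are:
        • column 1: Entrez ID of the gene
        • column 2: chromosome
        • column 3: transcription start
        • column 4: transcription end
        • column 5: strand of the gene
        • column 6: gene symbol
        The first four columns are required.
      • A successful run produces a file with the suffix genes.annot: column 1 is the Entrez ID of the gene, column 2 is the chromosomal position, and the remaining columns are the IDs of the mapped SNPs.
    2. gene analysis
      The software supports two modes:
      • the first starts directly from the raw genotype data;
      • the second starts from GWAS results, i.e. SNP p-values.
      • The gene-annot parameter takes the SNP-to-gene mapping produced in step 1. The pval parameter takes the SNP p-values, with the SNP ID in column 1 and the p-value in column 2. The output file has the suffix genes.out, and a file with the suffix genes.raw is also produced for the subsequent gene-set analysis.
    3. gene set analysis
      • Gene sets are analyzed on the basis of the gene analysis:
      • The gene-results parameter takes the file produced in step 2, and set-annot specifies the gene sets. SET1 is the name of a gene set, which can be a pathway ID; the genes in the set are given as Entrez IDs. The output has the suffix .gsa.out.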
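The idea behind the annotation step can be sketched in a few lines. This is not MAGMA's implementation, only an illustration of mapping SNPs to genes by position: the function name `annotate_snps`, the optional `window` around each gene, and all IDs and coordinates are hypothetical.

```python
def annotate_snps(snps, genes, window=0):
    """Map SNPs to genes by chromosomal position.

    snps:  list of (snp_id, chromosome, position)
    genes: list of (gene_id, chromosome, start, stop)
    Returns {gene_id: [snp_id, ...]} for genes with at least one SNP.
    """
    mapping = {}
    for gene_id, gene_chrom, start, stop in genes:
        hits = [snp_id for snp_id, chrom, pos in snps
                if chrom == gene_chrom and start - window <= pos <= stop + window]
        if hits:
            mapping[gene_id] = hits
    return mapping


# Made-up SNP and gene locations
snps = [("rs1", "1", 150), ("rs2", "1", 900), ("rs3", "2", 150)]
genes = [("1001", "1", 100, 200), ("1002", "2", 500, 600)]

print(annotate_snps(snps, genes))
print(annotate_snps(snps, genes, window=400))
```

The second call widens each gene by 400 bases on both sides, so a SNP that lies just outside a gene body can still be assigned to it.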

    Human reference genome:

    • GRCh37 vs. GRCh38: both are human genome assemblies by the Genome Reference Consortium (GRC). GRCh38 (also called "build 38") was released four years after the GRCh37 release in 2009, so it can be viewed as a version with updated annotations to the earlier assembly.
    • Primarily, there are three updates in the GRCh38 version:
      • Repair of incorrect reads
      • Inclusion of model centromere sequences
      • Addition of alternate loci

    Related websites and software:

    [Figure: overview of GWAS-related websites and software]

    Reference: 10 Years of GWAS Discovery: Biology, Function, and Translation

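    GWAS has two major pitfalls:
    Pitfall 1: the association results are false positives (there are results, but they are wrong);
    Pitfall 2: the target trait is controlled by many genes, each with an effect too weak to detect, so no significantly associated site is found (there are no results at all).
    Common ways to deal with these two pitfalls:
    • Increase the sample size to raise statistical power.
    • Improve the phenotyping system.
      • Improve the precision of phenotyping;
      • Assess the phenotype with multi-dimensional methods, for example metabolomics.
    • Make full use of prior information.
      • Use candidate genes or known reference genes to justify a lower threshold.
    • Pay attention to controlling and optimizing the statistical model.
      • Correct for population structure, relatedness, and outlier samples;
      • Account for other factors, for example sex, daily routine, and age.
    • Validate candidate genes in multiple stages.
      • Stage I: use a relaxed threshold to obtain candidate sites;
      • Stages II to n: validate in independent populations.
    • Use gene-based or pathway-based association analysis to raise statistical power.
    • Add more omics data for joint analysis, for example transcriptome and epigenome data.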

    TWAS

    References on TWAS:
    • Opportunities and challenges for transcriptome-wide association studies
    • Integrative approaches for large-scale transcriptome-wide association studies

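    Mendelian randomization
    The Mendelian randomization (MR) study design follows Mendel's law that parental alleles are randomly assigned to offspring. If a genotype determines a phenotype, and the genotype is associated with disease through that phenotype, then the genotype can be used as an instrumental variable to infer the association between the phenotype and the disease.

    The three assumptions shown in the schematic:

    • The SNP is associated with the exposure.
    • The SNP is not associated with confounding variables.
    • The SNP is associated with the outcome only through the exposure.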
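With summary statistics, the causal estimate from one SNP is commonly the Wald ratio (SNP-outcome effect divided by SNP-exposure effect), and several SNPs are commonly combined with a fixed-effect inverse-variance weighted (IVW) estimate. The sketch below implements these two textbook formulas; it is not the TwoSampleMR API, the standard error of the Wald ratio uses a first-order approximation, and the effect sizes are invented.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal estimate from a single SNP and its approximate standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)    # first-order approximation
    return estimate, se


def ivw(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted estimate across several SNPs."""
    weights = [1.0 / s ** 2 for s in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx ** 2 for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical SNP effects on the exposure and on the outcome
beta_exposure = [0.10, 0.20, 0.30]
beta_outcome = [0.05, 0.10, 0.15]
se_outcome = [0.01, 0.02, 0.01]

for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
    print("Wald ratio:", wald_ratio(bx, by, se))
print("IVW:", ivw(beta_exposure, beta_outcome, se_outcome))
```

Both estimates are only meaningful if the three assumptions above hold for every SNP used as an instrument.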

    Original link: https://www.haomeiwen.com/subject/wwixbltx.html