美文网首页
GWAS | 最佳线性无偏估计量 BLUE值计算(多年单点有重复

GWAS | 最佳线性无偏估计量 BLUE值计算(多年单点有重复

作者: iBioinformatics | 来源:发表于2023-04-19 06:48 被阅读0次

    GWAS中使用均值、BLUP值还是BLUE值,邓飞老师有两篇帖子进行了介绍,链接如下,根据我的数据,我选择了BLUE值。在这里提醒大家,不要一上来就忽略所有warning,很容易找不到问题。

    # 一些教程中,会在开头忽略所有的警告,即以下的code,建议不要这么做
    options(lmerControl=list(check.nobs.vs.rankZ = "warning",
                             check.nobs.vs.nlev = "warning",
                             check.nobs.vs.nRE = "warning",
                             check.nlev.gtreq.5 = "warning",
                             check.nlev.gtr.1 = "warning"))
    
    

    GWAS和GS使用BLUE值还是BLUP值作为表型值?_邓飞----育种数据分析之放飞自我-CSDN博客
    科学网-混合线性模型中BLUE值 VS BLUP值-邓飞的博文 (sciencenet.cn)
    不想看原理的话,就是下面这句总结:
    “整体而言,BLUP值会向均值收缩(shrinkage),虽然结果是最佳预测,但是校正值的方差变小,当你做GWAS时,不容易找到显著性位点,增加了噪音(noise)。而且在GWAS中,品种是作为随机因子,如果你使用BLUP值,相当于进行了两次收缩(shrinkage)。因此, 比较好的方式是,在one-stage中,将地点,年份,区组作为随机因子,将品种作为固定因子,计算BLUE值”。

    1.数据导入

    数据中包含188个品种,11个性状,两年,每年分别三个区组,每个区组内5个单株重复的数据。

    > setwd("D:/GWAS_phe")
    > raw_data <- read.table("phe_raw.txt", header = T, check.names = F, sep = "\t")
    > dat=raw_data
    > library('magrittr')
    > head(dat)
      Cul Rep Year    TL   PA    SA    AD     V  NT  UFW NR   AL  UDW   MTL
    1   1   1 2017 40.37 2.81  8.82 0.695 0.150  94 0.26 58 0.70 0.05 90.68
    2   1   1 2017 62.99 4.33 13.62 0.688 0.230 139 0.54 42 1.50 0.04 90.68
    3   1   1 2017 90.68 7.30 22.93 0.805 0.460 115 0.36 38 2.39 0.03 90.68
    4   1   1 2017 42.09 4.36 13.71 0.837 0.360  89 0.38 22 1.91 0.03 90.68
    5   1   1 2017 57.25 5.97 18.75 0.943 0.490  77 0.64 36 1.59 0.03 90.68
    6   1   2 2017 25.30 1.76  5.54 0.697 0.097  30 0.23 24 1.05 0.02 35.30
    > for(i in 1:3) dat[,i] = as.factor(dat[,i])  #前三列设置为factor
    > str(dat) 
    'data.frame':   5640 obs. of  14 variables:
     $ Cul : Factor w/ 188 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ Rep : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
     $ Year: Factor w/ 2 levels "2017","2020": 1 1 1 1 1 1 1 1 1 1 ...
     $ TL  : num  40.4 63 90.7 42.1 57.2 ...
     $ PA  : num  2.81 4.33 7.3 4.36 5.97 1.76 2.53 1.77 2.72 2.47 ...
     $ SA  : num  8.82 13.62 22.93 13.71 18.75 ...
     $ AD  : num  0.695 0.688 0.805 0.837 0.943 ...
     $ V   : num  0.15 0.23 0.46 0.36 0.49 0.097 0.168 0.141 0.191 0.136 ...
     $ NT  : int  94 139 115 89 77 30 37 21 34 29 ...
     $ UFW : num  0.26 0.54 0.36 0.38 0.64 0.23 0.22 0.3 0.26 0.24 ...
     $ NR  : int  58 42 38 22 36 24 28 33 23 18 ...
     $ AL  : num  0.7 1.5 2.39 1.91 1.59 1.05 1.06 0.53 1.33 1.96 ...
     $ UDW : num  0.05 0.04 0.03 0.03 0.03 0.02 0.02 0.03 0.02 0.02 ...
     $ MTL : num  90.7 90.7 90.7 90.7 90.7 ...
    
    

    2. R包lme4实现blue值计算

    2.1 单性状blue值计算 (以TL为例)

    > library('lme4')
    > m = lmer(TL ~ Cul + (1|Rep)+(1|Year:Rep) + (1|Year), data = dat)
    #固定因子:Cul
    #随机因子:Year + Rep + Year:Rep
    boundary (singular) fit: see ?isSingular
    
    

    此时出现了报错boundary (singular) fit: see ?isSingular,边界畸形拟合。
    在这个地方我卡了一周,完全不知道是哪里出了问题,后来发现,是Rep!Rep不是单独存在的因子,所以不能出现单独的Rep。
    之后参照教程,模型更改为以下写法:

    > m = lmer(TL ~ Cul + (1|Year) + (1|Rep%in%Year) + (1|Cul:Year), data = dat)
    错误: grouping factors must have > 1 sampled level
    
    

    再一次报错grouping factors must have > 1 sampled level,只适用于因子大于一组的模型。继续查资料,Rep%in%Year,这个写法是嵌套模型,而Rep表面来看是重复,但是实际是区组效应,很大一部分是由于地块间/地块内的微环境造成的,和年份应该是交叉或互作效应,并不存在嵌套。
    再次修改模型,如下:

    > m1 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year), data = dat)
    
    

    没有再出现报错了!!激动!!!

    2.2 模型检验

    解决了Rep的问题后,又想到了一个新的问题,各个因子之间的交互关系考不考虑?如果考虑的话,考虑多少?所以,引入了模型检验,根据数据,我列出了以下几种模型:

    > m1 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year), data = dat)
    > m2 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul), data = dat)
    > m3 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Rep:Cul), data = dat)
    > m4 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul) +(1|Rep:Cul), data = dat)
    > m5 = lmer(TL ~ Cul +(1|Year:Rep) + (1|Year) +(1|Year:Cul) +(1|Rep:Cul) +(1|Rep:Cul:Year), data = dat)
    
    

    五种模型都没有报错,那到底哪一个模型更好呢?可以用anova方法进行模型间的比较:

    > anova(m1,m2,m3,m4,m5)
    refitting model(s) with ML (instead of REML)
    Data: dat
    Models:
    m1: TL ~ Cul + (1 | Year:Rep) + (1 | Year)
    m2: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul)
    m3: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Rep:Cul)
    m4: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul) + (1 | Rep:Cul)
    m5: TL ~ Cul + (1 | Year:Rep) + (1 | Year) + (1 | Year:Cul) + (1 | Rep:Cul) + (1 | Rep:Cul:Year)
       npar   AIC   BIC logLik deviance   Chisq Df Pr(>Chisq)    
    m1  191 38747 39981 -19182    38365                          
    m2  192 37937 39177 -18776    37553 811.918  1  < 2.2e-16 ***
    m3  192 38749 39989 -19182    38365   0.000  0               
    m4  193 37866 39113 -18740    37480 885.047  1  < 2.2e-16 ***
    m5  194 37768 39022 -18690    37380  99.287  1  < 2.2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    

    上述结果看出,m2和m4、m5的效应达到极显著水平,m5的AIC值和BIC值也最低,因此,最优模型为m5。

    2.3 计算

    > as.data.frame(fixef(m5))
                   fixef(m5)
    (Intercept)  46.85666666
    Cul2        -13.87600506
    Cul3         -9.26492182
    Cul4        -22.73966667
    Cul5        -14.38666667
    Cul6        -19.41466667
    Cul7          4.93466667
    Cul8        -13.18400000
    Cul9         -9.34900000
    Cul10        -3.54966667
    
    

    植物中,需要使用lsmeans添加截距

    > library('lsmeans')
    > re=lsmeans(m5,"Cul")
    Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
    To enable adjustments, add the argument 'pbkrtest.limit = 4733' (or larger)
    [or, globally, 'set emm_options(pbkrtest.limit = 4733)' or larger];
    but be warned that this may result in large computation time and memory use.
    Note: D.f. calculations have been disabled because the number of observations exceeds 3000.
    To enable adjustments, add the argument 'lmerTest.limit = 4733' (or larger)
    [or, globally, 'set emm_options(lmerTest.limit = 4733)' or larger];
    but be warned that this may result in large computation time and memory use.
    > re
     Cul lsmean   SE  df asymp.LCL asymp.UCL
     1     46.9 11.7 Inf    23.857      69.9
     2     33.0 11.7 Inf     9.967      56.0
     3     37.6 11.7 Inf    14.579      60.6
     4     24.1 11.7 Inf     1.117      47.1
     5     32.5 11.7 Inf     9.470      55.5
     6     27.4 11.7 Inf     4.442      50.4
     7     51.8 11.7 Inf    28.791      74.8
     8     33.7 11.7 Inf    10.673      56.7
     9     37.5 11.7 Inf    14.508      60.5
     10    43.3 11.7 Inf    20.307      66.3
    Degrees-of-freedom method: asymptotic 
    Confidence level used: 0.95 
    
    

    结果中的lsmeans即为各品种该性状的BLUE值。

    2.4 结果输出

    > blue.df<-data.frame(re)[,1:2]
    > colnames(blue.df)<-c('Cul','TL')
    > head(blue.df)
      Cul       TL
    1   1 46.85667
    2   2 32.98066
    3   3 37.59174
    4   4 24.11700
    5   5 32.47000
    6   6 27.44200
    > write.table(blue.df, file = "TL_blue.txt", row.names = F, col.names = T, quote = F, sep = "\t")
    
    

    转自:https://www.jianshu.com/p/f4f1b5b75830

    相关文章

      网友评论

          本文标题:GWAS | 最佳线性无偏估计量 BLUE值计算(多年单点有重复

          本文链接:https://www.haomeiwen.com/subject/nyzzddtx.html