美文网首页生信科研信息学
给女朋友写的生统资料_Part18

给女朋友写的生统资料_Part18

作者: 城管大队哈队长 | 来源:发表于2019-06-18 21:33 被阅读108次

    apply和function

    差异基因的检验估计会用到function和apply。不过差异基因表达的function和apply估计也不会有太大的难度,所以我这里就直接放上《R语言实战》的例子。详细地题目应对我还是放到后面具体题目(当然也是因为我R语言太菜了……)

    function

    function可以去看下《R语言实战》第二版的P436,我这里截个图

    18_1.png

    function的基本语法为

    functionname <- function(parameters){
      statements
      return(value)
    }
    

    functionname就是你的函数名字,你也可以写成wodehanshu这种名字。

    parameters里面就是设定你要用到的参数

    statement就是你执行的一系列操作,一般来说你这一系列操作会产生一个值。然后用return返回

    例子的话,上面已经举了。

    apply

    apply的使用我还是放一个《R语言实战》的例子,在P95和P96

    18_2.png 18_3.png

    差异基因表达

    差异基因估计是用t-test做检验,然后用p.adjust做矫正(虽然照理说不能用t-test做)。我这里就不详细地讲了。举几个例子吧

    题目1

    我们第4次作业的第四题

    Type 1 diabetes is a multigenic disease caused by T-cell mediated destruction of the insulin producing β-cells. Although conventional (targeted) approaches of identifying causative genes have advanced our knowledge of this disease, many questions remain unanswered.

    Here we have a gene data from NOD mouse after(case) and before(control) treatment. The data can be found in “Data.txt”.Use the information mentioned above to answer the following questions:

    1.use paired t-test to find genes which have significant expression (p<0.05) between case and control sample. Give the number of differential expressed genes and give the names of top 10 significantly differential expression genes. hint: “apply(data,1,function(x){…})” can apply function to every row in data more quickly than “for{}”, “names()” or “rownames()” can be used to extract names of differentially expressed genes

    这里是输入了基因表达的数据。分别是10个control和10个case,然后做配对的t-test。我们可以手动地对每个基因(即每一列)对t-test。但这样太麻烦,所以我们就可以用apply。apply可以对每一列做批量化的操作。

    # 读取和整理数据
    # paste0这个操作就是把“gene”这个字符串和对应的列名粘起来,这样使得列名比较直观一点
    test4 <- read.table("rawdata/test4.txt",header = T)
    rownames(test4) <- paste0("gene",rownames(test4))
    
    # 看下数据(我后来发现没有gene1了。。。。。但我最后的答案跟作业答案一样,应该是原始文件的锅)
    > head(test4)
          control1 control2 control3 control4 control5 control6  control7 control8 control9 control10    case1    case2
    gene2 6.917468 6.308350 5.318841 5.886811 5.082975 5.629453  4.919035 3.134226 4.242564  5.783208 3.574525 5.371273
    gene3 7.862730 7.065809 6.783732 6.275773 3.063104 5.131017  3.708938 2.766133 6.755942  3.954392 7.163509 5.600665
    gene4 7.425996 7.707939 7.550885 5.708736 5.900326 6.888329  5.987753 5.948979 7.281995  6.758037 5.367602 7.898202
    gene5 6.544029 8.364054 7.239746 9.037322 8.783692 8.863040 10.602350 9.046190 9.007177  7.866627 8.079780 8.128662
    gene6 4.986893 7.248867 7.098780 6.787237 6.252609 6.428099  7.758556 6.726613 7.123883  6.584567 6.642005 7.376994
    gene7 6.995863 6.448682 6.136518 7.098878 4.984325 5.896835  6.488555 5.299349 5.524948  6.763232 7.004367 6.478264
             case3    case4    case5    case6    case7    case8    case9   case10
    gene2 7.217279 6.786191 5.523913 5.355584 6.693710 4.364467 6.270432 6.993901
    gene3 6.079167 7.125393 5.169236 4.524597 4.715383 2.946439 4.200206 5.798403
    gene4 6.321147 7.810430 6.231793 6.367243 7.579535 7.207376 7.548390 6.522594
    gene5 8.716447 9.940214 9.224614 9.139514 9.429897 9.373779 9.287744 9.275545
    gene6 7.080039 7.100746 6.453575 6.618557 7.614747 7.141351 6.076726 7.369737
    gene7 6.273696 6.413472 4.983931 5.629765 5.213316 5.931045 5.776625 6.522169
    
    # 构建函数(我个人可能比较推荐先构建函数,然后把函数放到apply中,这样可能比较直观一点)
    # 我输入数据为x,然后我提取1:10个数为data1,11:20个数为data2。然后做t.test
    myFun <- function(x){
      data1 <- x[1:10]
      data2 <- x[11:20]
      t_result <- t.test(data1,data2,paired = T)
      p_value <- t_result$p.value
      return(p_value)
    }
    
    # 利用apply,一行行地把数据放入myFun里面,一行行地做t.test
    # 输出应该是个向量
    test4_result <- apply(test4_paired_result, 1, result)
    
    # 也可以直接一行代码
    apply(test4, 1, function(x) {t.test(x[1:10],x[11:20],paired = T)$p.value})
    
    # 数下差异基因的数量
    > sum(test4_result < 0.05)
    [1] 2296
    
    # 输出前10个
    ## 先对得到的p-value进行排序
    test4_result[order(test4_result)]
    
    ## 然后得到前10
    > test4_result[order(test4_result)][1:10]
       gene28801    gene27868    gene27438    gene21642    gene24019    gene12323    gene12962    gene28939     gene2387 
    4.277344e-06 6.093302e-05 8.229606e-05 9.889894e-05 1.124717e-04 1.261658e-04 1.298200e-04 1.462493e-04 1.522920e-04 
       gene18712 
    1.601297e-04 
    

    2.Adjust the p-values in question a) with bonferroni and FDR method to find differentially expressed genes in stringent way( list the differentially expressed gene names and the adjusted p-value)

    矫正的话,直接用p.adjust就可以了。这里再用sum或者mean数一数差异基因的个数

    test4_adjust_bonf <- p.adjust(test4_result, method = "bonferroni")
    mean(test4_adjust_bonf < 0.05)
    
    test4_adjust_fdr <- p.adjust(test4_result, method = "fdr")
    mean(test4_adjust_fdr < 0.05)
    

    题目2

    把题目中的paired数据变成非配对的数据。那么我们就要多一步方差齐性检验了。

    # 在函数里面做判断
    myFun <- function(x){
      data1 <- x[1:10]
      data2 <- x[11:20]
      if (var.test(data1,data2)$p.value > 0.05){
        t_result <- t.test(data1,data2,var.equal = T)
        p_value <- t_result$p.value
        return(p_value)
      } else {
        t_result <- t.test(data1,data2,var.equal = F)
        p_value <- t_result$p.value
        return(p_value)
      }
    
    }
    
    test4_result <- apply(test4, 1, myFun)
    

    题目3

    来自17年考试的第5题

    Nonalcoholic steatohepatitis or NASH is a common liver disease, and was found to be linked to obesity and diabetes, suggesting an important role of adipose tissue in the pathogenesis of NASH. Therefore, the mouse model was used to investigate the interaction between adipose tissue and liver. Wildtype male C57Bl/6 mice were fed low fat food (LFD) or high fat food (HFD) for 21 weeks. The detailed data can be found in GDS4013.txt. Hint:provide the R code and result calculated with R. (本题共20分)

    1.Read the data. Check whether there are NAs in the data? If NAs exist, you should deaf with them before next steps. Hint: delete the rows with NAs in data. (5分)

    # 先读取数据
    > test <- read.table("rawdata/GDS4013.txt",header = T)
    > head(test)
                  LFD    LFD.1    LFD.2    LFD.3    LFD.4    LFD.5    LFD.6    LFD.7    LFD.8    LFD.9      HFD    HFD.1    HFD.2
    COPG1    7.756187 7.513726 7.927167 7.742323 7.974731 7.632936 7.419638 8.022296 7.649602 7.694316 7.717317 7.721638 7.873688
    ATP6V0D1 8.968808 9.004967 9.025234 9.139161 9.142125 8.998360 9.097204 9.181852 9.127648 9.149562 9.029930 8.948781 9.095943
    GOLGA7   9.698317 9.798994 9.765514 9.581075 9.776361 9.829636 9.670068 9.675262 9.812124 9.637824 9.625283 9.822128 9.700576
    PSPH     6.897664       NA 6.744140 7.225862 7.343744 6.877918 6.422594 7.188237 6.819006 6.524627 6.514469 6.949181 7.044664
    TRAPPC4  5.406903 5.473714 5.749002 5.416623 5.351018 5.431658 5.606044 5.401849 5.374199 5.331412 5.439976 5.487543 5.290085
    DPM2     6.532734 6.215751 6.520986 6.143827 6.597586 6.731978 6.553594 6.659309 6.288114 6.561733 6.819006 6.616581 6.349064
                HFD.3    HFD.4    HFD.5    HFD.6    HFD.7
    COPG1    7.637193 7.570261 7.549871 7.687248 7.603990
    ATP6V0D1 9.092459 8.981442 9.014060 9.122049 9.000419
    GOLGA7   9.739647 9.679455 9.790654 9.817816 9.761968
    PSPH     6.708345 6.413345 6.617198 6.843793 6.467956
    TRAPPC4  5.485993 5.376865 5.522681 5.430362 5.490806
    DPM2     6.554808 6.416926 6.562119 6.524627 6.771978
    
    # 利用table看看缺失值的多少
    > table(is.na(test))
    
     FALSE   TRUE 
    481694      4 
    
    # 利用na.omit
    test_new <- na.omit(test)
    > table(is.na(test_new))
    
     FALSE 
    481626 
    
    
    

    2.Find differentially expressed genes (DEGs) (up-regulated, HF>LF) between LFD and HFD samples, list the DEGs with adjusted p-values less than 0.05. For expressions of each gene, check the homogeneity of the variances in two groups at first. Hint: FDR should be used to adjust p-values.

    # 因为不是配对检验,所以我们首先要做的是检验方差齐性
    ## 所以我们需要先构建一个函数来检验方差的齐性
    myFun_var <- function(data){
      LFD <- data[1:10]
      HFD <- data[11:18]
      var_result <- var.test(LFD,HFD)
      p_value <- var_result$p.value
    }
    
    ## 应用apply来得到p-value
    var_pvalue <- apply(test_new, 1, myFun_var)
    
    ## 然后往表达矩阵上多加一列p-value,得到有附加信息的表达矩阵
    > test_new_var <- cbind(test_new,var_pvalue)
    > head(test_new_var)
                   LFD    LFD.1    LFD.2     LFD.3     LFD.4     LFD.5     LFD.6     LFD.7     LFD.8     LFD.9       HFD     HFD.1
    COPG1     7.756187 7.513726 7.927167  7.742323  7.974731  7.632936  7.419638  8.022296  7.649602  7.694316  7.717317  7.721638
    ATP6V0D1  8.968808 9.004967 9.025234  9.139161  9.142125  8.998360  9.097204  9.181852  9.127648  9.149562  9.029930  8.948781
    GOLGA7    9.698317 9.798994 9.765514  9.581075  9.776361  9.829636  9.670068  9.675262  9.812124  9.637824  9.625283  9.822128
    TRAPPC4   5.406903 5.473714 5.749002  5.416623  5.351018  5.431658  5.606044  5.401849  5.374199  5.331412  5.439976  5.487543
    DPM2      6.532734 6.215751 6.520986  6.143827  6.597586  6.731978  6.553594  6.659309  6.288114  6.561733  6.819006  6.616581
    PSMB5    10.217143 9.992716 9.974211 10.201814 10.244172 10.023191 10.072604 10.147738 10.267069 10.024750 10.142353 10.119065
                 HFD.2     HFD.3     HFD.4     HFD.5     HFD.6     HFD.7 var_pvalue
    COPG1     7.873688  7.637193  7.570261  7.549871  7.687248  7.603990  0.1119611
    ATP6V0D1  9.095943  9.092459  8.981442  9.014060  9.122049  9.000419  0.5807521
    GOLGA7    9.700576  9.739647  9.679455  9.790654  9.817816  9.761968  0.6524620
    TRAPPC4   5.290085  5.485993  5.376865  5.522681  5.430362  5.490806  0.1770907
    DPM2      6.349064  6.554808  6.416926  6.562119  6.524627  6.771978  0.6063077
    PSMB5    10.086752 10.240642 10.086752 10.246507 10.291952 10.129569  0.3887486
    
    # 得到了方差检验的p-value,我们就可以来把这个条件加入到t.test里面,然后做t-test了
    ## 首先也要构建t-test的函数
    
    myFun_t_test <- function(data){
      LFD <- data[1:10] # 1:10列为LFD
      HFD <- data[11:18]    # 11:18列为HFD
      var_result <- data[19]    # 19列为方差齐性的p-value
      t_result <- t.test(LFD,HFD,alternative = "less",var.equal = var_result > 0.05) ## 这里var_result是p-value,p-value如果大于0.05了(方差齐性),就会输出TRUE。
      p_value <- t_result$p.value
    }
    
    ## 利用apply,一行行地输入函数里面,做t-test
    t_test_pvalue <- apply(test_new_var, 1, myFun_t_test)
    
    # 得到了p-value之后,就用fdr方法做矫正
    > t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
    > t_test_padjust[t_test_padjust < 0.05]
        SEMA5B   BC048403 
    0.01698630 0.04004457 
    

    大家可能会奇怪这里为什么是less了,一般我们不都是two.side么

    选择单双边实际上是跟你的假设有关系的,我们这里是假设up-regulated, HF>LF。只是单边而已。那为啥又是less,而不是greater呢。因为t.test(a,b,alternative)里面的alternative是针对前一个比后一个,即如果是greater的话,就是a比b大。我们这里是LFD放在了前面,假设是HF>LF的话,那么我们写的时候应该是LFD less than HFD。

    举个例子的话

    # 模拟两个值
    > a <- rnorm(20)
    > b <- rnorm(20,mean = 1)
    
    
    # 可以看到下面的统计检验,如果是greater的话,即a>b,是不显著的。而less是显著的
    > t.test(a,b,alternative = "greater")
    
        Welch Two Sample t-test
    
    data:  a and b
    t = -1.5134, df = 30.883, p-value = 0.9298
    alternative hypothesis: true difference in means is greater than 0
    95 percent confidence interval:
     -0.7961539        Inf
    sample estimates:
    mean of x mean of y 
    0.4103696 0.7858315 
    
    > t.test(a,b,alternative = "less")
    
        Welch Two Sample t-test
    
    data:  a and b
    t = -1.5134, df = 30.883, p-value = 0.07017
    alternative hypothesis: true difference in means is less than 0
    95 percent confidence interval:
           -Inf 0.04523018
    sample estimates:
    mean of x mean of y 
    0.4103696 0.7858315 
    
    

    然后先方差齐性检验,后做t-tes的代码,除了上面的,也可以用题目2的代码

    # 用了if else
    myFun <- function(x){
      data1 <- x[1:10]
      data2 <- x[11:18]
      if (var.test(data1,data2)$p.value > 0.05){
        t_result <- t.test(data1,data2,var.equal = T,alternative = "less")
        p_value <- t_result$p.value
        return(p_value)
      } else {
        t_result <- t.test(data1,data2,var.equal = F,alternative = "less")
        p_value <- t_result$p.value
        return(p_value)
      }
      
    }
    
    
    t_test_pvalue <- apply(test_new_var, 1, myFun)
    
    
    
    t_test_padjust <- p.adjust(t_test_pvalue,method = "fdr")
    t_test_padjust[t_test_padjust < 0.05]
    

    也可以用答案里面简略的

    # 方差齐性检验
    p.vartest <- apply(GDS4013_new, 1, function(x)var.test(x[1:10],x[11:18])$p.value)
    
    # 利用cbind合并p-value和表达矩阵
    GDS4013_new_data <- cbind(GDS4013_new, p.vartest)
    
    # t.test检验
    p.value <- apply(GDS4013_new_data, 1, function(x)t.test(x[1:10],x[11:18],alternative = 'less', var.equal = (x[19] >= 0.05))$p.value)
    
    p.fdr <- p.adjust(p.value, "fdr")
    
    > sum(p.fdr<0.05)
    [1] 2
    > sig.p.fdr<-p.fdr[p.fdr<0.05]
    > names(sort(sig.p.fdr))
    [1] "SEMA5B"   "BC048403"
    
    

    富集分析

    富集分析的原理我应该会在考完试有空的时候再重新理一下,我这里就直接放上用fihser exact test(fisher 精确检验)做富集分析的代码。

    稍微提下富集的原理。

    一般差异表达的基因会有上千个,一个个手动看肯定不现实。所以我们就需要从宏观上来看这些基因,用到的一个方法就是富集分析。一般用的比较多的是GO富集分析,就是比如我们得到1000个差异基因,然后我们感兴趣的GO通路在差异基因中有50个基因总的基因有2w个,在这2w个基因中,我们感兴趣的GO有200个。我们想知道感兴趣GO相比于总体来说,在差异基因集中是不是富集的。

    fisher exact test输入的格式是列联表的格式,就像下面

    差异基因 无差异基因
    在通路中 50 200-50=150
    不在通路中 1000-50=950 19000-150=18850

    变成代码就是

    > test <- matrix(c(50,950,150,18850),2)
    > test
         [,1]  [,2]
    [1,]   50   150
    [2,]  950 18850
    > fisher.test(test)
    
        Fisher's Exact Test for Count Data
    
    data:  test
    p-value < 2.2e-16
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     4.670869 9.230328
    sample estimates:
    odds ratio 
       6.61189 
    
    

    看下我们某年的一个GO富集题目

    Usually you are interested in the function indicated by differentially expressed genes, for which GO enrichment is a widely used method. In order to find whether the differentially expression genes (downregulated, p<=0.05) are enriched in “leukocyte activation during immune response” (GO term), please show a conclusive result using fisher exact test. Known genes annotated with this GO are listed in “GO_2_2.txt”

    全部基因总共是14082个基因,前面差异表达做出来的差异基因是617个。然后GO_2_2.txt里面给出了leukocyte activation during immune response这个GO的基因,基因数目为193。我们就需要找差异基因中有多少这个GO通路的基因,答案里用到的intersect,其实就是取交集。我举个例子

    > total <- c(1:10)
    > part <- c(1,4,5)
    
    > intersect(part,total)
    [1] 1 4 5
    
    # 还可以知道有多少
    > length(intersect(part,total))
    [1] 3
    

    然后答案里的intersect出来是10个。这样列联表的数据就全了。

    差异基因 无差异基因
    在通路中 10 193-10=183
    不在通路中 617-10=607 14082-607-183-10=13282

    然后就可以做fisher了

    > test <- matrix(c(10,607,183,13282),nrow = 2)
    > test
         [,1]  [,2]
    [1,]   10   183
    [2,]  607 13282
    > fisher.test(test)
    
        Fisher's Exact Test for Count Data
    
    data:  test
    p-value = 0.5923
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
     0.5609475 2.2650703
    sample estimates:
    odds ratio 
      1.195686
    

    然后有人提到答案里面其实是不准的,因为GO2_2的表中,有几个基因是不在总的基因里面的。

    画出列联表的步骤我觉得可以是这样的

    • 差异基因数目定下了,然后GO表和差异基因取交集,那么就得到了列联表的左上。
    • 然后差异基因减去左上,就得到了左下
    • 然后GO表减去左上,就得到了右上
    • 然后总的基因减去左上、左下、右上,就得到了右下。

    我只是按照了答案的思路来,不保证对哈。大家自行斟酌,原理就是这么个原理。

    相关文章

      网友评论

        本文标题:给女朋友写的生统资料_Part18

        本文链接:https://www.haomeiwen.com/subject/ywoqqctx.html