『生信技能树』R语言20题之我的答案版本

作者: 美式永不加糖 | 来源:发表于2019-06-07 22:14 被阅读3次

    SO HOT!


    原题目在这里

    1. 安装一些R包

    1.1 安装

    1.1.1 ALL

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +     install.packages("BiocManager")
    > 
    > BiocManager::install("ALL")
    

    1.1.2 pasilla

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +    install.packages("BiocManager")
    >
    > BiocManager::install("pasilla")
    

    1.1.3 airway

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +    install.packages("BiocManager")
    >
    > BiocManager::install("airway")
    

    1.1.4 limma

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +     install.packages("BiocManager")
    > 
    > BiocManager::install("limma")
    

    1.1.5 DESeq2

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +    install.packages("BiocManager")
    
    > BiocManager::install("DESeq2")
    

    1.1.6 clusterProfiler

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    +     install.packages("BiocManager")
    > 
    > BiocManager::install("clusterProfiler")
    

    1.1.7 reshape2

    > install.packages('reshape2')
    

    1.1.8 ggplot2

    > install.packages('ggplot2')
    

    1.2 运行

    library()检查已安装的包是否能够运行

    e.g.

    > library(CLL)
    # 载入需要的程辑包:affy
    # 载入需要的程辑包:BiocGenerics
    # 载入需要的程辑包:parallel
    # 
    # 载入程辑包:‘BiocGenerics’
    # 
    # The following objects are masked from ‘package:parallel’:
    #   
    #   clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    # clusterExport, clusterMap, parApply, parCapply, parLapply,
    # parLapplyLB, parRapply, parSapply, parSapplyLB
    # 
    # The following objects are masked from ‘package:stats’:
    #   
    #   IQR, mad, sd, var, xtabs
    # 
    # The following objects are masked from ‘package:base’:
    #   
    #   anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    # dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
    # grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
    # mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
    # rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
    # tapply, union, unique, unsplit, which, which.max, which.min
    # 
    # 载入需要的程辑包:Biobase
    # Welcome to Bioconductor
    # 
    # Vignettes contain introductory material; view with
    # 'browseVignettes()'. To cite Bioconductor, see
    # 'citation("Biobase")', and for packages 'citation("pkgname")'.
    

    若试图运行一个没有安装的包:

    > available.packages()
    #                              Package                         Version         
    # A3                            "A3"                            "1.0.0" 
    > library(A3)
    # Error in library(A3) : 不存在叫‘A3’这个名字的程辑包
    

    2. 了解ExpressionSet对象

    比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表达矩阵(使用exprs函数),查看其大小。

    提取表达矩阵:

    > library('CLL')
    > data("sCLLex")
    > exprSet <- exprs(sCLLex)
    

    查看大小:

    > dim(exprSet)
    # [1] 12625    22
    > str(exprSet)
    #  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
     - attr(*, "dimnames")=List of 2
      ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
      ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
    

    3. 了解 str,head,help函数,作用于 第二步提取到的表达矩阵

    3.1 str()

    > str(exprSet)
    #  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
     - attr(*, "dimnames")=List of 2
      ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
      ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
    

    Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.

    用于查看矩阵的大小(列数、行数),查看列名、行名的示例。

    3.2 head()

    > head(exprSet)
    #          CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL
    # 1000_at    5.743132  6.219412  5.523328  5.340477  5.229904  4.920686
    # 1001_at    2.285143  2.291229  2.287986  2.295313  2.662170  2.278040
    # 1002_f_at  3.309294  3.318466  3.354423  3.327130  3.365113  3.568353
    # 1003_s_at  1.085264  1.117288  1.084010  1.103217  1.074243  1.073097
    # 1004_at    7.544884  7.671801  7.474025  7.152482  6.902932  7.368660
    # 1005_at    5.083793  7.610593  7.631311  6.518594  5.059087  4.855161
    # ...
    

    Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

    用于查看对象的前几行,默认为前6行,可通过 n = xL 调整。

    3.3 help()

    > help("exprs")
    
    R Documentation

    help is the primary interface to the help systems.

    用于调出 R Documentation 对某个函数的说明页面,?fuctionhelp() 的简略写法

    4. 安装并了解 hgu95av2.db 包

    看看 ls(“package:hgu95av2.db”) 后显示的哪些变量?

    安装:

    > if (!requireNamespace("BiocManager", quietly = TRUE))
    + install.packages("BiocManager")
    >  
    > BiocManager::install("hgu95av2.db")
    
    > library(hgu95av2.db)
    > ls("package:hgu95av2.db")
    # [1] "hgu95av2"              "hgu95av2.db"           "hgu95av2_dbconn"     
    # [4] "hgu95av2_dbfile"       "hgu95av2_dbInfo"       "hgu95av2_dbschema"   
    # [7] "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"         
    # [10] "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND" 
    # [13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"   
    # [16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"   
    # [19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"   
    # [22] "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"       
    # [25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"       
    # [28] "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"       
    # [31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"     
    # [34] "hgu95av2SYMBOL"        "hgu95av2UNIGENE"       "hgu95av2UNIPROT"   
    
    > as.list(hgu95av2GO[1]
    # $`1000_at`$`GO:0097110`
    # $`1000_at`$`GO:0097110`$GOID
    # [1] "GO:0097110"
    
    # $`1000_at`$`GO:0097110`$Evidence
    # [1] "IEA"
    
    # $`1000_at`$`GO:0097110`$Ontology
    # [1] "MF"
    

    1000_at为探针,分别显示GOID, Evidence(基因注释证据代码), Ontology(基因本体论中的三个分类,BP = Biological Process; CC = Cellular Component; MF = Molecular Function).

    > as.list(hgu95av2UNIGENE[1])
    # $`1000_at`
    # [1] "Hs.861"
    

    "Hs.861"为NCBI UniGene数据库的ID.

    提取其中的数据:

    > get(featureNames(sCLLex)[1],hgu95av2GO)[1]
    # $`GO:0000165`
    # $`GO:0000165`$GOID
    # [1] "GO:0000165"
    
    # $`GO:0000165`$Evidence
    # [1] "NAS"
    
    # $`GO:0000165`$Ontology
    # [1] "BP"
    

    5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法

    找到TP53基因对应的探针ID

    > head(toTable(hgu95av2SYMBOL))
    #    probe_id  symbol
    # 1   1000_at   MAPK3
    # 2   1001_at    TIE1
    # 3 1002_f_at CYP2C19
    # 4 1003_s_at   CXCR5
    # 5   1004_at   CXCR5
    # 6   1005_at   DUSP1
    

    找到TP53基因对应的探针ID:

    > ids <- toTable(hgu95av2SYMBOL)
    > grep("TP53",ids$symbol)
    # [1]   732   884   966   997  1420  2675  3322  4120  4304  5475  5526 10084
    > ids[grep("TP53",ids$symbol),]
    #         probe_id  symbol
    # 732      1711_at TP53BP1
    # 884      1860_at TP53BP2
    # 966      1939_at    TP53
    # 997    1974_s_at    TP53
    # 1420    31618_at    TP53
    # 2675    33025_at TP53TG5
    # 3322    33749_at TP53TG1
    # 4120    34629_at TP53I11
    # 4304    34822_at TP53BP2
    # 5475    36079_at  TP53I3
    # 5526    36136_at TP53I11
    # 10084 40986_s_at TP53BP1
    

    6. 理解探针与基因的对应关系

    总共多少个基因,基因最多对应多少个探针,是哪些基因,是不是因为这些基因很长,所以在其上面设计多个探针呢?

    > summary(hgu95av2SYMBOL)
    # SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")
    # |
    # | Lkeyname: probe_id (Ltablename: probes)
    # |    Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)
    # |
    # | Rkeyname: symbol (Rtablename: gene_info)
    # |    Rkeys: "A1BG", "A2M", ... (total=61468/mapped=8585)
    # |
    # | direction: L --> R
    

    共有61468个基因,对应上8585个。

    > table(ids$symbol) %>% sort() %>% tail()
    
    # YME1L1  GAPDH INPP4A    MYB PTGER3  STAT1 
    #     7      8      8      8      8      8 
    

    列出 ids 的 'symbol' 列及每个元素出现的个数,排序,显示最后6个;

    得到对应最多探针的基因。

    基因长度与设计在上面的探针数量无关。

    7. 第二步提取到的表达矩阵,找到那些不在 hgu95av2.db 包收录的对应着SYMBOL的探针

    > mapped_probes <- mappedkeys(hgu95av2SYMBOL)
    > exprSetpb <- row.names(exprSet)
    > length(grep("FALSE",(exprSetpb %in% mapped_probes)))
    # [1] 1165
    > grep("FALSE",(exprSetpb %in% mapped_probes))
    # [1]     8    52    98    99   100   109   115   117   128   135   157   168
    # [13]   169   171   174   195   197   199   203   219   225   282   283   292
    # ...
    > head(exprSetpb[index])
    # [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at"  "1090_f_at" "1099_s_at"
    

    不确定对不对....

    8. 过滤表达矩阵

    删除那1165个没有对应基因名字的探针。

    > e = exprSet[rownames(exprSet) %in% mapped_probes,]
    > str(exprSet)
    # num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
    # - attr(*, "dimnames")=List of 2
    #  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
    #  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
    > str(e)
    # num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
    # - attr(*, "dimnames")=List of 2
    #  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
    #  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
    

    9. 整合表达矩阵

    多个探针对应一个基因的情况下,只保留在所有样本里面平均表达量最大的那个探针。

    > maxid = by(e,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))])
    # 按照 ids$symbol 将 e 分组,取行平均值最大的行 的 行名
    # 但,为什么要用 symbol 分组呢 😂
    > uniid = as.character(maxid)
    > uni_e = e[rownames(e) %in% uniid,]
    
    > maxid = by(e,ids$probe_id,function(x) rownames(x)[which.max(rowMeans(x))])
    > uniid = as.character(maxid)
    > uni_e = e[rownames(e) %in% uniid,]
    > str(uni_e)
    # num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
    # - attr(*, "dimnames")=List of 2
    #  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
    #  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
    # emmmmm 结局是一样的,ok我懂了
    

    10. 把过滤后的表达矩阵更改行名为基因的symbol

    因为这个时候探针和基因是一对一关系了。

    > rownames(uni_e) = ids[match(rownames(uni_e),ids$probe_id),2]
    # 将ids的第2列(symbol)对应到 uni_e 的行名
    > library(reshape2)
    > m_e = melt(uni_e)
    > colnames(m_e) = c('symbol','sample','value')
    
    reshape2_melt()

    分组:

    > pd = pData(sCLLex)
    > Disease = pd[,2]
    > table(Disease)
    # Disease
    # progres.   stable 
    #      14        8 
    > m_e$group = rep(Disease,each=nrow(uni_e))
    

    11. 对第10步得到的表达矩阵进行探索

    先画第一个样本的所有基因的表达量的boxplot,hist,density,然后画所有样本的这些图

    > suppressPackageStartupMessages(library(ggplot2))
    > ggplot(m_e,aes(x=sample,y=value,fill=group))+geom_boxplot()
    
    ggplot_box
    > ggplot(m_e,aes(value,fill=group)) + geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)
    
    ggplot_facetwrap
    > ggplot(m_e,aes(value,col=group)) + geom_density()
    

    [图片上传失败...(image-e00840-1559916805026)]

    12. 理解统计学指标mean,median,max,min,sd,var,mad

    并计算出每个基因在所有样本的这些统计学指标,最后按照mad值排序,取top 50 mad值的基因,得到列表。(注意:这个题目出的并不合规,请仔细看。)

    12.1 mean

    算术平均值

    > meanlist=apply(uni_e, 1, mean)
    > head(meanlist)
    #  UTF1        EVX1      ADGRB1        SYN1      CHRNB4        ARR3 
    # -0.14880683 -0.13705268  0.05817044  0.07589677  0.08317654  0.20826165 
    

    12.2 median

    中位数

    > mdlist=apply(uni_e, 1, median)
    > head(mdlist)
    #    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
    # 5.368993 2.333562 3.378740 7.402258 5.902370 3.320628
    

    12.3 max

    最大值

    > maxlist=apply(uni_e, 1, max)
    > head(maxlist)
    #    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
    # 6.251678 2.662170 3.798335 8.802729 9.919125 3.574332 
    

    12.4 min

    最小值

    > minlist=apply(uni_e, 1, min)
    > head(minlist)
    #    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
    # 4.826131 2.222276 3.247276 6.456285 4.097580 3.213493 
    

    12.5 sd

    标准差

    > sdlist=apply(uni_e, 1, sd)
    > head(sdlist)
    #      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
    # 0.36652641 0.09046121 0.12801488 0.58908623 1.73368583 0.10074804 
    

    12.6 var

    方差

    > varlist=apply(uni_e, 1, var)
    > head(varlist)
    #      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
    # 0.13434161 0.00818323 0.01638781 0.34702258 3.00566656 0.01015017 
    

    12.7 mad

    绝对中位差Median Absolute Deviation

    > madlist=apply(uni_e, 1, mad)
    > head(madlist)
    #      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
    # 0.22779776 0.06516670 0.08294023 0.38497146 1.43854685 0.09476130 
    

    top 50 mad值的基因:

    > tail(sort(madlist),50)
    #     PFN2   TNFSF9 ARHGAP44   P2RY14  THEMIS2      LPL    ANXA4    DUSP6 
    # 1.481294 1.485155 1.488032 1.505107 1.507287 1.532212 1.533327 1.551320 
    #    DUSP5     H1FX     FLNA   CLEC2B   TSPYL2   ZNF266   S100A9    NR4A2 
    # 1.553782 1.557412 1.574436 1.578055 1.582430 1.587748 1.608285 1.612875 
    #    TGFBI     ARF6    APBB2     VCAN    RBM38     CAPG   PLXNC1     RGS2 
    # 1.643149 1.654548 1.674443 1.681098 1.703638 1.713747 1.718297 1.770151 
    #   RNASE6    VAMP5     CYBB     GNLY     CCL3     OAS1    TRIB2  ZNF804A 
    # 1.775430 1.791017 1.811973 1.814364 1.862311 1.883509 1.937294 1.986163 
    #      IGH    PCDH9    VIPR1   COBLL1  GUSBP11   S100A8      HBB   LHFPL2 
    # 2.090866 2.144223 2.171912 2.179666 2.204212 2.220970 2.267136 2.317045 
    #     FCN1    ZAP70    IGLC1   LGALS1      FOS   SLAMF1     TCF7      DMD 
    # 2.371590 2.579046 2.590895 2.600604 2.938078 2.944105 2.993352 3.071541 
    #  IGF2BP3   FAM30A 
    # 3.234011 3.982191 
    

    13. 根据第12步骤得到top 50 mad值的基因列表来取表达矩阵的子集

    并且热图可视化子表达矩阵。试试看其它5种热图的包的不同效果。

    取子集:

    > top50gene=tail(sort(madlist),50)
    > top50gene=as.data.frame(top50gene)
    > top50genelist=rownames(top50gene)
    > top50matrix=uni_e[top50genelist,]
    > ct=scale(top50matrix,center=T,scale=F)  ## 中心化
    > nmlztop50matrix=scale(ct,center=T,scale=T) ## 标准化
    
    > pheatmap::pheatmap(nmlztop50matrix)
    
    pheatmap
    > heatmap(nmlztop50matrix)
    
    heatmap
    > ggplot(ml_nmtop50,aes(sample,symbol))+
    +   geom_tile(aes(fill=value),colour = "white")+
    +   scale_fill_gradient(low = "white",high = "steelblue")
    
    ggplot_heatmap
    > library(gplots)
    > heatmap.2(nmlztop50matrix,key = T,symkey = FALSE,density.info="none",trace="none")
    
    gplots_heatmap.2
    > library(lattice)
    > library(latticeExtra)
    > x  <- t(as.matrix(scale(nmlztop50matrix)))
    > dd.row <- as.dendrogram(hclust(dist(x)))
    > row.ord <- order.dendrogram(dd.row)
    > levelplot(t(nmlztop50matrix),aspect = "fill",xlab="sample",ylab="symbol",colorkey = list(space = "left"),legend=list(right=list(fun=dendrogramGrob,args=list(x=dd.row,rod=row.ord,side='right',size=5))))
    > levelplot(t(nmlztop50matrix),aspect =
    +             "fill",xlab="sample",ylab="symbol",colorkey =
    +             list(space = "left"),legend=
    +             list(right=list(fun=dendrogramGrob,args=
    +                               list(x=dd.row,rod=row.ord,side='right',size=5))))
    
    lattice_levelplot

    emmm PDF示例里的代码跑出来top50matirx里是很多基因的重复,so这是我琢磨出来的结果

    14. 取不同统计学指标mean,median,max,mean,sd,var,mad的各top50基因列表

    使用UpSetR包来看他们之间的overlap情况。

    各列表:

    > e_mean=tail(sort(meanlist),50)
    > e_md=tail(sort(mdlist),50)
    > e_max=tail(sort(maxlist),50)
    > e_min=tail(sort(minlist),50)
    > e_sd=tail(sort(sdlist),50)
    > e_var=tail(sort(varlist),50)
    > e_mad=tail(sort(madlist),50)
    
    > suppressPackageStartupMessages(library("UpSetR"))
    > library(magrittr)
    > e_all = data.frame(all,
    +                    e_mean=ifelse(all %in% names(e_mean),1,0),
    +                    e_md=ifelse(all %in% names(e_md),1,0),
    +                    e_max=ifelse(all %in% names(e_max),1,0),
    +                    e_min=ifelse(all %in% names(e_min),1,0),
    +                    e_sd=ifelse(all %in% names(e_sd),1,0),
    +                    e_var=ifelse(all %in% names(e_var),1,0),
    +                    e_mad=ifelse(all %in% names(e_mad),1,0)
    + )
    > upset(e_all,nsets = 7,sets.bar.color = "#BD1111")
    
    UpSetR_magrittr

    15. 在第二步的基础上面提取CLL包里面的data(sCLLex) 数据对象的样本的表型数据。

    > pd=pData(sCLLex)
    > grouplist=as.character(pd[,2])
    > table(grouplist)
    # grouplist
    # progres.   stable 
    #       14        8 
    

    16. 对所有样本的表达矩阵进行聚类并且绘图

    然后添加样本的临床表型数据信息(更改样本名)

    > clust = t(exprSet)
    > rownames(clust) = colnames(exprSet)
    > View(clust)
    > View(exprSet)
    > clust_dist = dist(clust,method = "euclidean")
    > hc = hclust(clust_dist,"ward.D")
    > suppressPackageStartupMessages(library(factoextra))
    > fviz_dend(hc, k = 4 ,cex = 0.5,k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),color_labels_by_k = TRUE, rect = TRUE)
    
    hclust_fivzdend

    17.对所有样本的表达矩阵进行PCA分析并且绘图,添加表型信息

    > pca_data <- prcomp(t(exprSet),scale=TRUE)
    > pcx <- data.frame(pca_data$x)
    > pcr <- cbind(samples=rownames(pcx),grouplist, pcx) 
    > ggplot(pcr, aes(PC1, PC2))+geom_point(size=5, aes(color=grouplist)) +
    +   geom_text(aes(label=samples),hjust=-0.1, vjust=-0.3)
    
    procmp_ggplot

    18. 根据表达矩阵及样本分组信息进行批量T检验,得到检验结果表格

    分组,求p值

    > gl = as.factor(grouplist)
    > group1 = which(grouplist == levels(gl)[1])
    > group2 = which(grouplist == levels(gl)[2])
    # e = exprSet[rownames(exprSet) %in% mapped_probes,]
    > et1 = e[,group1]
    > et2 = e[,group2]
    > data_t = cbind(et1,et2)
    > pvals = apply(e, 1, function(x){
    +   t.test(as.numeric(x)~grouplist)$p.value
    + })    ## p值
    > p.adj = p.adjust(pvals, method = "BH")  ## 校正后的p值
    

    求log2FC, 做批量t检验

    > data_mean_c = rowMeans(et1) ## control
    > data_mean_t = rowMeans(et2) ## treatment
    > log2FC = data_mean_t-data_mean_c
    > DEG_t.test = cbind(data_mean_c, data_mean_t, log2FC, pvals, p.adj)
    > DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),] ##排序
    > DEG_t.test=as.data.frame(DEG_t.test)
    > head(DEG_t.test)
    #         data_mean_c data_mean_t     log2FC        pvals     p.adj
    # 36129_at    7.875615    8.791753  0.9161377 1.629755e-05 0.1867699
    # 37676_at    6.622749    7.965007  1.3422581 4.058944e-05 0.2211373
    # 33791_at    7.616197    5.786041 -1.8301554 6.965416e-05 0.2211373
    # 39967_at    4.456446    2.152471 -2.3039752 8.993339e-05 0.2211373
    # 34594_at    5.988866    7.058738  1.0698718 9.648226e-05 0.2211373
    # 32198_at    4.157971    3.407405 -0.7505660 2.454557e-04 0.3192169
    

    19. 使用limma包对表达矩阵及样本分组信息进行差异分析,得到差异分析表格

    重点看logFC和P值,画火山图(就是logFC和-log10(P值)的散点图)。

    > suppressPackageStartupMessages(library(limma))
    > design1=model.matrix(~factor(grouplist))  ## 设计矩阵
    > colnames(design1)=levels(factor(grouplist))
    > rownames(design1)=colnames(exprSet)
    > fit1 = lmFit(exprSet,design1)
    > fit1=eBayes(fit1)
    > options(digits = 3) 
    > mtx1 = topTable(fit1,coef=2,adjust='BH',n=Inf) 
    > DEG_mtx1 = na.omit(mtx1) ##去除缺失值
    > head(DEG_mtx1)
    #           logFC AveExpr     t  P.Value adj.P.Val    B
    # 39400_at  1.028    5.62  5.84 8.34e-06    0.0334 3.23
    # 36131_at -0.989    9.95 -5.77 9.67e-06    0.0334 3.12
    # 33791_at -1.830    6.95 -5.74 1.05e-05    0.0334 3.05
    # 1303_at   1.384    4.46  5.73 1.06e-05    0.0334 3.04
    # 36122_at -0.780    7.26 -5.14 4.21e-05    0.1062 1.93
    # 36939_at -2.547    6.92 -5.04 5.36e-05    0.1128 1.74
    

    画图:

    > DEG=DEG_mtx1
    > logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC)) ) 
    > DEG$result = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff,  ## p.Value<0.05,|logFC|>1
    +                               ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))
    > title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), #round保留小数位数
    +                     
    +                     '\nThe number of up gene is ',nrow(DEG[DEG$result =='UP',]) ,
    +                     
    +                     '\nThe number of down gene is ',nrow(DEG[DEG$result =='DOWN',])
    + )
    > ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=result)) +
    +   geom_point(alpha=0.4, size=1.75) +
    +   theme_set(theme_set(theme_bw(base_size=20)))+
    +   xlab("log2 fold change") + ylab("-log10 p-value") +
    +   ggtitle( title ) + theme(plot.title = element_text(size=15,hjust = 0.5))+
    +   scale_colour_manual(values = c('blue','black','red'))
    
    ggplot_volcano

    20.对T检验结果的P值和limma包差异分析的P值画散点图,看看哪些基因相差很大?

    > DEG_t.test = DEG_t.test[rownames(DEG_mtx1),]
    > colnames(DEG_t.test)
    # [1] "data_mean_c" "data_mean_t" "log2FC"      "pvals"      
    # [5] "p.adj"      
    > colnames(DEG_mtx1)
    # [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val"
    # [6] "B"        
    

    画图:

    > plot(DEG_t.test[,4],DEG_mtx1[,4])
    > plot(-log10(DEG_t.test[,4]),-log10(DEG_mtx1[,4]))
    
    ggplot_dot ggplot_dot_lg

    最后,向大家隆重推荐生信技能树的一系列干货!

    1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
    2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
    3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

    相关文章

      网友评论

        本文标题:『生信技能树』R语言20题之我的答案版本

        本文链接:https://www.haomeiwen.com/subject/vhnvxctx.html