GO富集之topGO

作者: x2yline | 来源:发表于2018-04-26 11:35 被阅读267次

    topGO手册中的实例实现

    手册地址:http://bioconductor.uib.no/2.7/bioc/vignettes/topGO/inst/doc/topGO.pdf

    快速入门部分可以参考:https://rpubs.com/aemoore62/TopGo_colMap_Func_Troubleshoot

    1. 导入基因和注释数据

    library(topGO)导入topGO包后,会自动创建三个环境,即GOBPTerm,GOCCterm和GOMFTerm,这些环境是有GO.db包中的GOTERM环境为基础创建的,以方便载入GO

    library(topGO)
    BPterms <- ls(GOBPTerm)
    head(BPterms)
    ## [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000011" "GO:0000012" "GO:0000017"
    

    genefilter包

    genefilter包用来过滤基因,第一个参数为矩阵或者ExpressionSet对象,第二个参数flist接受一系列函数(经过filterfun合并的),这些函数的参数必须以向量(对应矩阵的每一行)为对象,返回逻辑值,pOverA函数也是genefilter包内置的构造函数,p表示比例,A表示数值,表示必须要有p比例的值超过A

    library(genefilter)
    ff<- pOverA(p=.1, 10)
    flist <- filterfun(ff)
    
    set.seed(2018-4-24)
    exprA <- matrix(rnorm(1000, 7), ncol = 10)
    
    ans <- genefilter(exprA, 
                      flist=filterfun(ff, 
                                      function(x) any(x<6)))
    exprB <- exprA[ans,]
    

    2. topGOdata对象

    构建topGOdata对象的3个数据

    1. 基因某种ID的列表(可以有另一个对应的分数值,如p值或t统计量,或者是差异表达值)
    2. 基因的这种ID与GO的映射表,在ID为芯片的探针ID时,可以直接使用bioconductor的芯片注释包如hgu95av2.db
    3. GO的层次关系数据,这个结果可以从GO.db包获得,topGO也只支持GO.db包定义的层次结构

    构建topGOdata对象的参数:

    1. ontology:字符串,代表所关注的ontology类别包括“BP”,“MF”或“CC”
    2. description:字符串,对该研究的简介
    3. allGenes:带名字的vector可以是数值或factor类型,vector的name属性为基因的某种ID号,这些基因代表所有的基因总数
    4. geneSelectionFun:根据allGenes的数值选出显著的目标基因的函数,如果allGenes是数值向量则该参数不可省略,如果是factor(0,1)的向量则不需要指定
    5. nodeSize:过滤掉一些低富集的GO term,根据nodeSize的值过滤
    6. annotationFun:注释函数,把gene的某种ID号映射为GO terms的编号,其选项包括
      • annFUN.db:表示从安装的包如hgu95av2.db中获取对应的注释
        • 若为annFUN.db,则还需加参数affyLib,值为芯片注释包的名称
    • annFUN.org:表示从安装的包如org.XXX中获取对应的注释,目前该函数支持Entrez,Genebank,Alias,Ensembl,GeneSymbol,GeneName,UniGene的ID
      • 若为annFUN.org,则还需加参数如mapping="org.Hs.eg.db", ID="Ensembl"
    • annFUN.gene2GO:当用户提供gene-to-GOs的注释数据时使用该函数
      • 若为annFUN.gene2GO,则还需加参数gene2GO,值为读入的list变量
    • annFUN.GO2gene:当用户提供GO-to-genes的注释数据时使用该函数
      • 若为annFUN.GO2gene,则还需加参数GO2gene,值为读入的list变量
    • annFUN.file:表示从文件读取注释数据如gene2GO文本文件或GO2genes文本文件
      • 若为annFUN.file,则还需加file参数,值为相应的ID和GO号(多个GO以逗号隔开)的文本文件路径

    2.1 构建注释(可以自定义选择某种证据强度的注释)

    2.1.1 从NCBI下载go annotation,并处理为topGO可识别的list数据

    下载压缩包并解压:https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz

    该文件包含所有物种的entrzID和GO的对应关系,人的物种编号为“9606”

    library(data.table)
    geneID2go <- fread('grep "^9606\t" NCBI_data/gene2go')
    colnames(geneID2go) <- unlist(read.delim("NCBI_data/gene2go",
                                           header=F, nrow=1, 
                                           stringsAsFactors=F))
    geneID2go_list <- by(geneID2go$GO_ID, geneID2go$GeneID,
                         function(x) as.character(x))
    head(geneID2go_list)
    ## $`1`
    ##  [1] "GO:0002576" "GO:0003674" "GO:0005576" "GO:0005576" "GO:0005615"
    ##  [6] "GO:0008150" "GO:0031012" "GO:0031093" "GO:0034774" "GO:0043312"
    ## [11] "GO:0070062" "GO:0072562" "GO:1904813"
    library(topGO)
    goID2gene_list <- inverseList(geneID2go_list)
    

    2.1.2 从geneontology官网下载go annotation

    下载并解压:http://geneontology.org/gene-associations/goa_human.gaf.gz

    library(data.table)
    geneSymbol2go <- fread(
      'awk \'{print $3 "," $4}\' gene_ontology_data/goa_human.gaf | grep "GO:"', 
      header=FALSE, sep=",")
    
    geneSymbol2go_list <- by(geneSymbol2go$V2, geneSymbol2go$V1,
                             function(x) as.character(x))
    
    head(geneSymbol2go_list)
    ## $A0A075B6Q4
    ## [1] "GO:0000056" "GO:0005634" "GO:0030688" "GO:0031902" "GO:0034448" "GO:0042274"
    library(topGO)
    goSymbol2gene_list <- inverseList(geneSymbol2go_list)      
    

    2.1.3 使用实例数据与readMappings函数

    readMappings函数是topGO包中的函数,读入文件格式为

    068724  GO:0005488, GO:0003774, GO:0001539, GO:0006935, GO:0009288
    119608  GO:0005634, GO:0030528, GO:0006355, GO:0045449, GO:0003677, GO:0007275
    133103  GO:0015031, GO:0005794, GO:0016020, GO:0017119, GO:0000139
    121005  GO:0005576
    155158  GO:0005488
    

    注:topGO包中的geneid2go实例数据只有100个基因

    file = system.file("examples/geneid2go.map",
                     package="topGO")
    
    gene2GO_data <- readMappings(file)
    

    2.2 读入(测序研究或芯片研究中的)所有基因【不带score】

    geneNames <- names(geneSymbol2go_list)
    tail(geneNames)
    ## [1] "ZXDC"   "ZYG11A" "ZYG11B" "ZYX"    "ZZEF1"  "ZZZ3" 
    

    假设我们的芯片中所有基因为geneNames向量中的所有基因,所得的差异基因为200个:

    set.seed(2018-04-25)
    myInterestingGenes <- sample(geneNames, 200)
    tail(myInterestingGenes)
    ## [1] "PARG"   "MIOX"   "ERGIC2" "CHST10" "FOXR2"  "FAM89B"
    

    构建的geneLIst_nscore变量如下,其中geneLIst包括所有的基因,差异基因的值为1,否则为0,其名称为基因的ID:

    geneList_nscore <- factor(as.integer(geneNames %in% myInterestingGenes))
    names(geneList_nscore) <- geneNames
    tail(geneList_nscore)
    ## ZXDC ZYG11A ZYG11B    ZYX  ZZEF1   ZZZ3 
    ##    0      0      0      0      0      0 
    ## Levels: 0 1
    

    2.4 读入带score的全部基因

    library(ALL)
    data(ALL)
    library(genefilter)
    selectedProbes <- genefilter(ALL, filterfun(pOverA(0.2, log2(100)),
                                                function(x) (IQR(x) > 0.25)))
    eset <- ALL[selectedProbes, ]
                                                
    y <- as.integer((sapply(eset$BT, 
                            function(x) return(substr(x, 1, 1)=="T"))))
    y
    ##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    ##  [45] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    ##  [89] 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    
    ## 需要安装multtest包才能自动调用该函数,该函数返回值为FDR
    geneList <- getPvalues(exprs(eset), classlabel=y, alternative="two.sided")
    

    其中geneList的值为p值,名称为探针名;读入geneList后再构建筛选差异基因的筛选函数

    diffGenesFUN <- function(allScore) {
        return(allScore < 0.05)
    }
    x <- diffGenesFUN(geneList)
    sum(x)
    ## [1] 1018
    

    geneList中有1018个是差异基因

    2.3 构建topGOdata对象

    GOdata_nscore_MF <- new("topGOdata",
                  ontology="MF",
                  allGenes=geneList_nscore,
                  annot=annFUN.gene2GO, gene2GO=geneSymbol2go_list,
                  nodeSize=5)
    GOdata_nscore_MF
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    ##  Description:
    ##    -   
    ## 
    ##  Ontology:
    ##    -  MF 
    ## 
    ##  20057 available genes (all genes from the array):
    ##    - symbol:  A0A075B6Q4 A0A087WUJ7 A0A087WUU8 A0A087WUV0 A0A087WV48  ...
    ##    - 200  significant genes. 
    ## 
    ##  17630 feasible genes (genes that can be used in the analysis):
    ##    - symbol:  A0A087WUJ7 A0A087WUU8 A0A087WUV0 A0A087WV48 A0A087WW49  ...
    ##    - 171  significant genes. 
    ## 
    ##  GO graph (nodes with at least  5  genes):
    ##    - a graph with directed edges
    ##    - number of nodes = 1867 
    ##    - number of edges = 2501 
    ## 
    ## ------------------------- topGOdata object -------------------------
    

    说明我们构建的topGOdata对象中共有20057个基因,其中差异基因有200个,经注释的总基因有17630个,经注释的差异基因有171个

    GOdata_score_MF <- new("topGOdata", 
                           ontology = "MF", 
                           allGenes = geneList, 
                           geneSelectionFun = diffGenesFUN, 
                           annot = annFUN.db, affyLib = "hgu95av2.db", 
                           nodeSize = 5)
    
    GOdata_score_MF
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    ##  Description:
    ##    -   
    ## 
    ##  Ontology:
    ##    -  MF 
    ## 
    ##  4101 available genes (all genes from the array):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 1018  significant genes. 
    ## 
    ##  3875 feasible genes (genes that can be used in the analysis):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 974  significant genes. 
    ## 
    ##  GO graph (nodes with at least  5  genes):
    ##    - a graph with directed edges
    ##    - number of nodes = 993 
    ##    - number of edges = 1289 
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    

    对有注释的和没有注释的基因数目进行可视化

    allGenes <- featureNames(ALL)
    group <- integer(length(allGenes))+1
    group[allGenes %in% genes(GOdata_score_MF)] <- 0
    group[!selectedProbes] <- 2
    
    group <- factor(group, 
                    labels=c("Used", "Not annotated", "Filtered"))
    table(group)
    ## group
    ##          Used Not annotated      Filtered 
    ##          3875           226          8524 
    
    pValues <- getPvalues(exprs(ALL), classlabel=y, alternative="two.sided")
    geneVar <- apply(exprs(ALL), 1, var)
    dd <- data.frame(x = geneVar[allGenes], y=log10(pValues[allGenes]), groups=group)
    lattice::xyplot(y~x|group, data=dd, groups=group)
    

    一个理想的图是Used中的点基本都在右下,其余都在左上,由图中可以看出被过滤掉的基因有很多都是显著差异的,因此在实际应用中,过滤过程更[图片上传中...(1524667952526.png-4ef403-1524713389307-0)]
    加保守

    2.4 toGOdata对象的操作

    2.4.1 描述
    description(GOdata_score_MF)
    ## [1] "ALL data analysis Object modified on: 18-0425"
    
    description(GOdata_score_MF) <- paste("ALL data analysis.",
                                          "Object modified on:",
                                          format(Sys.time(), "%y-%m%d"))
    
    description(GOdata_score_MF)
    ## [1] "ALL data analysis. Object modified on: 18-0425"
    
    2.4.2 获取注释的基因
    head(genes(GOdata_score_MF))
    ## [1] "1000_at"   "1005_at"   "1007_s_at" "1008_f_at" "1009_at"   "100_g_at"
    
    numGenes(GOdata_score_MF)
    ## [1] 3875
    
    2.4.3 获取基因(有GO注释)的分数(p值)
    a <- geneScore(GOdata_score_MF, 
                   whichGenes = names(geneList),
                   use.names=FALSE)
    length(a)
    ## [1] 3875
    length(geneList)
    ## [1] 4101
    
    2.4.4 更新topGOdata数据
    .geneList <- geneScore(GOdata_score_MF, use.names = TRUE)
    GOdata_score_MF
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    ##  Description:
    ##    -  ALL data analysis Object modified on: 18-0425 
    ## 
    ##  Ontology:
    ##    -  MF 
    ## 
    ##  4101 available genes (all genes from the array):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 1018  significant genes. 
    ## 
    ##  3875 feasible genes (genes that can be used in the analysis):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 974  significant genes. 
    ## 
    ##  GO graph (nodes with at least  5  genes):
    ##    - a graph with directed edges
    ##    - number of nodes = 993 
    ##    - number of edges = 1289 
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    
    updateGenes(GOdata_score_MF, .geneList, diffGenesFUN)
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    ##  Description:
    ##    -  ALL data analysis Object modified on: 18-0425 
    ## 
    ##  Ontology:
    ##    -  MF 
    ## 
    ##  3875 available genes (all genes from the array):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 974  significant genes. 
    ## 
    ##  3875 feasible genes (genes that can be used in the analysis):
    ##    - symbol:  1000_at 1005_at 1007_s_at 1008_f_at 1009_at  ...
    ##    - score :  0.0068656 0.53047 2.148059e-09 1 0.00022923  ...
    ##    - 974  significant genes. 
    ## 
    ##  GO graph (nodes with at least  5  genes):
    ##    - a graph with directed edges
    ##    - number of nodes = 993 
    ##    - number of edges = 1289 
    ## 
    ## ------------------------- topGOdata object -------------------------
    ## 
    
    2.4.5 GO terms的相关操作
    1. GO图信息
    graph(GOdata_score_MF)
    ## A graphNEL graph with directed edges
    ## Number of Nodes = 993 
    ## Number of Edges = 1289 
    
    1. 提取所有相关GO
    allRelatedGO <- usedGO(GOdata_score_MF)
    head(allRelatedGO)
    ## [1] "GO:0000049" "GO:0000149" "GO:0000166" "GO:0000175" "GO:0000217" "GO:0000287"
    
    1. 根据GO ID提取相关基因
    selected.terms <- sample(usedGO(GOdata_score_MF), 10)
    
    num.ann.genes <- countGenesInTerm(GOdata_score_MF, selected.terms)
    num.ann.genes
    ## GO:0019209 GO:0000983 GO:0032357 GO:0019213 GO:0004198 GO:0042301 GO:0036459 GO:0001846 
    ##         35         15          5         20          5          5         33          7 
    ## GO:0005125 GO:0032451 
    ##         27         18 
    
    ann.genes <- genesInTerm(GOdata_score_MF, selected.terms)
    ann.genes[1:2]
    ## $`GO:0098973`
    ## [1] "32318_s_at"              "34160_at"                "AFFX-HSAC07/X00351_3_at"
    ## [4] "AFFX-HSAC07/X00351_3_st" "AFFX-HSAC07/X00351_5_at" "AFFX-HSAC07/X00351_M_at"
    
    ## $`GO:0099106`
    ##  [1] "1158_s_at"  "1336_s_at"  "155_s_at"   "160029_at"  "1675_at"    "31694_at"  
    ##  [7] "31900_at"   "31901_at"   "32498_at"   "32558_at"   "32715_at"   "32749_s_at"
    ## [13] "33458_r_at" "34608_at"   "34609_g_at" "34759_at"   "34981_at"   "36900_at"  
    ## [19] "36935_at"   "37184_at"   "38516_at"   "38604_at"   "38774_at"   "38831_f_at"
    ## [25] "39010_at"   "39011_at"   "41143_at"   "41288_at"   "457_s_at"   "755_at"    
    ## [31] "911_s_at"   "955_at"    
    ## 
    
    scoresInTerm(GOdata_score_MF, selected.terms)[1]
    ## $`GO:0098973`
    ## [1] 1.0000000 0.1708372 1.0000000 1.0000000 1.0000000 1.0000000
    ## 
    
    scoresInTerm(GOdata_score_MF, selected.terms, use.names = TRUE)[1]
    ## $`GO:0098973`
    ##              32318_s_at                34160_at AFFX-HSAC07/X00351_3_at 
    ##               1.0000000               0.1708372               1.0000000 
    ## AFFX-HSAC07/X00351_3_st AFFX-HSAC07/X00351_5_at AFFX-HSAC07/X00351_M_at 
    ##               1.0000000               1.0000000               1.0000000 
    ## 
    
    1. 选择相应的terms对基因及差异基因数目进行统计
    termStat(GOdata_score_MF, selected.terms)
    ##            Annotated Significant Expected
    ## GO:0098973         6           0     1.51
    ## GO:0099106        32          12     8.04
    ## GO:0009055        46          11    11.56
    ## GO:0001104        45          13    11.31
    ## GO:0005161         5           2     1.26
    ## GO:0017160         7           3     1.76
    ## GO:0051536        16           2     4.02
    ## GO:0016866         6           3     1.51
    ## GO:0005509       136          45    34.18
    ## GO:0098772       510         146   128.19
    

    3. 富集统计分析

    topGO支持的统计分析分为3类

    • 根据基因数目进行的统计分析,只需提供一列基因名称就能进行统计分析,Fisher精确检验,超几何分布检验,二项分布检验都属于这个家族
    • 根据基因对应的分数或排序进行的检验,包括Kolmogorov-Smirnov检验(也称为GSEA,ks检验),Gentleman分类,t检验等
    • 根据基因表达数据进行统计检验,如Goeman全局检验等

    topGOdata对象可以通过两种方式运行统计检验,第一种可以让用户自己定义统计检验过程(高级R用户),第二种更容易但也缺乏更多自定义的操作。

    注:weight01是weight和elim的混合算法,默认模式即为weight01

    3.1 方式1:定义和进行统计检验

    进行富集分析的主要函数是getSigGroups(),包括两个参数,第一个是topGOdata对象,第二个是groupStats类或他的衍生类,groupStats类及其子类的关系图如下:

    3.1.1 groupStats

    使用Fisher精确检验计算GO:0046961这个脂代谢term的富集程度,需要先定义所有基因,差异基因

    goID <- "GO:0046961"
    gene.universe <- genes(GOdata_score_MF)
    go.genes <- genesInTerm(GOdata_score_MF, goID)[[1]]
    go.genes
    ##  [1] "32444_at" "33875_at" "34889_at" "35770_at" "36028_at" "36167_at" "36994_at" "37367_at"
    ##  [9] "37395_at" "38686_at" "39326_at"
    
    sig.genes <- sigGenes(GOdata_score_MF)
    

    接下来就可以构建groupStats类了,classicCount类是groupStats的一个子类

    my.group <- new("classicCount", testStatistic = GOFisherTest, name = "fisher", allMembers = gene.universe, 
        groupMembers = go.genes, sigMembers = sig.genes)
    
    contTable(my.group)
    ##         sig notSig
    ## anno      4      7
    ## notAnno 970   2894
    
    # 可以直接使用runTest函数对groupStats对象进行统计检验
    runTest(my.group)
    ## [1] 0.2902583
    

    testStatistic参数代表进行统计量计算的函数,GOFisherTest是topGO包定义的函数,能进行Fisher精确检验,用户能自定义该计算统计量的函数,name是注释信息

    除了ClassicCount类,elimCount类也是groupStats的子类,随机排除25%的注释

    set.seed(2018 - 4 - 25)
    elim.genes <- sample(go.genes, length(go.genes)/4)
    elim.group <- new("elimCount", testStatistic = GOFisherTest, name = "fisher_elim", allMembers = gene.universe, 
        groupMembers = go.genes, sigMembers = sig.genes, elim = elim.genes)
    contTable(elim.group)
    ##         sig notSig
    ## anno      3      6
    ## notAnno 970   2894
    
    # 可以直接使用runTest函数对groupStats对象进行统计检验
    runTest(elim.group)
    ## [1] 0.1682705
    

    注:groupStats类并不依赖于GO(由我们传入的参数即可知道不管是GO还是KEGG等其他类型的集合均可适用)

    3.1.2 进行假设检验

    如果只适用基因列表进行假设检验则构建classicCount或elimCount对象

    test.stat <- new("classicCount", testStatistic=GOFisherTest, 
                     name="Fisher test")
    
    resultFisher <- getSigGroups(GOdata_score_MF, test.stat)
    
    resultFisher
    ## 
    ## Description: ALL data analysis Object modified on: 18-0425 
    ## Ontology: MF 
    ## 'classic' algorithm with the 'Fisher test' test
    ## 993 GO terms scored: 50 terms with p < 0.01
    ## Annotation data:
    ##     Annotated genes: 3875 
    ##     Significant genes: 974 
    ##     Min. no. of genes annotated to a GO: 5 
    ##     Nontrivial nodes: 905 
    

    如果同时适用score,则需构建classicScore或elimScore对象(如上图所示)

    test.stat <- new("classicScore", testStatistic = GOKSTest, name = "KS test")
    resultKS <- getSigGroups(GOdata_score_MF, test.stat)
    
    resultKS
    ## 
    ## Description: ALL data analysis Object modified on: 18-0425 
    ## Ontology: MF 
    ## 'classic' algorithm with the 'KS test' test
    ## 993 GO terms scored: 65 terms with p < 0.01
    ## Annotation data:
    ##     Annotated genes: 3875 
    ##     Significant genes: 974 
    ##     Min. no. of genes annotated to a GO: 5 
    ##     Nontrivial nodes: 993 
    
    test.stat <- new("elimScore", testStatistic = GOKSTest, name = "KS test elim")
    resultKSElim <- getSigGroups(GOdata_score_MF, test.stat)
    
    resultKSElim
    ## 
    ## Description: ALL data analysis Object modified on: 18-0425 
    ## Ontology: MF 
    ## 'elim' algorithm with the 'KS test elim : 0.01' test
    ## 993 GO terms scored: 32 terms with p < 0.01
    ## Annotation data:
    ##     Annotated genes: 3875 
    ##     Significant genes: 974 
    ##     Min. no. of genes annotated to a GO: 5 
    ##     Nontrivial nodes: 993 
    

    注意选择的class与检验方法的兼容性,如上图所示,weight不能与score相关的检验兼容,而与Fisher test兼容

    test.stat <- new("weightCount", testStatistic = GOFisherTest, 
                     name = "Fisher test", sigRatio = "ratio")
    resultFisherWeight <- getSigGroups(GOdata_score_MF, test.stat)
    resultFisherWeight
    ## 
    ## Description: ALL data analysis Object modified on: 18-0425 
    ## Ontology: MF 
    ## 'weight' algorithm with the 'Fisher test : ratio' test
    ## 993 GO terms scored: 20 terms with p < 0.01
    ## Annotation data:
    ##     Annotated genes: 3875 
    ##     Significant genes: 974 
    ##     Min. no. of genes annotated to a GO: 5 
    ##     Nontrivial nodes: 905 
    

    3.2 P值是否校正的问题

    注意检验所得p值为原始p值,未经过多重检验校正,不校正的原因如下:

    1. 在很多情况下,富集分析得到的p值的分布可能不太极端,在这些情况下FDR或FWER校正方法产生较为保守的p值,导致没有“显著”的p值,丢失重要的GO terms及相关信息,在这种情况下,研究者往往关注GO terms的排序,而不是它们是否有一个显著的FDR。
    2. 富集分析包括了多个步骤和许多假设,如对GO terms进行的Fisher精确检验。进行多重检验校正远远不足以控制误差率。
    3. 对elim和weight的检验方法来说,多重检验校正的方法变得更加不可靠,因为用这些方法就算的p值是依赖于相邻的GO term的,而多重检验校正的前提假设是这些检验都是独立的。

    3.3 更高层面的统计检验(用户友好)

    用runTest函数可以很快速的进行统计检验,统计方法的选择通过algorithm(默认为weigth01)和statistc参数决定,如

    resultFis <- runTest(GOdata_score_MF, 
                         algorithm="classic",
                         statistic="fisher")
    
    weight01.fisher <- runTest(GOdata_score_MF, 
                               statistic = "fisher")
    
    weight01.t <- resultt<- runTest(GOdata_score_MF, 
                         algorithm="weight01",
                         statistic="t")
    
    elim.ks <- resultt<- runTest(GOdata_score_MF, 
                         algorithm="elim",
                         statistic="ks")
    

    可用的参数如下(也可以看上面的图),注意有些组合不兼容

    whichTests()
    ## [1] "fisher"     "ks"         "t"          "globaltest" "sum"        "ks.ties"   
    
    whichAlgorithms()
    ## [1] "classic"     "elim"        "weight"      "weight01"    "lea"         "parentchild"
    

    4. 富集结果及可视化

    4.1 topGOresult对象

    toGOresult对象非常简单,只有p值或统计量(统称为score),score函数并没有对返回值进行排序

    pvalFis <- score(resultFis)
    head(pvalFis)
    ## GO:0000049 GO:0000149 GO:0000166 GO:0000175 GO:0000217 GO:0000287 
    ##  0.9383324  0.9779475  0.3904532  0.8241708  0.3247581  0.3125296 
    

    可用统计这些score的分布

    hist(pvalFis, 100, xlab="p-values")
    

    score函数还有一个参数whichGO,可以指定GO ID

    pvalWeight <- score(resultFisherWeight, whichGO = names(pvalFis))
    head(pvalWeight)
    ## GO:0000049 GO:0000149 GO:0000166 GO:0000175 GO:0000217 GO:0000287 
    ##  0.9383324  0.9806822  0.7372081  0.8241708  0.3247581  0.3125296 
    

    可以看一下不同方法的结果相关性:

    cor(pvalFis, pvalWeight)
    ## [1] 0.6590391
    
    plot(pvalFis, pvalWeight, 
         xlab = "p-value classic", ylab = "p-value elim",
         pch = 19, cex = gSize, col = 1:2)
    

    也可以对结果进行简单统计(总共注释的基因,注释的差异基因,最少包含的基因数,包含差异基因的GO term)

    geneData(resultFisherWeight)
    ##   Annotated Significant    NodeSize    SigTerms 
    ##        3875         974           5         905 
    

    4.2 汇总结果

    使用GenTable函数可以对结果进行汇总,参数为toGOdata和toGOresult,及制定排序的列及包含的条目

    allRes <- GenTable(GOdata_score_MF, classic = resultFis, 
                       KS = resultKS, weight = resultFisherWeight,
                       orderBy = "weight", 
                       ranksOf = "classic", topNodes = 20)
    

    4.3 分析单个GO term

    最直观地查看某个GO term是否有差异基因富集的方法就是观察score的密度分布

    ## 选择第一个GO term
    goID <- allRes[1, "GO.ID"]
    print(showGroupDensity(GOdata_score_MF, goID, ranks=TRUE))
    

    如图所示,横坐标表示由score产生的排序位置,纵坐标表示密度

    上图表示注释到该特定GO中的所有基因的密度分布,下图表示除了上图基因之外的其他基因的密度分布,该特定GO中的基因大多分布在p值较低的位置区域,而下方的图表示p值的分布基本都在一个比较均匀的水平,说明富集比较显著

    另一个比较方便的功能是把该GO中的所有基因及其注释信息和p值汇总成表,可以使用whichTerms参数指定GO terms,如果有过个GO则返回一个含有dataframe的list,还可以传递file参数指定输出文件(注:只有该芯片有注释包时才能使用该函数,其余自定义的注释不能使用该函数

    gt <- printGenes(GOdata_score_MF, 
                     whichTerms = goID, 
                     chip = affyLib, 
                     numChar = 40)
    
    

    4.4 可视化GO层级结构

    两个函数实现,一是showSigOfNodes,firstSigNodes表示指定的显著节点数目,useInfo表示每个节点的信息显示包括“def”(GOid及定义文字)和“all”(GOid定义文字,score和注释数目【包括该term的基因在总基因中国的数目,和该term的基因在差异基因中的数目】),其中显著富集的GO用长方形表示,黑色箭头代表is_a关系,红色箭头代表part_a关系,节点的颜色代表其显著性的程度。

    showSigOfNodes(GOdata_score_MF, 
                   score(resultFisherWeight), 
                   firstSigNodes = 5, 
                   useInfo = 'def')
    

    printGraph函数自动输出为pdf文件保存在当前目录,fn.prefix表示输出文件的前缀文字

    printGraph(GOdata_score_MF, resultFis,
               firstSigNodes = 10,
               fn.prefix = "tGO",
               useInfo = "all", 
               pdfSW = TRUE)
    

    本次GO分析的缺陷:
    1.对探针进行GO分析而不是基因名进行GO分析,这样会导致相应GO term的counts数目不正确,如一个探针对应多个基因的情况

    1. 可以通过构建限定level和证据级别的GO注释对象,从而限定level等级和证据等级进行富集分析

    相关文章

      网友评论

        本文标题:GO富集之topGO

        本文链接:https://www.haomeiwen.com/subject/xueplftx.html