Three Ways to Get the Probe-to-Gene Mapping of a Microarray with R


Author: 因地制宜的生信达人 | Source: published 2018-01-17 20:44

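    There are a great many microarray (gene chip) platforms out there,
    but only a handful are both important and commonly used.
    Analysing chip data generally requires converting probe IDs to gene IDs; I usually prefer Entrez gene IDs.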
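
    As a minimal sketch of what this conversion amounts to, the base R snippet below relabels the rows of an expression matrix using a probe-to-gene table. Everything in it (the probe names, the sample names, the values and the IDs) is made-up toy data for illustration, not taken from any real platform.

```r
# Toy expression matrix: rows are probes, columns are samples (made-up values).
expr <- matrix(
  c(5.1, 6.3,
    7.2, 7.9,
    3.4, 2.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("probe_1", "probe_2", "probe_3"), c("sample_A", "sample_B"))
)

# Toy probe-to-gene table; probe_3 has no gene assigned (illustrative IDs only).
probe2gene <- data.frame(
  probe_id  = c("probe_1", "probe_2", "probe_3"),
  entrez_id = c("7157", "672", NA),
  stringsAsFactors = FALSE
)

# Look up the gene for each row of the matrix, then drop unannotated probes.
gene_ids <- probe2gene$entrez_id[match(rownames(expr), probe2gene$probe_id)]
keep <- !is.na(gene_ids)
expr_by_gene <- expr[keep, , drop = FALSE]
rownames(expr_by_gene) <- gene_ids[keep]

print(expr_by_gene)
```

    In real data several probes often map to the same gene, so an extra step (for example keeping one probe per gene) is usually needed; that choice is left out of this sketch.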

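    There are generally three ways to obtain the probe-to-gene mapping of a chip:

    • The gold standard, of course, is to download it directly from the chip manufacturer's official website.
    • Use a Bioconductor annotation package.
    • Download the platform file from NCBI and parse it.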

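    First, the manufacturer's website: the annotation can certainly be found there, otherwise the chip would be pointless.

    Next, the files downloaded from NCBI, which tend to be rather large, for example:
    http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6947

    Both of these approaches are tedious, because every platform has to be handled one at a time.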

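    Getting the probe-to-gene mapping directly from R Bioconductor packages

    Most important chips have an annotation package in Bioconductor, and a single R package can list, in bulk, the chip platforms that have annotation available. I picked out the common species, as follows: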

    gpl organism bioc_package
    1 GPL32 Mus musculus mgu74a
    2 GPL33 Mus musculus mgu74b
    3 GPL34 Mus musculus mgu74c
    6 GPL74 Homo sapiens hcg110
    7 GPL75 Mus musculus mu11ksuba
    8 GPL76 Mus musculus mu11ksubb
    9 GPL77 Mus musculus mu19ksuba
    10 GPL78 Mus musculus mu19ksubb
    11 GPL79 Mus musculus mu19ksubc
    12 GPL80 Homo sapiens hu6800
    13 GPL81 Mus musculus mgu74av2
    14 GPL82 Mus musculus mgu74bv2
    15 GPL83 Mus musculus mgu74cv2
    16 GPL85 Rattus norvegicus rgu34a
    17 GPL86 Rattus norvegicus rgu34b
    18 GPL87 Rattus norvegicus rgu34c
    19 GPL88 Rattus norvegicus rnu34
    20 GPL89 Rattus norvegicus rtu34
    22 GPL91 Homo sapiens hgu95av2
    23 GPL92 Homo sapiens hgu95b
    24 GPL93 Homo sapiens hgu95c
    25 GPL94 Homo sapiens hgu95d
    26 GPL95 Homo sapiens hgu95e
    27 GPL96 Homo sapiens hgu133a
    28 GPL97 Homo sapiens hgu133b
    29 GPL98 Homo sapiens hu35ksuba
    30 GPL99 Homo sapiens hu35ksubb
    31 GPL100 Homo sapiens hu35ksubc
    32 GPL101 Homo sapiens hu35ksubd
    36 GPL201 Homo sapiens hgfocus
    37 GPL339 Mus musculus moe430a
    38 GPL340 Mus musculus mouse4302
    39 GPL341 Rattus norvegicus rae230a
    40 GPL342 Rattus norvegicus rae230b
    41 GPL570 Homo sapiens hgu133plus2
    42 GPL571 Homo sapiens hgu133a2
    43 GPL886 Homo sapiens hgug4111a
    44 GPL887 Homo sapiens hgug4110b
    45 GPL1261 Mus musculus mouse430a2
    49 GPL1352 Homo sapiens u133x3p
    50 GPL1355 Rattus norvegicus rat2302
    51 GPL1708 Homo sapiens hgug4112a
    54 GPL2891 Homo sapiens h20kcod
    55 GPL2898 Rattus norvegicus adme16cod
    60 GPL3921 Homo sapiens hthgu133a
    63 GPL4191 Homo sapiens h10kcod
    64 GPL5689 Homo sapiens hgug4100a
    65 GPL6097 Homo sapiens illuminaHumanv1
    66 GPL6102 Homo sapiens illuminaHumanv2
    67 GPL6244 Homo sapiens hugene10sttranscriptcluster
    68 GPL6947 Homo sapiens illuminaHumanv3
    69 GPL8300 Homo sapiens hgu95av2
    70 GPL8490 Homo sapiens IlluminaHumanMethylation27k
    71 GPL10558 Homo sapiens illuminaHumanv4
    72 GPL11532 Homo sapiens hugene11sttranscriptcluster
    73 GPL13497 Homo sapiens HsAgilentDesign026652
    74 GPL13534 Homo sapiens IlluminaHumanMethylation450k
    75 GPL13667 Homo sapiens hgu219
    76 GPL15380 Homo sapiens GGHumanMethCancerPanelv1
    77 GPL15396 Homo sapiens hthgu133b
    78 GPL17897 Homo sapiens hthgu133a

    All of these packages have to be installed first:

    gpl_info <- read.csv("GPL_info.csv", stringsAsFactors = FALSE)
    # First install all of the annotation packages from Bioconductor.
    # BiocInstaller::biocLite() has been retired; BiocManager::install() replaces it.
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    for (i in seq_len(nrow(gpl_info))) {
      print(i)
      platform <- gpl_info[i, 4]            # column 4 is expected to hold bioc_package
      platform <- gsub("^ ", "", platform)  # my package-name strings had a leading space
      # e.g. platformDB <- "hgu95av2.db"
      platformDB <- paste(platform, ".db", sep = "")
      if (!(platformDB %in% rownames(installed.packages()))) {
        BiocManager::install(platformDB)
      }
    }
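
    The name clean-up and the "is it installed?" check in the loop above can also be written in vectorised form. The following is only a sketch: the three package names are sample values copied from the table, and the variable names are my own.

```r
# Package names as they might appear in the bioc_package column,
# some with a leading space (sample values taken from the table above).
bioc_names <- c(" hgu95av2", "hgu133a", " illuminaHumanv3")

# trimws() removes leading and trailing whitespace in one step.
pkgs <- paste0(trimws(bioc_names), ".db")

# requireNamespace() returns FALSE (instead of raising an error) for a package
# that is not installed, so it can be used to list what still needs installing.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

print(pkgs)
print(to_install)
```

    The resulting `to_install` vector could then be passed to `BiocManager::install()` in a single call rather than installing the packages one by one.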
    

    Once all the packages are installed, the probe-to-gene mappings can be exported in bulk:

    library(annotate)  # provides lookUp()
    for (i in seq_len(nrow(gpl_info))) {
      print(i)
      platform <- gpl_info[i, 4]
      platform <- gsub("^ ", "", platform)
      # e.g. platformDB <- "hgu95av2.db"
      platformDB <- paste(platform, ".db", sep = "")
      if (platformDB %in% rownames(installed.packages())) {
        library(platformDB, character.only = TRUE)
        # The key step: build the command as a string and run it,
        # e.g. mappedkeys(hgu95av2ENTREZID)
        all_probe <- eval(parse(text = paste("mappedkeys(", platform, "ENTREZID)", sep = "")))
        EGID <- as.numeric(lookUp(all_probe, platformDB, "ENTREZID"))
        # Write all_probe and EGID out to a file yourself as needed
      }
    }
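
    The "run a string as a command" trick and the final "write it out yourself" step can be tried without any annotation package installed. In the sketch below, `demochip` and `demochipENTREZID` are hypothetical stand-ins (a plain named vector, not a real annotation map), and `get()` is shown as an alternative to `eval(parse())` that I am suggesting, not something the loop above uses.

```r
# A hypothetical platform name and a stand-in object following the
# <platform>ENTREZID naming pattern. This is a plain named vector, not a real map.
platform <- "demochip"
demochipENTREZID <- c(probe_a = "101", probe_b = "102")

# Approach used above: build the command as text, then parse and evaluate it.
via_eval <- eval(parse(text = paste("names(", platform, "ENTREZID)", sep = "")))

# Alternative: fetch the object by its name with get(), then call the function.
via_get <- names(get(paste0(platform, "ENTREZID")))

# Collect the mapping in a data frame and write it to a CSV file.
mapping <- data.frame(
  probe_id  = via_get,
  entrez_id = unname(demochipENTREZID[via_get]),
  stringsAsFactors = FALSE
)
out_file <- file.path(tempdir(), paste0(platform, "_probe2entrez.csv"))
write.csv(mapping, out_file, row.names = FALSE)
print(mapping)
```

    Both approaches return the same probe IDs here; `get()` simply avoids assembling code as text.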
    

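    Downloading GPL platform files from NCBI's GEO database

    NCBI now holds more than ten thousand GPL platforms, while Bioconductor has fewer than a thousand chip annotation packages. Bioconductor covers most of what we need, for example the Affymetrix 95 and 133 series, the Gene 1.0 ST series and the HTA 2.0 series, but nobody is going to build a Bioconductor package specially for an obscure chip. In that case you have to download the GPL information from NCBI yourself, which can also be done in R:

    In essence, you download a file, read it into R, and parse its rows and columns to get the probe-to-gene mapping. The code below should make this clear.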

    library(Biobase)
    library(GEOquery)

    ## A-AGIL-28 - Agilent Whole Human Genome Microarray 4x44K 014850 G4112F (85 cols x 532 rows)
    # Download the GPL file into the current directory and load it:
    gpl <- getGEO('GPL6480', destdir=".")
    dim(Table(gpl))      ## [1] 41108 17
    colnames(Table(gpl)) ## inspect the column names
    head(Table(gpl)[,c(1,6,7)]) ## check for yourself which columns you need
    write.csv(Table(gpl)[,c(1,6,7)],"GPL6480.csv")

    ## GPL6102 (illuminaHumanv2 in the table above)
    # The GSE object has to be downloaded before its probe IDs can be read;
    # getGEO() on a GSE accession returns a list of ExpressionSet objects.
    GSE32575 <- getGEO('GSE32575', destdir=".")
    probeset <- featureNames(GSE32575[[1]])
    gpl <- getGEO('GPL6102', destdir=".")
    colnames(Table(gpl)) ## inspect the column names
    head(Table(gpl)[,c(1,10,13)]) ## check for yourself which columns you need
    probe2symbol <- Table(gpl)[,c(1,13)]

    ## GPL15207 [PrimeView] Affymetrix Human Gene Expression Array
    GSE58979 <- getGEO('GSE58979', destdir=".")
    probeset <- featureNames(GSE58979[[1]])
    gpl <- getGEO('GPL15207', destdir=".")
    dim(Table(gpl))      ## [1] 49395 24
    colnames(Table(gpl)) ## inspect the column names
    head(Table(gpl)[,c(1,15,19)]) ## check for yourself which columns you need
    probe2symbol <- Table(gpl)[,c(1,15)]

    ## GPL10558 Illumina HumanHT-12 V4.0 expression beadchip
    gpl <- getGEO('GPL10558', destdir=".")
    colnames(Table(gpl)) ## inspect the column names
    head(Table(gpl)[,c(1,10,13)]) ## check for yourself which columns you need
    probe2symbol <- Table(gpl)[,c(1,13)]
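
    Because the column positions differ from platform to platform, it can be safer to pick columns by name and clean the result before using it. The sketch below works on a toy stand-in for `Table(gpl)`; the column names `ID` and `Symbol`, the probe names and the ` /// ` separator are assumptions for illustration, so check `colnames(Table(gpl))` for the platform you actually use.

```r
# Toy stand-in for Table(gpl); the column names "ID" and "Symbol" are
# assumptions and differ between platforms.
gpl_table <- data.frame(
  ID     = c("p1", "p2", "p3", "p4"),
  Symbol = c("TP53", "", "BRCA1 /// NBR2", NA),
  stringsAsFactors = FALSE
)

# Select columns by name so the code does not depend on column order.
probe2symbol <- gpl_table[, c("ID", "Symbol")]

# Drop probes with a missing or empty symbol.
has_symbol <- !is.na(probe2symbol$Symbol) & probe2symbol$Symbol != ""
probe2symbol <- probe2symbol[has_symbol, ]

# Some tables list several symbols per probe, here separated by " /// ";
# keep only the first one.
probe2symbol$Symbol <- sub(" ///.*$", "", probe2symbol$Symbol)
rownames(probe2symbol) <- NULL

print(probe2symbol)
```

    Keeping only the first symbol is just one possible policy; dropping multi-gene probes altogether would be an equally reasonable choice.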
    
