现有的基因芯片种类 不要太多了!
但是重要而且常用的芯片 并不多!
一般分析芯片数据都需要把 探针的ID切换成基因的ID,我一般喜欢用基因的entrez ID。
一般有 三种 方法可以得到芯片探针与gene的对应关系
- 金标准当然是去基因芯片的厂商的官网直接去下载啦!!!
- 一种是直接用bioconductor的包
- 一种是从NCBI里面下载文件来解析好!
首先,我们说官网,肯定可以找到,不然这种芯片出来就没有意义了!
然后,我们看看NCBI下载的,会比较大
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6947
这两种方法都比较麻烦,需要一个个的来!
R的bioconductor包来直接得到芯片探针与gene的对应关系!
一般重要的芯片在R的bioconductor里面都是有包的,用一个R包可以批量获取有注释信息的芯片平台,我选取了常见的物种,如下:
gpl | organism | bioc_package | |
---|---|---|---|
1 | GPL32 | Mus musculus | mgu74a |
2 | GPL33 | Mus musculus | mgu74b |
3 | GPL34 | Mus musculus | mgu74c |
6 | GPL74 | Homo sapiens | hcg110 |
7 | GPL75 | Mus musculus | mu11ksuba |
8 | GPL76 | Mus musculus | mu11ksubb |
9 | GPL77 | Mus musculus | mu19ksuba |
10 | GPL78 | Mus musculus | mu19ksubb |
11 | GPL79 | Mus musculus | mu19ksubc |
12 | GPL80 | Homo sapiens | hu6800 |
13 | GPL81 | Mus musculus | mgu74av2 |
14 | GPL82 | Mus musculus | mgu74bv2 |
15 | GPL83 | Mus musculus | mgu74cv2 |
16 | GPL85 | Rattus norvegicus | rgu34a |
17 | GPL86 | Rattus norvegicus | rgu34b |
18 | GPL87 | Rattus norvegicus | rgu34c |
19 | GPL88 | Rattus norvegicus | rnu34 |
20 | GPL89 | Rattus norvegicus | rtu34 |
22 | GPL91 | Homo sapiens | hgu95av2 |
23 | GPL92 | Homo sapiens | hgu95b |
24 | GPL93 | Homo sapiens | hgu95c |
25 | GPL94 | Homo sapiens | hgu95d |
26 | GPL95 | Homo sapiens | hgu95e |
27 | GPL96 | Homo sapiens | hgu133a |
28 | GPL97 | Homo sapiens | hgu133b |
29 | GPL98 | Homo sapiens | hu35ksuba |
30 | GPL99 | Homo sapiens | hu35ksubb |
31 | GPL100 | Homo sapiens | hu35ksubc |
32 | GPL101 | Homo sapiens | hu35ksubd |
36 | GPL201 | Homo sapiens | hgfocus |
37 | GPL339 | Mus musculus | moe430a |
38 | GPL340 | Mus musculus | mouse4302 |
39 | GPL341 | Rattus norvegicus | rae230a |
40 | GPL342 | Rattus norvegicus | rae230b |
41 | GPL570 | Homo sapiens | hgu133plus2 |
42 | GPL571 | Homo sapiens | hgu133a2 |
43 | GPL886 | Homo sapiens | hgug4111a |
44 | GPL887 | Homo sapiens | hgug4110b |
45 | GPL1261 | Mus musculus | mouse430a2 |
49 | GPL1352 | Homo sapiens | u133x3p |
50 | GPL1355 | Rattus norvegicus | rat2302 |
51 | GPL1708 | Homo sapiens | hgug4112a |
54 | GPL2891 | Homo sapiens | h20kcod |
55 | GPL2898 | Rattus norvegicu | adme16cod |
60 | GPL3921 | Homo sapiens | hthgu133a |
63 | GPL4191 | Homo sapiens | h10kcod |
64 | GPL5689 | Homo sapiens | hgug4100a |
65 | GPL6097 | Homo sapiens | illuminaHumanv1 |
66 | GPL6102 | Homo sapiens | illuminaHumanv2 |
67 | GPL6244 | Homo sapiens | hugene10sttranscriptcluster |
68 | GPL6947 | Homo sapiens | illuminaHumanv3 |
69 | GPL8300 | Homo sapiens | hgu95av2 |
70 | GPL8490 | Homo sapiens | IlluminaHumanMethylation27k |
71 | GPL10558 | Homo sapiens | illuminaHumanv4 |
72 | GPL11532 | Homo sapiens | hugene11sttranscriptcluster |
73 | GPL13497 | Homo sapiens | HsAgilentDesign026652 |
74 | GPL13534 | Homo sapiens | IlluminaHumanMethylation450k |
75 | GPL13667 | Homo sapiens | hgu219 |
76 | GPL15380 | Homo sapiens | GGHumanMethCancerPanelv1 |
77 | GPL15396 | Homo sapiens | hthgu133b |
78 | GPL17897 | Homo sapiens | hthgu133a |
这些包首先需要都下载
gpl_info=read.csv("GPL_info.csv",stringsAsFactors = F)
# first download all of the annotation packages from bioconductor
for (i in 1:nrow(gpl_info)){
print(i)
platform=gpl_info[i,4]
platform=gsub('^ ',"",platform) #主要是因为我处理包的字符串前面有空格
#platformDB='hgu95av2.db'
platformDB=paste(platform,".db",sep="")
if( platformDB %in% rownames(installed.packages()) == FALSE) {
BiocInstaller::biocLite(platformDB)
#source("<http://bioconductor.org/biocLite.R>");
#biocLite(platformDB )
}
}
下载完了所有的包, 就可以进行批量导出芯片探针与gene的对应关系!
for (i in 1:nrow(gpl_info)){
print(i)
platform=gpl_info[i,4]
platform=gsub('^ ',"",platform)
#platformDB='hgu95av2.db'
platformDB=paste(platform,".db",sep="")
if( platformDB %in% rownames(installed.packages()) != FALSE) {
library(platformDB,character.only = T)
#tmp=paste('head(mappedkeys(',platform,'ENTREZID))',sep='')
#eval(parse(text = tmp))
#重点在这里,把字符串当做命令运行
all_probe=eval(parse(text = paste('mappedkeys(',platform,'ENTREZID)',sep='')))
EGID <- as.numeric(lookUp(all_probe, platformDB, "ENTREZID"))
#自己把内容写出来即可
}
}
NCBI的GEO数据库下载GPL平台文件
NCBI现有的GPL已经过万了,但是bioconductor的芯片注释包不到一千,虽然bioconductor可以解决我们大部分的需要,比如affymetrix的95,133系列,深圳1.0st系列,HTA2.0系列,但是如果碰到比较生僻的芯片,bioconductor也不会刻意为之制作一个bioconductor的包,这时候就需要自行下载NCBI的GPL信息了,也可以通过R来解决:
本质上是下载一个文件,读进R里面,然后解析行列式,得到芯片探针与基因的对应关系,看下面的代码,你就能理解了。
## A-AGIL-28 - Agilent Whole Human Genome Microarray 4x44K 014850 G4112F (85 cols x 532 rows)
library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl <- getGEO('GPL6480', destdir=".")
colnames(Table(gpl)) ## [1] 41108 17
head(Table(gpl)[,c(1,6,7)]) ## you need to check this , which column do you need
write.csv(Table(gpl)[,c(1,6,7)],"GPL6400.csv")
#platformDB='hgu133plus2.db'
#library(platformDB, character.only=TRUE)
probeset <- featureNames(GSE32575[[1]])
library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl <- getGEO('GPL6102', destdir=".")
colnames(Table(gpl)) ## [1] 41108 17
head(Table(gpl)[,c(1,10,13)]) ## you need to check this , which column do you need
probe2symbol=Table(gpl)[,c(1,13)]
## GPL15207 [PrimeView] Affymetrix Human Gene Expression Array
probeset <- featureNames(GSE58979[[1]])
library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl <- getGEO('GPL15207', destdir=".")
colnames(Table(gpl)) ## [1] 49395 24
head(Table(gpl)[,c(1,15,19)]) ## you need to check this , which column do you need
probe2symbol=Table(gpl)[,c(1,15)]
## GPL10558 Illumina HumanHT-12 V4.0 expression beadchip
library(Biobase)
library(GEOquery)
#Download GPL file, put it in the current directory, and load it:
gpl <- getGEO('GPL10558', destdir=".")
colnames(Table(gpl)) ## [1] 41108 17
head(Table(gpl)[,c(1,10,13)]) ## you need to check this , which column do you need
probe2symbol=Table(gpl)[,c(1,13)]
网友评论