一个有点难的探针注释
0.起因
芯片的探针注释指的是探针与基因之间的对应关系,有四种不同的获取方法:
- Biocoductor的注释包
- GPL的表格文件解析
- 官网下载对应产品的注释表格
- 自主注释
善良的曾老板写了一个包,叫AnnoProbe,也可以快速获取探针注释,不仅把1~2的部分注释囊括在内,还做了一些平台的自主注释。结合tinyarray的函数find_anno,直接写出现成的代码:
例如GPL6244
if(!require(AnnoProbe))install.packages("AnnoProbe")
if(!require(tinyarray))install.packages("tinyarray")
library(tinyarray)
find_anno("GPL6244")
## [1] "`library(hugene10sttranscriptcluster.db);ids <- toTable(hugene10sttranscriptclusterSYMBOL)` and `ids <- AnnoProbe::idmap('GPL6244')` are both avaliable"
两句代码都可以得到它的注释。
当然也不是所有平台都能找到,比如瞎写一个。
find_anno("GPL624")
## [1] "no annotation avliable in Bioconductor and AnnoProbe"
这个号码是瞎编的。如果你找的时候得到这种反馈,而编号没有错,那就是没有,认命吧您呐。
但是呢。我的一个学生富有探索精神,他想要从GPL表格里提取,一方面因为想练练手,另一方面确实有些探针注释是这样组织的,查不到其他渠道的注释。
那我要是不惯着他能行吗~搞!
1.端详数据
↓此文件来自GEO数据库的GPL6244页面表格。
ta = read.delim("GPL6244-17930.txt",
check.names = F,
comment.char = "#")
colnames(ta)
## [1] "ID" "GB_LIST" "SPOT_ID"
## [4] "seqname" "RANGE_GB" "RANGE_STRAND"
## [7] "RANGE_START" "RANGE_STOP" "total_probes"
## [10] "gene_assignment" "mrna_assignment" "category"
head(ta$gene_assignment,3)
## [1] "---"
## [2] "ENST00000328113 // OR4G2P // olfactory receptor, family 4, subfamily G, member 2 pseudogene // --- // --- /// ENST00000492842 // OR4G11P // olfactory receptor, family 4, subfamily G, member 11 pseudogene // --- // --- /// ENST00000588632 // OR4G1P // olfactory receptor, family 4, subfamily G, member 1 pseudogene // --- // ---"
## [3] "NM_001004195 // OR4F4 // olfactory receptor, family 4, subfamily F, member 4 // 15q26.3 // 26682 /// NM_001005240 // OR4F17 // olfactory receptor, family 4, subfamily F, member 17 // 19p13.3 // 81099 /// NM_001005484 // OR4F5 // olfactory receptor, family 4, subfamily F, member 5 // 1p36.33 // 79501 /// ENST00000318050 // OR4F17 // olfactory receptor, family 4, subfamily F, member 17 // 19p13.3 // 81099 /// ENST00000326183 // OR4F4 // olfactory receptor, family 4, subfamily F, member 4 // 15q26.3 // 26682 /// ENST00000335137 // OR4F5 // olfactory receptor, family 4, subfamily F, member 5 // 1p36.33 // 79501 /// ENST00000585993 // OR4F17 // olfactory receptor, family 4, subfamily F, member 17 // 19p13.3 // 81099 /// BC136848 // OR4F17 // olfactory receptor, family 4, subfamily F, member 17 // 19p13.3 // 81099 /// BC136867 // OR4F17 // olfactory receptor, family 4, subfamily F, member 17 // 19p13.3 // 81099 /// BC136907 // OR4F4 // olfactory receptor, family 4, subfamily F, member 4 // 15q26.3 // 26682 /// BC136908 // OR4F4 // olfactory receptor, family 4, subfamily F, member 4 // 15q26.3 // 26682"
2. 拆分那一列
可以看到,这一列里包含的信息比较凌乱,其中第二部分就是gene_symbol。需要想办法提取出来,并分割开来
ids = ta[,c("ID","gene_assignment")]
library(stringr)
a = str_split(ids$gene_assignment," // ",simplify = T)
head(a[,1:3])
## [,1] [,2]
## [1,] "---" ""
## [2,] "ENST00000328113" "OR4G2P"
## [3,] "NM_001004195" "OR4F4"
## [4,] "NR_024437" "LOC728323"
## [5,] "NM_001005221" "OR4F29"
## [6,] "ENST00000387377" "MT-TM"
## [,3]
## [1,] ""
## [2,] "olfactory receptor, family 4, subfamily G, member 2 pseudogene"
## [3,] "olfactory receptor, family 4, subfamily F, member 4"
## [4,] "uncharacterized LOC728323"
## [5,] "olfactory receptor, family 4, subfamily F, member 29"
## [6,] "mitochondrially encoded tRNA methionine"
ids = cbind(ids,symbol = a[,2])
3. 最后整理
写着---的是没有对应到基因的探针,需要去掉。
library(dplyr)
kk = ids$gene_assignment !="---";table(kk)
## kk
## FALSE TRUE
## 8004 25293
ids = ids[kk,]
保留探针与基因的两列就可以了。
ids = ids[,c(1,3)]
colnames(ids) = c("probe_id","symbol")
ids$probe_id = as.character(ids$probe_id)
head(ids)
## probe_id symbol
## 2 7896738 OR4G2P
## 3 7896740 OR4F4
## 4 7896742 LOC728323
## 5 7896744 OR4F29
## 6 7896746 MT-TM
## 7 7896748 MT-TW
nrow(ids)
## [1] 25293
网友评论