
One R function to solve biological ID conversion

Author: 小光amateur | Published: 2020-01-19 16:59

Preface:

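Biological ID conversion is a problem we run into constantly when processing biological data. There are usually two approaches: one is to use an online service, the best known being biomart and db2db; the other is to use local software such as clusterProfiler::bitr.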

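Online conversion is cumbersome: files have to be uploaded and downloaded, and the results need a second round of processing. It also becomes impractical when many conversions are needed. Local conversion relies on databases that are updated slowly, so many IDs cannot be converted, and the range of supported conversions is small.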

As a simple example, this project contains a sample file, test_name.txt, which holds 100 Ensembl transcript IDs. For downstream analysis they have to be converted to Gene Symbols.

With the bitr function, only a small fraction of the IDs can be mapped:

library(clusterProfiler)
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
# [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
# [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
#[13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
#[19] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
#[25] "UNIGENE"      "UNIPROT" 
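# read the 100 example transcript IDs into data$gene
data <- read.table("test_name.txt", header = FALSE, stringsAsFactors = FALSE)
colnames(data) <- "gene"
result <- bitr(data$gene, fromType = "ENSEMBLTRANS", toType = "SYMBOL", OrgDb = org.Hs.eg.db)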
#'select()' returned 1:1 mapping between keys and columns
#Warning message:
#In bitr(data$gene, fromType = "ENSEMBLTRANS", toType = "SYMBOL",  :
#  84% of input gene IDs are fail to map...
head(result)
#      ENSEMBLTRANS    SYMBOL
#7  ENST00000418724    ZBTB22
#17 ENST00000374458    GGNBP1
#21 ENST00000588265     FXYD7
#25 ENST00000458629     CXCR6
#34 ENST00000595168 LOC400499
#38 ENST00000368547     ECHS1
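
To see which IDs were lost, rather than only the percentage reported in the warning, the input can be compared against the mapping result. The following is a minimal sketch in base R. The vector `ids` and the small `result` data frame are toy stand-ins (two rows copied from the output above plus two made-up placeholder IDs), not the real contents of test_name.txt.

```r
# Toy stand-ins: the first two IDs come from the bitr output above,
# the last two are made-up placeholders for IDs that did not map.
ids <- c("ENST00000418724", "ENST00000374458",
         "ENST_PLACEHOLDER_1", "ENST_PLACEHOLDER_2")
result <- data.frame(
  ENSEMBLTRANS = c("ENST00000418724", "ENST00000374458"),
  SYMBOL       = c("ZBTB22", "GGNBP1"),
  stringsAsFactors = FALSE
)

# IDs present in the input but absent from the mapping result
unmapped <- setdiff(ids, result$ENSEMBLTRANS)

# Fraction of the distinct input IDs that failed to map
fail_rate <- length(unmapped) / length(unique(ids))

print(unmapped)
print(fail_rate)
```

With the real data, `ids` would be `data$gene` and `result` the data frame returned by bitr.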

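However, if we look up the same IDs on the bioDBnet website, only 2 of them fail to match. I therefore wanted to wrap the site's API in a function, to avoid the drawbacks of online conversion and make the conversion more efficient.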
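
As a rough illustration of what wrapping the API can look like, the sketch below only builds a request URL and makes no network call. The helper `build_db2db_url` is hypothetical and is not taken from bitr_db2db.R; the base URL and the parameter names (`method`, `format`, `input`, `inputValues`, `outputs`, `taxonId`) are my assumptions about bioDBnet's REST interface and should be checked against the site's own documentation before use.

```r
# Hypothetical helper (not part of bitr_db2db.R): build a db2db request URL.
# The base URL and the parameter names are ASSUMPTIONS about bioDBnet's REST
# interface; verify them against the site's documentation.
build_db2db_url <- function(input, values, outputs, taxon_id = 9606,
    base = "https://biodbnet-abcc.ncifcrf.gov/webServices/rest.php/biodbnetRestApi.json") {
  params <- c(
    method      = "db2db",
    format      = "row",
    input       = input,
    inputValues = paste(values, collapse = ","),
    outputs     = paste(outputs, collapse = ","),
    taxonId     = as.character(taxon_id)
  )
  # Percent-encode each value so that spaces in names such as
  # "Gene Symbol" survive in the query string
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

url <- build_db2db_url("Ensembl Transcript ID",
                       c("ENST00000418724", "ENST00000374458"),
                       "Gene Symbol")
print(url)
```

The resulting string could then be fetched with RCurl or httr and parsed with rjson, the packages loaded in the usage section; how the actual function does this may differ.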

Usage

library(RCurl)
# library(httr)
## on Windows, use httr instead of RCurl
library(rjson)
library(tidyr)
## load the function (assumes bitr_db2db.R is in the working directory)
source("bitr_db2db.R")
## read the example data
data <- read.table("test_name.txt", header = FALSE, stringsAsFactors = FALSE)
colnames(data) <- "gene"

## pass "getinputs" as the first argument to list every supported input type
bitr_db2db("getinputs")
# [1] "Affy GeneChip Array"          "Affy ID"                      "Affy Transcript Cluster ID"  
# [4] "Agilent ID"                   "Biocarta Pathway Name"        "CodeLink ID"                 
# [7] "dbSNP ID"                     "DrugBank Drug ID"             "DrugBank Drug Name"          
#[10] "EC Number"                    "Ensembl Gene ID"              "Ensembl Protein ID"
# ........

## pass "getoutputsforinput" and an input type to list the output types available for it
bitr_db2db("getoutputsforinput","Ensembl Transcript ID")

#  [1] "Affy ID"                        "Agilent ID"                     "Allergome Code"                
#  [4] "ApiDB_CryptoDB ID"              "Biocarta Pathway Name"          "BioCyc ID"                     
#  [7] "CCDS ID"                        "Chromosomal Location"           "CleanEx ID"                    
# [10] "CodeLink ID"                    "COSMIC ID"                      "CPDB Protein Interactor"  
# ....

## to convert Ensembl transcript IDs to gene symbols, run:
haha<-bitr_db2db("","Ensembl Transcript ID",data$gene,"Gene Symbol")
#[1] "your id have 1:1 mapping!"
head(haha)
#             from      to
#1 ENST00000532435   GDPD5
#2 ENST00000513185    RGMB
#3 ENST00000569370 CIAPIN1
#4 ENST00000451562    PPIA
#5 ENST00000289865   USP21
#6 ENST00000409411   PREPL

# a one-to-many conversion, for example the gene HTT
haha2<-bitr_db2db("","Gene Symbol","HTT","Ensembl Transcript ID")
#[1] "waring:your id have more than one mapping!"
head(haha2)
#  from                 to
#1  HTT ENSCJAT00000027377
#2  HTT ENSBMUT00000040493
#3  HTT ENSBMUT00000040501
#4  HTT ENSBMUT00000040486
#5  HTT ENSBMUT00000040494
#6  HTT ENSBMUT00000040497
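
When a conversion is one-to-many, it often helps to see which input IDs are affected and to collapse the result to one row per input. The sketch below does this in base R on a toy data frame in the same from/to layout as above. The three HTT rows are copied from the output; the `GENE_X` row is a made-up placeholder added only so that there is a one-to-one case to compare against.

```r
# Toy result in the same from/to layout as the output above.
# The HTT rows are copied from it; GENE_X / TRANSCRIPT_X are made-up
# placeholders standing in for an ID with a single mapping.
res <- data.frame(
  from = c("HTT", "HTT", "HTT", "GENE_X"),
  to   = c("ENSCJAT00000027377", "ENSBMUT00000040493",
           "ENSBMUT00000040501", "TRANSCRIPT_X"),
  stringsAsFactors = FALSE
)

# Number of distinct targets per input ID
n_targets <- tapply(res$to, res$from, function(x) length(unique(x)))

# Input IDs that map to more than one target
multi <- names(n_targets)[n_targets > 1]

# One row per input ID, with all of its targets joined by ";"
collapsed <- aggregate(to ~ from, data = res,
                       FUN = function(x) paste(unique(x), collapse = ";"))

print(multi)
print(collapsed)
```

Whether to collapse, keep all rows, or pick a single target depends on the downstream analysis; this only shows one way to inspect the result.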

Finally, the code is in bitr_db2db.R

Note: if you are on Windows and run into errors, try bitr_db2db_forwindows.R instead.
