ID conversion with biomaRt

Author: 一只烟酒僧 | Published: 2020-05-20 12:22


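biomaRt user guide: http://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
Before you start: read the help pages for listEnsembl() and useMart() before calling them. By default they connect to the current Ensembl release, which for human is built on GRCh38 (hg38), while many IDs still in circulation come from the previous assembly, GRCh37 (hg19). To work with those IDs, connect to the archived host http://grch37.ensembl.org/ instead of the default host, www.ensembl.org.
On the difference between the two assemblies: the current Ensembl release has retired some of the IDs that exist in GRCh37. Some were given new IDs and others were merged into other gene records, so that a former official name survives only as an alias. Check which assembly your IDs belong to before converting them.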
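As a minimal sketch of the host choice described above, the helper below maps a genome build to its BioMart host and opens the human gene dataset there. The names `ensembl_host()` and `connect_human_mart()` are hypothetical helpers introduced for illustration, and the `https://` scheme is an assumption based on recent biomaRt versions. `useMart()` and its `biomart`, `dataset` and `host` arguments come from biomaRt itself; the actual connection needs network access.

```r
# Hypothetical helper: pick the BioMart host for a human genome build.
ensembl_host <- function(build = c("GRCh38", "GRCh37")) {
  build <- match.arg(build)
  switch(build,
         GRCh38 = "https://www.ensembl.org",     # current release (default host)
         GRCh37 = "https://grch37.ensembl.org")  # archived GRCh37 (hg19) site
}

# Hypothetical helper: connect to the human gene dataset on the chosen build.
# Requires the biomaRt package and a working internet connection.
connect_human_mart <- function(build = "GRCh38") {
  biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                   dataset = "hsapiens_gene_ensembl",
                   host    = ensembl_host(build))
}

# Usage (not run here because it contacts the Ensembl servers):
# ensembl_grch37 <- connect_human_mart("GRCh37")
```

biomaRt also provides useEnsembl(), whose GRCh argument selects the archived assembly directly, for example useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", GRCh = 37).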

######################################################## 
#-------------------------------------------------------
# Topic: Using biomaRt
# Author:Wang Haiquan
# Date:Wed May 20 11:11:39 2020
# Mail:mg1835020@smail.nju.edu.cn
#-------------------------------------------------------
########################################################

library(biomaRt)
listMarts()  # check that BioMart is reachable and list the available web services
ensembl1 <- useMart("ensembl")
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")  # choose a mart; the name must be one returned by listMarts()
identical(ensembl, ensembl1)  # passing "ensembl" to useMart() is equivalent to passing "ENSEMBL_MART_ENSEMBL"
listDatasets(ensembl)  # list the datasets in the chosen mart; in Ensembl release 100, ENSEMBL_MART_ENSEMBL holds 203 datasets
head(listDatasets(ensembl))
#    dataset (species dataset)            description (species name and reference assembly)  version (reference assembly version)
# 1   acalliptera_gene_ensembl             Eastern happy genes (fAstCal1.2)             fAstCal1.2
# 2 acarolinensis_gene_ensembl               Anole lizard genes (AnoCar2.0)              AnoCar2.0
# 3  acchrysaetos_gene_ensembl              Golden eagle genes (bAquChr1.2)             bAquChr1.2
# 4  acitrinellus_gene_ensembl               Midas cichlid genes (Midas_v5)               Midas_v5
# 5  amelanoleuca_gene_ensembl                        Panda genes (ailMel1)                ailMel1
# 6    amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0) Astyanax_mexicanus-2.0
# Mouse dataset: mmusculus_gene_ensembl, reference assembly GRCm38.p6
# Human dataset: hsapiens_gene_ensembl, reference assembly GRCh38.p13
ensembl <- useDataset("hsapiens_gene_ensembl",mart = ensembl)

#-------------------------------------------------------
#Function: ID conversion with getBM()
#-------------------------------------------------------

# getBM() takes four main arguments
#attributes : is a vector of attributes that one wants to retrieve (=the output of the query)
# The ID types to convert to; listAttributes(ensembl) shows the available ones
attributes<-listAttributes(ensembl)
head(attributes)
#                            name                  description         page
# 1               ensembl_gene_id               Gene stable ID feature_page
# 2       ensembl_gene_id_version       Gene stable ID version feature_page
# 3         ensembl_transcript_id         Transcript stable ID feature_page
# 4 ensembl_transcript_id_version Transcript stable ID version feature_page
# 5            ensembl_peptide_id            Protein stable ID feature_page
# 6    ensembl_peptide_id_version    Protein stable ID version feature_page
#filters : is a vector of filters that one will use as input to the query
# Restricts which IDs are queried. For example, to convert only genes on the X chromosome, use the chromosome_name filter with the value "X"; listFilters(ensembl) shows the available filters
filters = listFilters(ensembl)
head(filters)
#                 name                            description
# 1    chromosome_name               Chromosome/scaffold name
# 2              start                                  Start
# 3                end                                    End
# 4             strand                                 Strand
# 5 chromosomal_region               e.g. 1:100:10000:-1, 1:100000:200000:1
# 6          with_ccds                        With CCDS ID(s)

# values : a vector of values for the filters. In case multiple filters are
# in use, the values argument requires a list of values where each
# position in the list corresponds to the position of the filters in the
# filters argument (see examples below)
# The IDs to query; for example, when converting probes to gene names, the values are the probe IDs
# mart : is an object of class Mart , which is created by the
# useMart() function
# The mart to query, i.e. the ensembl object created above
values = c("202763_at","209310_s_at","207500_at")
attributes = c('affy_hg_u133_plus_2', 'entrezgene_id')
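filters = 'affy_hg_u133_plus_2'  # the input IDs are probes from this array
mart = ensembl
new_id <- getBM(attributes = attributes, filters = filters, values = values, mart = mart)  # note: the query may fail because of network problems; select() (last section) is an alternative interface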
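The description of `values` above says that several filters need a list of values, one element per filter. A hedged sketch of such a query follows: it restricts the probe lookup to a single chromosome. The wrapper name `query_probes_on_chromosome()` is hypothetical, and chromosome "16" in the usage line is chosen only for illustration, not as a claim about where these probes map. The function expects a Mart object such as the `ensembl` object created above.

```r
# Hypothetical wrapper: map Affymetrix probes to Entrez IDs, keeping only
# genes on one chromosome. `mart` is a Mart object such as `ensembl` above.
query_probes_on_chromosome <- function(probes, chromosome, mart) {
  filters <- c("affy_hg_u133_plus_2", "chromosome_name")
  # One list element per filter, in the same order as `filters`.
  values <- list(probes, chromosome)
  stopifnot(length(filters) == length(values))
  biomaRt::getBM(attributes = c("affy_hg_u133_plus_2", "entrezgene_id",
                                "chromosome_name"),
                 filters    = filters,
                 values     = values,
                 mart       = mart)
}

# Usage (contacts the Ensembl servers, so it is not run here):
# query_probes_on_chromosome(c("202763_at", "209310_s_at", "207500_at"),
#                            chromosome = "16", mart = ensembl)
```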
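Note: getBM() does not return rows in the order of the input IDs. To make the result verifiable, include the ID type used as the filter among the attributes as well: here the filter is affy_hg_u133_plus_2, so affy_hg_u133_plus_2 should also be listed in attributes. getBM() also drops duplicated rows by default (uniqueRows = TRUE); this can be switched off: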
new_id <- getBM(attributes = c('affy_hg_u133_plus_2', 'entrezgene_id'),
                filters = 'affy_hg_u133_plus_2',
                values = c("202763_at", "209310_s_at", "207500_at"),
                mart = mart,
                uniqueRows = FALSE)
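Because the returned rows do not follow the input order, one way to line the result up with the query again is a left join against the input vector. The sketch below uses a small hand-made data frame in place of real getBM() output so that it runs offline; the Entrez IDs in it are placeholders, not real annotations, and `align_to_input()` is a hypothetical helper name.

```r
# Hypothetical helper: put query results back into input order.
# IDs with several hits keep all their rows; IDs with no hit get one NA row.
align_to_input <- function(ids, result, id_col) {
  query <- data.frame(id = ids, input_order = seq_along(ids),
                      stringsAsFactors = FALSE)
  names(query)[1] <- id_col
  # Left join: keep every input ID even when the query returned nothing for it.
  merged <- merge(query, result, by = id_col, all.x = TRUE, sort = FALSE)
  merged <- merged[order(merged$input_order), , drop = FALSE]
  rownames(merged) <- NULL
  merged[, setdiff(names(merged), "input_order"), drop = FALSE]
}

# Placeholder data standing in for getBM() output (IDs are made up).
mock_result <- data.frame(
  affy_hg_u133_plus_2 = c("207500_at", "202763_at", "209310_s_at", "209310_s_at"),
  entrezgene_id       = c(3L, 1L, 2L, 22L),
  stringsAsFactors    = FALSE
)
probe_ids <- c("202763_at", "209310_s_at", "207500_at", "not_a_probe")

aligned <- align_to_input(probe_ids, mock_result, "affy_hg_u133_plus_2")
print(aligned)

# Input IDs that produced no mapping at all.
unmapped <- aligned$affy_hg_u133_plus_2[is.na(aligned$entrezgene_id)]
print(unmapped)
```

Keeping the unmatched IDs as NA rows makes it easy to see which inputs the database could not convert, which is useful when the IDs may come from an older assembly.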

#-------------------------------------------------------
#Function: the dataset search functions
#-------------------------------------------------------
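# There are three search functions: searchDatasets(), searchAttributes() and searchFilters()
# They look up dataset names, attribute names and filter names respectively
# Each takes two arguments, mart and pattern; pattern accepts a regular expression
searchDatasets(mart = ensembl, pattern = "hsa.*")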
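The same pattern search helps to find the exact attribute and filter names needed for a getBM() call, such as the array name used earlier. The sketch below wraps the two calls; `find_id_fields()` is a hypothetical helper name, and the pattern in the usage line is only an example. It needs a Mart object with a dataset selected and network access.

```r
# Hypothetical wrapper: look up attribute and filter names matching a
# regular expression in the dataset attached to `mart`.
find_id_fields <- function(mart, pattern) {
  list(attributes = biomaRt::searchAttributes(mart = mart, pattern = pattern),
       filters    = biomaRt::searchFilters(mart = mart, pattern = pattern))
}

# Usage (contacts the Ensembl servers, so it is not run here):
# find_id_fields(ensembl, "affy.*u133")
```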

#-------------------------------------------------------
#Function: annotating with data outside Ensembl over HTTPS and an explicit port, following the biomaRt vignette
#-------------------------------------------------------
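# This example uses the nematode database WormBase ParaSite
listMarts(host = "https://parasite.wormbase.org", port = 443)
wormbase <- useMart(biomart = "parasite_mart",
                    host = "https://parasite.wormbase.org",
                    port = 443)
listDatasets(wormbase)
wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene")
head(listFilters(wormbase))
# Example query adapted from the biomaRt vignette; attribute and filter names
# may differ between ParaSite releases, so check listAttributes(wormbase) first
getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"),
      filters = "gene_name",
      values = c("unc-26", "his-33"),
      mart = wormbase)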

#-------------------------------------------------------
#Function: a mart can also be queried with columns(), select(), keys() and keytypes()
#-------------------------------------------------------
columns(ensembl)
keytypes(ensembl)
keys(ensembl, keytype = "chromosome_name")
select(ensembl, keys = values, columns = attributes, keytype = "affy_hg_u133_plus_2")