cMAP在线分析——旧版build2中Tag list的制作？

作者: Clariom | 来源:发表于2020-07-09 13:10 被阅读0次

cMAP在线分析——旧版build2中Tag list的制作？
cMAP新版clue的使用——List Marker
cMAP新版clue的使用——query
cMAP新版clue的使用——Touchstone
cMAP新版clue的使用——Data Library
cMAP新版clue的使用——CELL APP
cMap新版clue的使用——Morpheus
Connectivity Map(cMap)的探索应用（二）
包含字典dict的列表list的排序方法
Git 删除远程 tag 防止恢复

之前分享过cMap分析专题：

然后，有部分小伙伴咨询在“cMAP在线分析——旧版build2的使用”这篇教程中tag list的制作问题，今天就这部分做简要介绍！

首先来看下官方说明：

tag list: A list of probe sets from the feature set representing either the 'up' genes in a signature (the 'up tag list') or the 'down' genes in a signature (the 'down tag list'). One up tag list and one down tag list together represent a signature and both are required to execute a query using the quick query page or to upload a signature using the load signature page. Note that the total number of probe sets in the up tag list and down tag lists may not together exceed 1000. Text files with the .grp extension are used to contain tag lists. A .grp file has one probe set on each line.
Tag lists may be conveniently constructed from a signature populated with UniGene identifiers, RefSeq identifiers, Genbank identifiers, gene symbols, etc using the 'Batch Query' tool on NetAffx and selecting "HG-U133A" from the 'Select a GeneChip Array' menu. A variety of tools are available for the generation of tag lists from signatures from non-human species. These include GeneCruiser (http://www.genecruiser.org). Our preferred method uses UniGene and HomoloGene from NBCI (http://www.ncbi.nlm.nih.gov/) to obtain human UniGene identifiers for signature genes for submission to NetAffx, as described above.

关键大意：tag list通常分上调基因list和下调基因list，两者的基因总和不能大于1000个，同时要以.grp格式保存。基因名称可以是 UniGene ID, RefSeq ID, Genbank ID, gene symbols等，然后通过NetAffx的Batch Query功能进行转换，得到HG-U133A芯片中基因对应的探针ID。如何是其他物种，需要先通过同源性比较转化为人的基因，然后再提交给NetAffx完成探针ID的转化。至于如何进行同源性比较，官网提及 GeneCruiser（这个网址没打开）和NCBI HomoloGene等方式。

probe set: The collection of match and mismatch oligonucleotides on an Affymetrix GeneChip microarray designed against a given transcript which together allow the relative level of that transcript to be estimated. Probe sets are uniquely identified with a code number that, by convention, ends with "_at" ( eg 200800_s_at). Tag lists and instances are populated with probe sets from the feature set. Detailed descriptions of individual probe sets can be found through NetAffx.
feature set: The universe of probe sets recognized by cmap, defined as the 22,283 probe sets on the HG-U133A array.
HG-U133A: Affymetrix GeneChip Human Genome U133A Array (part number 510681). The probe sets on this array define the feature set.

关键大意：taglist的探针ID的来源主要是HG-U133A这款芯片，共计22283条探针信息，每个探针ID都有独一无二的数字编码，并以_at结尾为特征。

从GEO Database中检索GPL96，即可获取该芯片的注释信息（如下图），这时候可以通过对应关系，将人的基因名字（可以是官方gene symbols/RefSeq ID ）转化为芯片的探针ID。

image

目前待解决的问题：

1. 常规情况下，如何将基因转化为HG-U133A芯片的探针ID?

方式一：根据官方说明，拿到人的基因名字（UniGene ID/RefSeq ID/ Genbank ID/ gene symbols）后，这里采用的是 gene symbols，将基因名字提交给NetAffx（http://www.affymetrix.com/analysis/netaffx/index.affx）的Batch Query，完成基因名字—探针ID的转换（如下图）。

image

方式二：本地下载好HG-U133A芯片的注释文件（即上文中的GPL96平台信息），然后通过R语言直接将提供的基因名称与注释文件匹配，得到基因对应的探针ID。

rm(list = ls())
GPL96=read.table("GPL96-57554.txt",  header = TRUE,fill = T,sep = "\t",
                 comment.char = "#",
                 stringsAsFactors = FALSE,
                 quote = "")
colnames(GPL96)#1，11列为探针ID和gene symbol
data=read.table("基因名称.txt")
colnames(data)<- "Gene Symbol"
##获取基因名称和探针ID之间的关系
gene_tran_ID <- GPL96[GPL96$Gene.Symbol %in% data$`Gene Symbol`,c(1,11)]
# ID Gene.Symbol
# 1087  201559_s_at       CLIC4
# 1088    201560_at       CLIC4
# 1706    202178_at       PRKCZ
# 3406    203879_at      PIK3CD
# 3702    204175_at      ZNF593
# 5154    205627_at         CDA
# 8728    209234_at       KIF1B
# 10668 211230_s_at      PIK3CD
# 11733 212348_s_at       KDM1A
# 12649   213268_at      CAMTA1
# 17927 218562_s_at      TMEM57
# 18495   219131_at      UBIAD1
# 19228 219864_s_at       RCAN3
# 21150 221790_s_at     LDLRAP1
# 21241 221881_s_at       CLIC4
# 22116    57082_at     LDLRAP1

2. 如何将其它物种的基因转化为HG-U133A芯片的探针ID?

关键步骤在于将其他物种基因通过同源性映射到人的基因，然后再通过NetAffx的Batch Query，完成基因名字—探针ID的转换。那如何进行不同物种之间的同源基因转化呢？

在线同源基因转化：http://refdic.rcai.riken.jp/tools/matchom.cgi。如下图，提交了一份小鼠的基因名称（gene symbol）,想将其转化为同源的人基因名称，最后结果：提交17个小鼠的基因，成功匹配到14个人的同源基因。

image

本地转化—homologene包，该包基于NCBI的homologene数据库开发，旨在用于匹配不同物种之间的同源基因。该包的中包含多个函数，主要使用的是homologene，它的使用方式如下：
Usage：homologene(genes, inTax, outTax, db = homologene::homologeneData)
Arguments（参数）
genes：A vector of gene symbols or NCBI ids
inTax：taxid of the species that the input genes are coming from
outTax：taxid of the species that you are seeking homology
db：Homologene database to use.
Examples：homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)

从上述使用方式来看，只需要准备gene symbols or NCBI ids，然后说清楚，由什么物种（inTax）转化到什么物种（outTax）即可。这里用NCBI的taxid来表示不同物种。NCBI人为的给自然界所有物种都给了一个编号，这个编号就是taxid，比如其中人类的编号是 9606，10090是小鼠。

10090   Mus musculus
10116   Rattus norvegicus
28985   Kluyveromyces lactis
318829  Magnaporthe oryzae
33169   Eremothecium gossypii
3702    Arabidopsis thaliana
4530    Oryza sativa
4896    Schizosaccharomyces pombe
4932    Saccharomyces cerevisiae
5141    Neurospora crassa
6239    Caenorhabditis elegans
7165    Anopheles gambiae
7227    Drosophila melanogaster
7955    Danio rerio
8364    Xenopus (Silurana) tropicalis
9031    Gallus gallus
9544    Macaca mulatta
9598    Pan troglodytes
9606    Homo sapiens
9615    Canis lupus familiaris
9913    Bos taurus

通过homologene()函数可以非常快速的进行不同物种间同源基因的转化，且比在线工具的匹配度更高。

rm(list = ls())
options(download.file.method = 'libcurl')
options(url.method='libcurl')
install.packages("homologene")
library("homologene")
data=read.table("小鼠基因名称.txt")
dat=as.character(t(data))
genelist <- dat
# [1] "Prkcz"   "Camta1"  "Pik3cd"  "Kif1b"   "Ubiad1"  "Fbxo6"   "Agtrap"  "Rsc1a1"  "Cda"     "Kdm1a"   "Pithd1"  "Rcan3"   "Clic4"  
# [14] "Tmem57"  "Ldlrap1" "Pdik1l"  "Zfp593" 
output <- homologene(genelist, inTax = 10090, outTax = 9606)
# 10090    9606 10090_ID 9606_ID
# 1    Prkcz   PRKCZ    18762    5590
# 2   Camta1  CAMTA1   100072   23261
# 3   Pik3cd  PIK3CD    18707    5293
# 4    Kif1b   KIF1B    16561   23095
# 5   Ubiad1  UBIAD1    71707   29914
# 6    Fbxo6   FBXO6    50762   26270
# 7   Agtrap  AGTRAP    11610   57085
# 8   Rsc1a1  RSC1A1    69994    6248
# 9      Cda     CDA    72269     978
# 10   Kdm1a   KDM1A    99982   23028
# 11  Pithd1  PITHD1    66193   57095
# 12   Rcan3   RCAN3    53902   11123
# 13   Clic4   CLIC4    29876   25932
# 14  Tmem57  TMEM57    66146   55219
# 15 Ldlrap1 LDLRAP1   100017   26119
# 16  Pdik1l  PDIK1L   230809  149420
# 17  Zfp593  ZNF593    68040   51042

遇到的问题？

问：有500个gene（都是gene名），在转换探针ID后只剩下不到一半了，匹配率很低是为什么？
答：可能的原因：

这500个基因可能是其它物种的，如果是其它物种需要转化成人的同源基因，再行转化探针ID。
基因名称有很多别名，可能正好你提供的名字和平台注释提供的名字不同而被过滤。但实际操作过程中发现，这个影响因素很小。撇开匹配度的问题，本身HG-U133A芯片收录的基因本身就是不全面的，总共收录的基因也就两万多个。现在很多基因在HG-U133A芯片中就没有涉及相应探针。
若是由芯片本身基因设计不足带来的低匹配度，这该怎么办？个人觉得，技术的革新是正常的，基因检测数量的增加也很正常。但cmap分析基于的小分子表达数据库的数据大多是基于HG-U133A这款芯片，是将这个数据库作为比较参照的，所以这就导致我们要做这个分析，那么就要保证咱们的数据最好也是HG-U133A芯片数据，但显然不可能。因此，建议咱们的差异数据结果出来以后，先与HG-U133A芯片中收录的基因取交集，在交集的差异基因基础上，在分别取上调500基因，下调500基因去进行cmap分析。这样可以有效避免由于技术差异带来的低匹配基因，从而避免由于技术革新对cmap分析结果带来的影响。

往期回顾
Connectivity Map(cMap)的探索应用（一）
Connectivity Map(cMap)的探索应用（二）
cMAP在线分析——旧版build2的使用
 Connectivity Map(cMap)的探索应用（三）
cMAP新版clue的使用——List Marker
cMAP新版clue的使用——query
cMAP新版clue的使用——Touchstone
cMAP新版clue的使用——Data Library
cMap新版clue的使用——Morpheus
cMAP新版clue的使用——CELL APP