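When you download gene expression matrices for single-cell transcriptome data from several datasets, pay attention to the gene names in each matrix: the same gene may appear under different names. For example, CXCL8 is also called IL8. If you merge the matrices by simply intersecting gene names, this gene gets dropped.
Here I write a small program to solve this problem. I first downloaded GCF_000001405.40_GRCh38.p14_genomic.gtf.gz and searched it for IL8, that is, less -S GCF_000001405.40_GRCh38.p14_genomic.gtf.gz | grep IL8, which turned up this line: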
NC_000004.12 BestRefSeq gene 73740569 73743716 . + . gene_id "CXCL8"; transcript_id ""; db_xref "GeneID:3576"; db_xref "HGNC:HGNC:6025"; db_xref "MIM:146930"; description "C-X-C motif chemokine ligand 8"; gbkey "Gene"; gene "CXCL8"; gene_biotype "protein_coding"; gene_synonym "GCP-1"; gene_synonym "GCP1"; gene_synonym "IL8"; gene_synonym "LECT"; gene_synonym "LUCT"; gene_synonym "LYNAP"; gene_synonym "MDNCF"; gene_synonym "MONAP"; gene_synonym "NAF"; gene_synonym "NAP-1"; gene_synonym "NAP1"; gene_synonym "SCYB8";
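In other words, this GTF file records the synonyms of each gene in its gene_synonym attributes. If we build a lookup table that maps every synonym to its gene, the problem of one gene going by several names is solved.
We also have to handle the case where a gene name and a synonym coincide, or where one synonym is listed under more than one gene. Take APITD1 as an example: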
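A minimal sketch of such a lookup table, assuming the real file is tab-separated with the attributes in the ninth column, laid out as in the record above (gene_id "..."; gene_synonym "...";). The function names parse_attributes, build_synonym_map and load_synonym_map are hypothetical, and the sample record keeps only a few of the attributes to stay short.

```python
import gzip
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\w+)\s*"([^"]*)"')


def parse_attributes(attr_field):
    """Return {key: [values]}; a key such as gene_synonym can repeat."""
    attrs = defaultdict(list)
    for key, value in ATTR_RE.findall(attr_field):
        attrs[key].append(value)
    return attrs


def build_synonym_map(gtf_lines):
    """Map every name (gene_id or synonym) to the set of gene_ids carrying it."""
    name_to_genes = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene records are used here
        attrs = parse_attributes(fields[8])
        if not attrs["gene_id"]:
            continue
        gene_id = attrs["gene_id"][0]
        name_to_genes[gene_id].add(gene_id)
        for synonym in attrs["gene_synonym"]:
            name_to_genes[synonym].add(gene_id)
    return name_to_genes


def load_synonym_map(path):
    """Build the table from a gzip-compressed GTF file."""
    with gzip.open(path, "rt") as handle:
        return build_synonym_map(handle)


# Shortened version of the CXCL8 record, rebuilt with tabs between columns.
record = "\t".join([
    "NC_000004.12", "BestRefSeq", "gene", "73740569", "73743716", ".", "+", ".",
    'gene_id "CXCL8"; transcript_id ""; gene "CXCL8"; '
    'gene_synonym "IL8"; gene_synonym "NAP1";',
])
synonym_map = build_synonym_map([record])
print(sorted(synonym_map["IL8"]))  # ['CXCL8']
```

Each name maps to a set rather than a single gene, so a synonym that appears under more than one gene is kept visible instead of being silently overwritten.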
NC_000001.11 BestRefSeq gene 10430433 10442808 . + . gene_id "CENPS"; transcript_id ""; db_xref "GeneID:378708"; db_xref "HGNC:HGNC:23163"; db_xref "MIM:609130"; description "centromere protein S"; gbkey "Gene"; gene "CENPS"; gene_biotype "protein_coding"; gene_synonym "APITD1"; gene_synonym "CENP-S"; gene_synonym "FAAP16"; gene_synonym "MHF1";
NC_000001.11 BestRefSeq gene 10430433 10452153 . + . gene_id "CENPS-CORT"; transcript_id ""; db_xref "GeneID:100526739"; db_xref "HGNC:HGNC:38843"; description "CENPS-CORT readthrough"; gbkey "Gene"; gene "CENPS-CORT"; gene_biotype "protein_coding"; gene_synonym "APITD1"; gene_synonym "APITD1-CORT"; gene_synonym "CENP-S"; gene_synonym "CENPS"; gene_synonym "FAAP16"; gene_synonym "MHF1";
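CENPS seems to be part of a group of related genes: the second record is annotated as a "CENPS-CORT readthrough" and lists CENPS itself as a synonym, so both APITD1 and CENPS point to two different genes.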
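One possible way to deal with such collisions, sketched under assumptions: the table below is typed in by hand from the two records above, and resolve_name is a hypothetical helper with a policy chosen for illustration (an official gene_id wins over a synonym; a synonym shared by several genes is reported as ambiguous rather than guessed).

```python
# Hand-built from the two records above: name -> gene_ids that carry that name.
name_to_genes = {
    "CENPS": {"CENPS", "CENPS-CORT"},   # gene_id of one, synonym of the other
    "CENPS-CORT": {"CENPS-CORT"},
    "APITD1": {"CENPS", "CENPS-CORT"},  # synonym shared by both genes
    "APITD1-CORT": {"CENPS-CORT"},
    "CENP-S": {"CENPS", "CENPS-CORT"},
    "FAAP16": {"CENPS", "CENPS-CORT"},
    "MHF1": {"CENPS", "CENPS-CORT"},
}


def resolve_name(name, table):
    """Return the gene_id for a name, or None if unknown or ambiguous."""
    candidates = table.get(name, set())
    if name in candidates:
        return name  # the name is itself a gene_id: prefer it
    if len(candidates) == 1:
        return next(iter(candidates))  # unambiguous synonym
    return None  # not in the table, or shared by several genes


for query in ["CENPS", "APITD1-CORT", "APITD1", "UNKNOWN"]:
    print(query, "->", resolve_name(query, name_to_genes))
```

Names that come back as None need a separate decision before the matrices are merged, for example by checking the other identifiers a dataset provides.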