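- The goal of ID conversion is to find the correspondence between two columns of IDs: one column holds the IDs your data already has, the other holds the IDs you need. Two columns are all it takes, no more and no less. Linux commands can handle this: find a text file that contains both kinds of ID and extract them on the command line. This assumes you have mastered the basic Linux commands, at the level covered on the second day of the training course.
- Below I convert ENSG IDs to gene names using a GTF file. The file already contains both columns, so I only need to extract them.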
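Once such a two-column table exists, the conversion itself is just a lookup. Here is a minimal sketch, assuming a tab-separated map file and a list of IDs, one per line. The file names `id_map.tsv` and `my_ids.txt` and the rows in them are made up for illustration, and `ENSG00000000000.1` is a deliberately nonexistent ID.

```shell
# Work in a temporary directory so nothing in the current directory is touched
map_dir=$(mktemp -d)

# Hypothetical two-column map: gene_id <TAB> gene_name
printf 'ENSG00000223972.5\tDDX11L1\nENSG00000227232.5\tWASH7P\n' > "$map_dir/id_map.tsv"

# Hypothetical list of IDs to convert (the second one is not in the map)
printf 'ENSG00000227232.5\nENSG00000000000.1\n' > "$map_dir/my_ids.txt"

# First file (NR==FNR): load the map into an array.
# Second file: look up each ID; IDs missing from the map get NA instead of being dropped.
awk -F'\t' -v OFS='\t' '
NR == FNR { name[$1] = $2; next }
{ print $1, (($1 in name) ? name[$1] : "NA") }
' "$map_dir/id_map.tsv" "$map_dir/my_ids.txt" > "$map_dir/converted.tsv"

cat "$map_dir/converted.tsv"
```

Printing `NA` for unmatched IDs keeps the output the same length as the input, which makes it easier to spot IDs that failed to convert.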
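Download the GTF
You can use the GENCODE GTF or the NCBI GFF; either works as long as it contains the IDs you want. Only the processing code differs, so it pays to learn both the basic and the more advanced Linux commands: they are useful and efficient.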
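To illustrate how the code differs: GFF3 files such as NCBI's store column 9 as `key=value` pairs separated by `;`, whereas GTF uses `key "value";`. Below is a rough sketch of pulling `ID` and `Name` from gene rows. The sample row only imitates the NCBI layout and the file names are hypothetical, so check the attribute keys in your own file first.

```shell
gff_dir=$(mktemp -d)

# One illustrative gene row in GFF3 layout (tab-separated, attributes in column 9)
printf 'NC_000001.11\tBestRefSeq\tgene\t11874\t14409\t.\t+\t.\tID=gene-DDX11L1;Dbxref=GeneID:100287102;Name=DDX11L1;gene_biotype=transcribed_pseudogene\n' > "$gff_dir/sample.gff"

# Split column 9 on ";", then split each piece at the first "=" into key and value
awk -F'\t' -v OFS='\t' '
$3 == "gene" {
    id = name = "NA"
    n = split($9, parts, ";")
    for (i = 1; i <= n; i++) {
        eq = index(parts[i], "=")
        if (eq == 0) continue
        key = substr(parts[i], 1, eq - 1)
        val = substr(parts[i], eq + 1)
        if (key == "ID")   id = val
        if (key == "Name") name = val
    }
    print id, name
}' "$gff_dir/sample.gff" > "$gff_dir/gff_map.tsv"

cat "$gff_dir/gff_map.tsv"
```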
ENSG ID to gene name mapping
$ zless -S gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{print $2,$6,$4}'|sed 's/[";]//g'|sed '1i #gene_id\tgene_name\tgene_type'|less
# Alternatively, strip the quotes and semicolons with awk's gsub
$ zless -S gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{gsub(/[";]/,"");print $2,$6,$4}'|sed '1i #gene_id\tgene_name\tgene_type'|less
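The commands above rely on the attributes always appearing in the same order (`$2`, `$6`, `$4`). If that order is in doubt, one alternative is to look attributes up by name. This is a minimal sketch, not the method used in this post: the helper `attr` is a hypothetical function, the two sample rows only imitate the GENCODE layout, and for the real file you would feed in the decompressed `gencode.v31.annotation.gtf.gz` instead of `sample.gtf`.

```shell
gtf_dir=$(mktemp -d)

# Two illustrative rows in GENCODE GTF layout: one gene row, one transcript row
printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2;\n' > "$gtf_dir/sample.gtf"
printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA";\n' >> "$gtf_dir/sample.gtf"

awk -F'\t' -v OFS='\t' '
# Return the value of attribute `key` from column 9, or NA if it is absent
function attr(key,    n, i, parts, val) {
    n = split($9, parts, ";")
    for (i = 1; i <= n; i++) {
        sub(/^ +/, "", parts[i])                  # drop leading spaces
        if (index(parts[i], key " ") == 1) {      # piece starts with: key<space>
            val = substr(parts[i], length(key) + 2)
            gsub(/"/, "", val)                    # drop the surrounding quotes
            return val
        }
    }
    return "NA"
}
BEGIN { print "#gene_id", "gene_name", "gene_type" }
$3 == "gene" { print attr("gene_id"), attr("gene_name"), attr("gene_type") }
' "$gtf_dir/sample.gtf" > "$gtf_dir/gene_map.tsv"

cat "$gtf_dir/gene_map.tsv"
```

Testing `$3 == "gene"` checks the feature column directly, so a row cannot be selected just because the word `gene` appears somewhere else on the line.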
ENST ID to ENSG ID mapping
$ zless -S gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|less -S
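# Variant with sort|uniq to drop duplicate rows (note that sort orders the header line together with the data rows):
# zless -S gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|sort|uniq|less -S
# The output has four columns: gene_id, transcript_id, gene_name, transcript_type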
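A quick sanity check on a table like this is to count how many transcripts map to each gene. The sketch below assumes the four-column table has been saved to a file; the name `tx2gene.tsv` and every ID in it are placeholders, not real GENCODE entries.

```shell
t2g_dir=$(mktemp -d)

# Placeholder table in the same four-column layout, header included
printf 'gene_id\ttranscript_id\tgene_name\ttranscript_type\n'  > "$t2g_dir/tx2gene.tsv"
printf 'ENSG_A.1\tENST_A1.1\tGENE_A\tprotein_coding\n'        >> "$t2g_dir/tx2gene.tsv"
printf 'ENSG_A.1\tENST_A2.1\tGENE_A\tretained_intron\n'       >> "$t2g_dir/tx2gene.tsv"
printf 'ENSG_B.1\tENST_B1.1\tGENE_B\tlncRNA\n'                >> "$t2g_dir/tx2gene.tsv"

# Skip the header (NR > 1), count rows per gene_id/gene_name pair, then sort the result
awk -F'\t' -v OFS='\t' '
NR > 1 { count[$1 OFS $3]++ }
END    { for (k in count) print k, count[k] }
' "$t2g_dir/tx2gene.tsv" | sort > "$t2g_dir/tx_per_gene.tsv"

cat "$t2g_dir/tx_per_gene.tsv"
```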
Example
If you need to filter by type, first list the types that occur in the file:
$ zless -S gencode.v31.annotation.gtf.gz|grep -w 'transcript'|awk '{print $18}'|sort|uniq
# All transcript types:
"IG_C_gene";
"IG_C_pseudogene";
"IG_D_gene";
"IG_J_gene";
"IG_J_pseudogene";
"IG_pseudogene";
"IG_V_gene";
"IG_V_pseudogene";
"lncRNA";
"miRNA";
"misc_RNA";
"Mt_rRNA";
"Mt_tRNA";
"nonsense_mediated_decay";
"non_stop_decay";
"polymorphic_pseudogene";
"processed_pseudogene";
"protein_coding";
"pseudogene";
"retained_intron";
"ribozyme";
"rRNA";
"rRNA_pseudogene";
"scaRNA";
"scRNA";
"snoRNA";
"snRNA";
"sRNA";
"TEC";
"transcribed_processed_pseudogene";
"transcribed_unitary_pseudogene";
"transcribed_unprocessed_pseudogene";
"translated_processed_pseudogene";
"translated_unprocessed_pseudogene";
"TR_C_gene";
"TR_D_gene";
"TR_J_gene";
"TR_J_pseudogene";
"TR_V_gene";
"TR_V_pseudogene";
"unitary_pseudogene";
"unprocessed_pseudogene";
"vaultRNA";
# Then filter the result file from the previous step with grep
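Because many type names contain one another (for example `pseudogene` and `processed_pseudogene`), a plain `grep pseudogene` would match more rows than intended. One way around this, sketched here under the assumption that the four-column table was saved to a file, is an exact match on the type column with awk. The file name `tx2gene.tsv` and all IDs are placeholders.

```shell
flt_dir=$(mktemp -d)

# Placeholder table: gene_id, transcript_id, gene_name, transcript_type
printf 'gene_id\ttranscript_id\tgene_name\ttranscript_type\n'     > "$flt_dir/tx2gene.tsv"
printf 'ENSG_A.1\tENST_A1.1\tGENE_A\tprotein_coding\n'           >> "$flt_dir/tx2gene.tsv"
printf 'ENSG_B.1\tENST_B1.1\tGENE_B\tprocessed_pseudogene\n'     >> "$flt_dir/tx2gene.tsv"
printf 'ENSG_C.1\tENST_C1.1\tGENE_C\tpseudogene\n'               >> "$flt_dir/tx2gene.tsv"

# Keep the header (NR == 1) plus rows whose 4th column is exactly "pseudogene"
awk -F'\t' 'NR == 1 || $4 == "pseudogene"' "$flt_dir/tx2gene.tsv" > "$flt_dir/pseudogene.tsv"

cat "$flt_dir/pseudogene.tsv"
```

Swap `"pseudogene"` for any other type from the list, such as `"protein_coding"`, to get the corresponding subset.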