ID Conversion Series, Linux Command Processing 1: gtf


Author: Amy_Cui | Published: 2019-07-08 14:55
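    • The goal of ID conversion is to find the correspondence between two columns of IDs: one column holds the IDs your current data already has, the other holds the IDs you need. Two columns, no more and no less. Linux commands can handle this: find a text file that contains this information and process it with Linux commands. The prerequisite is that you have mastered the basic Linux commands, at the level covered on the second day of the training course.
    • Below I process a gtf file to convert ENSG IDs to gene names. The file already contains both columns, so I only need to extract them.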
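
To see why the commands in this post pick particular field numbers, it helps to look at how the attribute column (column 9) splits on whitespace. The sketch below uses one hand-written gene record in GENCODE style (the ID and name are made up for illustration, not taken from the real file) and prints each whitespace-separated field with its index.

```shell
# A single hand-made GENCODE-style gene record (illustrative values only).
line=$(printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO1.1"; gene_type "protein_coding"; gene_name "DEMO1"; level 2;')

# Keep column 9 (the attribute column), then number its whitespace-separated fields.
fields=$(printf '%s\n' "$line" | cut -f 9 | awk '{for (i = 1; i <= NF; i++) print i, $i}')
printf '%s\n' "$fields"
```

In this layout the values sit at the even positions: field 2 is the gene_id value, field 4 the gene_type value, and field 6 the gene_name value, still wrapped in quotes and followed by a semicolon.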

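    Download the gtf

    You can use the GENCODE gtf or the NCBI gff; either works as long as it contains the ID information you want. Only the processing code differs, so it is worth learning both the basic and the advanced Linux commands: they are useful and efficient.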

    ENSG ID to gene name mapping

    # after cut -f 9, whitespace splitting gives: $2 = gene_id, $4 = gene_type, $6 = gene_name
    $ zcat gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{print $2,$6,$4}'|sed 's/[";]//g'|sed '1i #gene_id\tgene_name\tgene_type'|less
    # or strip the quotes and semicolons with awk's gsub instead of sed
    $ zcat gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{gsub(/[";]/,"");print $2,$6,$4}'|sed '1i #gene_id\tgene_name\tgene_type'|less
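
The field positions used here depend on the attribute order of this particular GENCODE release. As a sketch of a position-independent alternative (my own variation, not the author's method), the awk script below selects gene records by column 3 and looks attributes up by key name. It runs on a two-record toy file created on the spot; the temporary path and all record values are made up for illustration.

```shell
# Build a tiny gzipped GTF with one gene and one transcript record (made-up values).
tmp=$(mktemp -d)
{
  printf 'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO1.1"; gene_type "lncRNA"; gene_name "DEMO1"; level 2;\n'
  printf 'chr1\tHAVANA\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG_DEMO1.1"; transcript_id "ENST_DEMO1.1"; gene_type "lncRNA"; gene_name "DEMO1"; transcript_type "lncRNA";\n'
} | gzip > "$tmp/toy.gtf.gz"

# Select gene records by column 3, then pull each attribute out by its key.
table=$(gzip -dc "$tmp/toy.gtf.gz" | awk -F '\t' -v OFS='\t' '
  function attr(key,    n, i, parts, kv) {
    n = split($9, parts, /; */)            # one "key value" pair per element
    for (i = 1; i <= n; i++) {
      if (index(parts[i], key " ") == 1) { # pair starts with "key "
        kv = substr(parts[i], length(key) + 2)
        gsub(/[";]/, "", kv)               # drop quotes and any semicolon
        return kv
      }
    }
    return "NA"                            # key not present in this record
  }
  BEGIN { print "#gene_id", "gene_name", "gene_type" }
  $3 == "gene" { print attr("gene_id"), attr("gene_name"), attr("gene_type") }
')
printf '%s\n' "$table"
rm -r "$tmp"
```

Testing `$3 == "gene"` also avoids relying on the word `gene` not appearing anywhere else on a line, which is what `grep -w 'gene'` implicitly assumes.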
    

    ENST ID to ENSG ID mapping

    # for transcript records: $2 = gene_id, $4 = transcript_id, $8 = gene_name, $10 = transcript_type
    $ zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|less -S
    # variant that also de-duplicates rows; note that sort reorders the header line along with the data
    # zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|sort|uniq|less -S
    # the output has 4 columns: gene_id, transcript_id, gene_name, transcript_type
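
Once the table exists, applying it is a plain lookup. A minimal sketch, assuming the four-column output has been saved to a file (here a hypothetical `tx2gene.tsv`) and that the transcript IDs to convert sit one per line in a hypothetical `query.txt`; both files are created below with made-up IDs so the example runs on its own.

```shell
tmp=$(mktemp -d)
# Hypothetical mapping table in the four-column layout described above (made-up IDs).
printf 'gene_id\ttranscript_id\tgene_name\ttranscript_type\n' > "$tmp/tx2gene.tsv"
printf 'ENSG_DEMO1.1\tENST_DEMO1.1\tDEMO1\tlncRNA\n' >> "$tmp/tx2gene.tsv"
printf 'ENSG_DEMO1.1\tENST_DEMO2.1\tDEMO1\tretained_intron\n' >> "$tmp/tx2gene.tsv"
# Hypothetical query list: transcript IDs to convert, one per line.
printf 'ENST_DEMO2.1\nENST_MISSING.1\n' > "$tmp/query.txt"

# The first file fills the lookup (transcript_id -> gene_id); the second file is translated.
converted=$(awk -F '\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) map[$2] = $1; next }   # skip the header row of the table
  { print $1, (($1 in map) ? map[$1] : "NA") }    # IDs without a match get NA
' "$tmp/tx2gene.tsv" "$tmp/query.txt")
printf '%s\n' "$converted"
rm -r "$tmp"
```

Printing `NA` for unmatched IDs keeps the output the same length as the query list, which makes missing mappings easy to spot.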
    
    Example

    If you need to list the transcript types or filter by type:

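    # on the full line, field 18 is the transcript_type value (8 leading columns + 5 key/value pairs)
    $ zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|awk '{print $18}'|sort|uniq
    # list all types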
    "IG_C_gene";
    "IG_C_pseudogene";
    "IG_D_gene";
    "IG_J_gene";
    "IG_J_pseudogene";
    "IG_pseudogene";
    "IG_V_gene";
    "IG_V_pseudogene";
    "lncRNA";
    "miRNA";
    "misc_RNA";
    "Mt_rRNA";
    "Mt_tRNA";
    "nonsense_mediated_decay";
    "non_stop_decay";
    "polymorphic_pseudogene";
    "processed_pseudogene";
    "protein_coding";
    "pseudogene";
    "retained_intron";
    "ribozyme";
    "rRNA";
    "rRNA_pseudogene";
    "scaRNA";
    "scRNA";
    "snoRNA";
    "snRNA";
    "sRNA";
    "TEC";
    "transcribed_processed_pseudogene";
    "transcribed_unitary_pseudogene";
    "transcribed_unprocessed_pseudogene";
    "translated_processed_pseudogene";
    "translated_unprocessed_pseudogene";
    "TR_C_gene";
    "TR_D_gene";
    "TR_J_gene";
    "TR_J_pseudogene";
    "TR_V_gene";
    "TR_V_pseudogene";
    "unitary_pseudogene";
    "unprocessed_pseudogene";
    "vaultRNA";
    
    
    # then filter the result file from above with grep
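
As a sketch of that filtering step, assuming the four-column transcript table was saved to a hypothetical file `tx2gene.tsv` (created here with made-up rows): `grep -w` works for whole-word matches, while an awk test on column 4 restricts the match to the type column.

```shell
tmp=$(mktemp -d)
# Hypothetical result file in the four-column transcript layout (made-up IDs).
printf 'gene_id\ttranscript_id\tgene_name\ttranscript_type\n' > "$tmp/tx2gene.tsv"
printf 'ENSG_DEMO1.1\tENST_DEMO1.1\tDEMO1\tprotein_coding\n' >> "$tmp/tx2gene.tsv"
printf 'ENSG_DEMO2.1\tENST_DEMO2.1\tDEMO2\tprocessed_pseudogene\n' >> "$tmp/tx2gene.tsv"
printf 'ENSG_DEMO3.1\tENST_DEMO3.1\tDEMO3\tpseudogene\n' >> "$tmp/tx2gene.tsv"

# grep -w treats "_" as a word character, so "pseudogene" does not hit "processed_pseudogene".
by_grep=$(grep -w 'pseudogene' "$tmp/tx2gene.tsv")
# awk compares column 4 only, so a match elsewhere on the line cannot slip through.
by_awk=$(awk -F '\t' '$4 == "protein_coding"' "$tmp/tx2gene.tsv")
printf '%s\n%s\n' "$by_grep" "$by_awk"
rm -r "$tmp"
```

Without `-w`, `grep 'pseudogene'` would return every type whose name contains that string, which may or may not be what you want.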
    
