Bioinformatics Exercises | Part 1

Author: kkkkkkang | Published: 2020-08-26 14:31

    Below is a collection of bioinformatics exercises worth writing down, which I came across while doing analyses and studying.

    Exercise 1: Download the latest KEGG annotation text file and write a script that reorganizes it into a mapping of KEGG pathway IDs to gene IDs.

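    #For downloading the data, see the 技能树 (Skill Tree) tutorial: http://www.bio-info-trainee.com/2550.html
    My approach: lines starting with C carry the KEGG pathway ID, and lines starting with D carry the gene ID. Store the second field of each C line in the variable a. For every D line that follows, print a and the D line's second field separated by \t. When the next C line appears, a is updated to the new pathway ID, and this repeats until the end of the file.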
    awk '{if ($0 ~ /^C/) a=$2; else if ($0 ~ /^D/) print a"\t"$2}' hsa00001.keg | less
    00010   3101
    00010   3098
    00010   3099
    00010   80201
    00010   2645
    00010   2821
    00010   5213
    00010   5214
    00010   5211
    00010   2203
    00010   8789
    00010   230
    00010   226
    00010   229
    00010   7167
    00010   2597
    00010   26330
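For readers who prefer Python, the same idea as the awk one-liner can be sketched as follows. This is a minimal illustration and not part of the original solution: the function name `pathway_gene_pairs` and the sample lines are made up for the example, and the sample only mimics the C/D layout described above. Unlike the awk command, it skips D lines that appear before any C line.

```python
def pathway_gene_pairs(lines):
    """Yield (pathway_id, gene_id) pairs from the lines of a KEGG .keg file."""
    pathway_id = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # nothing to extract from this line
        if line.startswith("C"):
            # Remember the current pathway ID (second field of a C line).
            pathway_id = fields[1]
        elif line.startswith("D") and pathway_id is not None:
            # Pair every gene ID (second field of a D line) with that pathway.
            yield pathway_id, fields[1]


# Hand-made sample in the C/D layout described above (illustrative only).
sample = [
    "A09100 Metabolism",
    "B  09101 Carbohydrate metabolism",
    "C    00010 Glycolysis / Gluconeogenesis [PATH:hsa00010]",
    "D      3101 HK3; hexokinase 3",
    "D      3098 HK1; hexokinase 1",
    "C    00020 Citrate cycle (TCA cycle) [PATH:hsa00020]",
    "D      1431 CS; citrate synthase",
]
pairs = list(pathway_gene_pairs(sample))
for pathway_id, gene_id in pairs:
    print(f"{pathway_id}\t{gene_id}")
```

To run it on the real file, pass an open file handle instead of the sample list, for example `pathway_gene_pairs(open("hsa00001.keg"))`.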
    

    Exercise 2: Parse go-basic.obo into a tab-separated three-column table

    The raw go-basic.obo looks like this

    format-version: 1.2
    data-version: releases/2020-08-11
    subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
    subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
    subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
    subsetdef: goslim_agr "AGR slim"
    subsetdef: goslim_aspergillus "Aspergillus GO slim"
    subsetdef: goslim_candida "Candida GO slim"
    subsetdef: goslim_chembl "ChEMBL protein targets summary"
    subsetdef: goslim_drosophila "Drosophila GO slim"
    subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
    subsetdef: goslim_generic "Generic GO slim"
    subsetdef: goslim_metagenomics "Metagenomics GO slim"
    subsetdef: goslim_mouse "Mouse GO slim"
    subsetdef: goslim_pir "PIR GO slim"
    subsetdef: goslim_plant "Plant GO slim"
    subsetdef: goslim_pombe "Fission yeast GO slim"
    subsetdef: goslim_synapse "synapse GO slim"
    subsetdef: goslim_yeast "Yeast GO slim"
    synonymtypedef: syngo_official_label "label approved by the SynGO project"
    synonymtypedef: systematic_synonym "Systematic synonym" EXACT
    default-namespace: gene_ontology
    remark: cvs version: use data-version
    remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
    ontology: go
    
    [Term]
    id: GO:0000001
    name: mitochondrion inheritance
    namespace: biological_process
    def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
    synonym: "mitochondrial inheritance" EXACT []
    is_a: GO:0048308 ! organelle inheritance
    is_a: GO:0048311 ! mitochondrion distribution
    
    [Term]
    id: GO:0000002
    name: mitochondrial genome maintenance
    namespace: biological_process
    def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
    is_a: GO:0007005 ! mitochondrion organization
    
    [Term]
    id: GO:0000003
    name: reproduction
    namespace: biological_process
    alt_id: GO:0019952
    alt_id: GO:0050876
    def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
    subset: goslim_agr
    subset: goslim_chembl
    subset: goslim_flybase_ribbon
    subset: goslim_generic
    subset: goslim_pir
    subset: goslim_plant
    synonym: "reproductive physiological process" EXACT []
    xref: Wikipedia:Reproduction
    is_a: GO:0008150 ! biological_process
    

    The desired output looks like this (columns are tab-separated)

    GO Description Level
    GO:0000001 mitochondrion inheritance biological_process
    GO:0000002 mitochondrial genome maintenance biological_process
    GO:0000003 reproduction biological_process
    GO:0000005 obsolete ribosomal chaperone activity molecular_function
    GO:0000006 high-affinity zinc transmembrane transporter activity molecular_function
    GO:0000007 low-affinity zinc ion transmembrane transporter activity molecular_function
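    • My approach is simple, and the pattern is easy to spot: the first output column comes from the id line of the original file, the second column from the name line, and the third column from the namespace line.
      Extract them in that order, separate the first two with a tab, and end the last one with the default newline '\n'.
    • But there is a pitfall here. If you match lines with Python's startswith, startswith("name") also matches the namespace lines.
      The fix is simple: match on name: (with the colon) instead.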
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import argparse
    parser = argparse.ArgumentParser(description = "\nThis python script is used to parse the go-basic.obo file into tab delimited table", add_help = False, usage = "\n python3 parse_go.py -i [go-basic.obo] -o [go.txt] \n python3 parse_go.py --input [go-basic.obo] --output [go.txt]")
    required = parser.add_argument_group("Required options")
    optional = parser.add_argument_group("Optional options")
    required.add_argument("-i", "--input", metavar = "[go-basic.obo]", help = "input file format: go_obo", required = True)
    required.add_argument("-o", "--output", metavar = "[go.txt]", help = "output file format: tab delimited table", required = True)
    optional.add_argument("-h", "--help", action = "help", help = "help.info")
    args = parser.parse_args()
    with open(args.input,"r") as fi:
        with open(args.output,"w") as fo:
            # Header row of the output table
            print("GO\tDescription\tLevel",file = fo)
            for line in fi:
                # Match each tag together with its colon so that one tag
                # cannot be mistaken for the prefix of another
                if line.startswith("id:"):
                    print(line.split(" ")[1].strip(), file = fo, end = '\t')
                # "name:" (not "name") so that "namespace:" lines are not caught here
                elif line.startswith("name:"):
                    print(' '.join(line.split(" ")[1:]).strip(), file = fo, end = '\t')
                elif line.startswith("namespace:"):
                    print(line.split(" ")[1].strip(), file = fo)
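The script above reacts to every id, name and namespace line, whichever stanza it belongs to. OBO files can also contain other stanza types, such as [Typedef], that carry the same tags, so those entries may end up in the table too. A possible variant that only reports [Term] stanzas is sketched below. It is an illustration under assumptions, not the author's method: the function name `parse_obo_terms` and the sample text are invented for the example.

```python
def parse_obo_terms(lines):
    """Yield (id, name, namespace) for each [Term] stanza in OBO-formatted lines."""
    in_term = False
    record = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins: emit the previous one if it was a [Term].
            if in_term and "id" in record:
                yield record["id"], record.get("name", ""), record.get("namespace", "")
            in_term = line == "[Term]"
            record = {}
        elif in_term and ": " in line:
            # Split on the first ": " only, so "GO:0000001" stays intact.
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace") and key not in record:
                record[key] = value
    # Emit the last stanza of the file.
    if in_term and "id" in record:
        yield record["id"], record.get("name", ""), record.get("namespace", "")


# Small hand-made sample (illustrative only).
sample = """format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process

[Typedef]
id: part_of
name: part of
namespace: external
""".splitlines()

rows = list(parse_obo_terms(sample))
for row in rows:
    print("\t".join(row))
```

Comparing the exact key after splitting on the first ": " also sidesteps the name/namespace prefix pitfall, because the key is tested for equality, not by prefix.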
    

    Exercise 3: Compute the total length of the human exon regions

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    #2020.8
    import argparse
    parser = argparse.ArgumentParser(description = '\nThis is a python3 script used to count the total length of human exons', add_help = False, usage = '\npython3 extron_len.py -i [human_extron.txt]')
    required = parser.add_argument_group('Required options')
    optional = parser.add_argument_group('Optional options')
    required.add_argument('-i', metavar = '[human_extron.txt]', help = 'input: exon coordinates can be downloaded at https://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt', required = True)
    optional.add_argument('-h','--help', action = 'help', help = 'help.info')
    args = parser.parse_args()
    # Unique (chromosome, start, end) triples, so an exon shared by several records is counted once.
    # Keying on the start coordinate alone would merge unrelated exons from different chromosomes.
    exons = set()
    with open(args.i,'r') as f:
        next(f) # skip the header line
        for line in f:
            fields = line.rstrip('\n').split('\t')
            # Column 10 holds the coordinate list, e.g. [start-end, start-end]; '-' means no coordinates
            if fields[9] != '-':
                for exon in fields[9].lstrip('[').rstrip(']').split(', '):
                    start, end = exon.split('-')
                    exons.add((fields[0], int(start), int(end)))
    length = 0
    for chrom, start, end in exons:
        length += end - start + 1 # both ends are inclusive
    # Note: exons that overlap without being identical are still counted separately
    print(length)
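Summing exon lengths one interval at a time can count the same bases more than once when exons from different records overlap without being identical. One common way around this is to merge overlapping intervals per chromosome before summing. The sketch below illustrates that idea only; the function name `merged_length` and the example coordinates are made up, and it assumes inclusive start and end positions as in the script above.

```python
from collections import defaultdict


def merged_length(intervals):
    """Return the number of bases covered by (chrom, start, end) intervals.

    Start and end are treated as inclusive; overlapping or directly
    adjacent intervals on the same chromosome are merged first.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlaps or touches the current block: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current block and start a new one.
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


# Made-up coordinates: two overlapping exons and one separate exon on
# chromosome 1, plus one exon on chromosome 2.
example = [("1", 100, 199), ("1", 150, 249), ("1", 300, 309), ("2", 100, 199)]
print(merged_length(example))  # 150 + 10 + 100 = 260
```

A plain sum of the four interval lengths would give 310, so the 50 overlapping bases on chromosome 1 are the difference.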
    
