Parsing GenBank files with Biopython to get taxonomic classification

Author: 小明的数据分析笔记本 | Published 2020-08-09 10:45

    NCBI's mitochondrial genome database:

    ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/

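    The first few species I looked at all seem to be animals. The database also provides files in GenBank format, so it should be possible to check in batch whether the data contains any plant mitochondria.

    So how do we get a species' taxonomic classification from a GenBank file?
    Biopython provides a parser for GenBank files.

    Example GenBank file: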

    LOCUS       NC_035240                114 bp    DNA     linear   PLN 14-JUL-2017
    DEFINITION  Punica granatum chloroplast, complete genome.
    ACCESSION   NC_035240 REGION: 70545..70658
    VERSION     NC_035240.1
    DBLINK      BioProject: PRJNA394497
    KEYWORDS    RefSeq.
    SOURCE      chloroplast Punica granatum (pomegranate)
      ORGANISM  Punica granatum
                Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
                Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
                Pentapetalae; rosids; malvids; Myrtales; Lythraceae; Punica.
    REFERENCE   1  (bases 1 to 114)
      AUTHORS   Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
                Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
                Ruhlman,T.A.
      TITLE     Plastome sequencing of 10 non-model crop species reveals multiple
                inversions, gene transfers to the nucleus and a recent, large
                mitochondrial insertion in the tree species cashew (Anacardium,
                Anacardiaceae)
      JOURNAL   Unpublished
    REFERENCE   2  (bases 1 to 114)
      CONSRTM   NCBI Genome Project
      TITLE     Direct Submission
      JOURNAL   Submitted (14-JUL-2017) National Center for Biotechnology
                Information, NIH, Bethesda, MD 20894, USA
    REFERENCE   3  (bases 1 to 114)
      AUTHORS   Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
                Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
                Ruhlman,T.A.
      TITLE     Direct Submission
      JOURNAL   Submitted (17-FEB-2017) Biological Sciences, King Abdulaziz
                University, P.O.Box 80141, Jeddah 21589, Saudi Arabia
    COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
                NCBI review. The reference sequence is identical to KY635883.
                
                ##Assembly-Data-START##
                Assembly Method       :: Velvet v. 1.2.08
                Sequencing Technology :: Illumina
                ##Assembly-Data-END##
                COMPLETENESS: full length.
    FEATURES             Location/Qualifiers
         source          1..114
                         /organism="Punica granatum"
                         /organelle="plastid:chloroplast"
                         /mol_type="genomic DNA"
                         /db_xref="taxon:22663"
         gene            1..114
                         /gene="petG"
                         /locus_tag="CGW82_pgp045"
                         /db_xref="GeneID:33351918"
         CDS             1..114
                         /gene="petG"
                         /locus_tag="CGW82_pgp045"
                         /codon_start=1
                         /transl_table=11
                         /product="cytochrome b6/f complex subunit V"
                         /protein_id="YP_009390828.1"
                         /db_xref="GeneID:33351918"
                         /translation="MIEVFLFGIVLGLIPITLAGLFVTAYLQYRRGDQLDF"
    ORIGIN      
            1 atgattgaag tttttctatt tggaattgtc ttaggtctaa ttcctattac tttagctgga
           61 ttatttgtaa ctgcatattt acaatacaga cgtggtgatc agttggactt ttga
    //
    

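    Most of the content before the FEATURES Location/Qualifiers line is stored as a dictionary in the record's annotations attribute (the DEFINITION line and the record identifiers are exposed as separate attributes such as description and id). To retrieve the dictionary, a short loop is enough: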

    from Bio import SeqIO

    for rec in SeqIO.parse('sequence.gb', 'gb'):  # 'gb' is an alias for 'genbank'
        print(rec.annotations)  # header fields of the record, as a dict
    

    The output is:

    {'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '14-JUL-2017', 'accessions': ['NC_035240', 'REGION:', '70545..70658'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'chloroplast Punica granatum (pomegranate)', 'organism': 'Punica granatum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids', 'Myrtales', 'Lythraceae', 'Punica'], 'references': [Reference(title='Plastome sequencing of 10 non-model crop species reveals multiple inversions, gene transfers to the nucleus and a recent, large mitochondrial insertion in the tree species cashew (Anacardium, Anacardiaceae)', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence is identical to KY635883.\nCOMPLETENESS: full length.', 'structured_comment': OrderedDict([('Assembly-Data', OrderedDict([('Assembly Method', 'Velvet v. 1.2.08'), ('Sequencing Technology', 'Illumina')]))])}
    

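    The taxonomic classification is stored under the key taxonomy, and its value is a list. To decide whether a species is a plant, it should be enough to check whether Viridiplantae is in that list.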
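
The check described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the helper names is_plant and iter_plant_records, the local directory "mitochondrion" and the "*.gbff.gz" file pattern are all assumptions made for the example, so adjust them to wherever the downloaded files actually live.

```python
import gzip
from pathlib import Path


def is_plant(taxonomy):
    """Return True if the lineage list contains 'Viridiplantae'."""
    return "Viridiplantae" in taxonomy


def iter_plant_records(path):
    """Yield (record id, organism) for each plant record in one GenBank file.

    Accepts plain or gzip-compressed files; requires Biopython.
    """
    # Local import: is_plant() itself does not depend on Biopython.
    from Bio import SeqIO

    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for rec in SeqIO.parse(handle, "genbank"):
            # A record may lack a lineage, so fall back to an empty list.
            if is_plant(rec.annotations.get("taxonomy", [])):
                yield rec.id, rec.annotations.get("organism", "unknown")


if __name__ == "__main__":
    # Hypothetical download directory and file pattern (assumptions).
    for gb_file in sorted(Path("mitochondrion").glob("*.gbff.gz")):
        for rec_id, organism in iter_plant_records(gb_file):
            print(gb_file.name, rec_id, organism, sep="\t")
```

Note that Viridiplantae covers green algae as well as land plants, so a stricter filter (for example on Embryophyta, which also appears in the lineage above) may be preferable if only land plants are of interest.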
