NCBI's mitochondrial genome database
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/
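The first few species I looked at all seem to be animals. GenBank-format files are also provided here, so it should be possible to scan the data in bulk and check whether it contains any plant mitochondrial genomes.
So how do we get a species' taxonomic classification from a GenBank file?
Biopython provides methods for parsing GenBank files.
Example GenBank file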
LOCUS NC_035240 114 bp DNA linear PLN 14-JUL-2017
DEFINITION Punica granatum chloroplast, complete genome.
ACCESSION NC_035240 REGION: 70545..70658
VERSION NC_035240.1
DBLINK BioProject: PRJNA394497
KEYWORDS RefSeq.
SOURCE chloroplast Punica granatum (pomegranate)
ORGANISM Punica granatum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
Pentapetalae; rosids; malvids; Myrtales; Lythraceae; Punica.
REFERENCE 1 (bases 1 to 114)
AUTHORS Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
Ruhlman,T.A.
TITLE Plastome sequencing of 10 non-model crop species reveals multiple
inversions, gene transfers to the nucleus and a recent, large
mitochondrial insertion in the tree species cashew (Anacardium,
Anacardiaceae)
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 114)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (14-JUL-2017) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 114)
AUTHORS Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
Ruhlman,T.A.
TITLE Direct Submission
JOURNAL Submitted (17-FEB-2017) Biological Sciences, King Abdulaziz
University, P.O.Box 80141, Jeddah 21589, Saudi Arabia
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence is identical to KY635883.
##Assembly-Data-START##
Assembly Method :: Velvet v. 1.2.08
Sequencing Technology :: Illumina
##Assembly-Data-END##
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..114
/organism="Punica granatum"
/organelle="plastid:chloroplast"
/mol_type="genomic DNA"
/db_xref="taxon:22663"
gene 1..114
/gene="petG"
/locus_tag="CGW82_pgp045"
/db_xref="GeneID:33351918"
CDS 1..114
/gene="petG"
/locus_tag="CGW82_pgp045"
/codon_start=1
/transl_table=11
/product="cytochrome b6/f complex subunit V"
/protein_id="YP_009390828.1"
/db_xref="GeneID:33351918"
/translation="MIEVFLFGIVLGLIPITLAGLFVTAYLQYRRGDQLDF"
ORIGIN
1 atgattgaag tttttctatt tggaattgtc ttaggtctaa ttcctattac tttagctgga
61 ttatttgtaa ctgcatattt acaatacaga cgtggtgatc agttggactt ttga
//
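The header fields that come before the line "FEATURES Location/Qualifiers" are stored as a dictionary in the record's annotations attribute (the DEFINITION line is the exception: it goes into rec.description). To get this content, a short script is enough: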
from Bio import SeqIO
for rec in SeqIO.parse('sequence.gb', 'gb'):
    print(rec.annotations)  # header fields of each record, as a dict
The output is
{'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '14-JUL-2017', 'accessions': ['NC_035240', 'REGION:', '70545..70658'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'chloroplast Punica granatum (pomegranate)', 'organism': 'Punica granatum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids', 'Myrtales', 'Lythraceae', 'Punica'], 'references': [Reference(title='Plastome sequencing of 10 non-model crop species reveals multiple inversions, gene transfers to the nucleus and a recent, large mitochondrial insertion in the tree species cashew (Anacardium, Anacardiaceae)', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence is identical to KY635883.\nCOMPLETENESS: full length.', 'structured_comment': OrderedDict([('Assembly-Data', OrderedDict([('Assembly Method', 'Velvet v. 1.2.08'), ('Sequencing Technology', 'Illumina')]))])}
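The taxonomic classification is stored under the key taxonomy, and its value is a list. To decide whether a species is a plant, it should be enough to check whether Viridiplantae is in that list (note that Viridiplantae covers green algae as well as land plants).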
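The check described above could be sketched roughly as follows. This is only a minimal sketch: the helper names is_plant and iter_plant_records are hypothetical, and the file name sequence.gb is simply reused from the earlier snippet. It assumes every record of interest has a taxonomy entry in its annotations; records without one are treated as non-plants.

```python
def is_plant(taxonomy):
    """Return True if the lineage list contains 'Viridiplantae'."""
    return "Viridiplantae" in taxonomy


def iter_plant_records(path):
    """Yield (record id, organism) for every plant record in a GenBank file.

    `path` is a hypothetical argument: any uncompressed GenBank file.
    """
    # Imported here so that is_plant() can be used without Biopython installed.
    from Bio import SeqIO

    for rec in SeqIO.parse(path, "gb"):
        # Fall back to an empty list if a record has no taxonomy annotation.
        if is_plant(rec.annotations.get("taxonomy", [])):
            yield rec.id, rec.annotations.get("organism", "")


# Example usage (assumes 'sequence.gb' exists in the working directory):
# for rec_id, organism in iter_plant_records("sequence.gb"):
#     print(rec_id, organism)
```

If the files downloaded from the FTP directory are gzip-compressed, they would need to be opened with gzip.open(path, "rt") and the resulting handle passed to SeqIO.parse instead of the path.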
Feel free to follow my WeChat official account:
小明的数据分析笔记本