叶绿体基因组类的文章通常会有一幅图来展示叶绿体基因组的相似性(Sequence identity plot),出图的工具是mVISTA:mVISTA分为本地版和在线版两种。本文简要介绍使用在线版mVISTA获得Sequence identity plot的步骤。
本文使用到的序列数据为5个苹果属植物的叶绿体基因组序列
Species | accession numbers |
---|---|
Malus florentina | NC_035625 |
Malus micromalus | NC_036368 |
Malus prunifolia | NC_031163 |
Malus trilobata | NC_035671 |
Malus tschonoskii | NC_035672 |
第一步:下载序列
下载每个叶绿体基因组的fasta格式;下载作为参考基因组的genbank格式文件。(序列不是很多可以手动下载;如果序列条数非常多自己写个一个简单的python脚本批量下载叶绿体基因组序列)
import os
import sys
import argparse
parser = argparse.ArgumentParser(description='This script was used to download gb or fasta file of cp genome from NCBI nucleotides database')
parser.add_argument('-f','--format',help='Please input file format you want to download',required=True)
parser.add_argument('-a','--accession',help='file name contain accession number of cp genome you want to download',required=True)
args = parser.parse_args()
fr = open(args.accession,'r')
acc = []
for line in fr:
acc.append(line.strip())
print("The number of sequence will be downloaded is: ",len(acc))
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "" #在这里添加邮箱
fold_name = "downladed_seq_"+args.format
os.mkdir(fold_name)
os.chdir(fold_name)
for line in acc:
print(line + " will be downloaded!")
fw = open(line+"."+args.format,'w')
cp = Entrez.efetch(db='nucleotide',id=[line],rettype=args.format)
seq = SeqIO.read(cp,args.format)
SeqIO.write(seq,fw,args.format)
fw.close()
print(line + " is downloaded!")
print(len(os.listdir()),"sequence have been downloaded")
使用方式
#下载fasta格式
python download_gb_or_fa_from_NCBI_cp_genome_database_1.py -f fasta -a accession_numbers.txt
# 下载genbank格式
python download_gb_or_fa_from_NCBI_cp_genome_database_1.py -f gb -a accession_numbers.txt
# 序列号放到文本文件中每行一个
第二步:上传序列至mVISTA
http://genome.lbl.gov/vista/mvista/submit.shtml
本例中 total number of seuqences is 5 image.png
- 填写邮箱地址,运行完结果发送至邮箱;上传下载好的fasta格式序列
-
选择比对程序 论文中通常使用第三个
image.png -
上传注释文件
mVISTA要求的注释文件格式为 http://genome.lbl.gov/vista/mvista/instructions.shtml
image.png
可以通过处理genbank格式文件得到
import argparse
from Bio import SeqIO
parser = argparse.ArgumentParser(description = 'This script was used to get mVISTA annotation file used to cp genome comparative analysis')
parser.add_argument('-i','--genbank',help="Please input genBank format file", required = True)
args = parser.parse_args()
fw = open(args.genbank + "_mVISTA_annotation","w")
for rec in SeqIO.parse(args.genbank,"gb"):
for feature in rec.features:
if feature.type == "gene":
for part in feature.location.parts:
if part.strand == -1:
start_location = str(part.start)
end_location = str(part.end)
gene_name = feature.qualifiers["gene"][0]
fw.write("<\t%s\t%s\t%s\n"%(start_location,end_location,gene_name))
else:
start_location = str(part.start)
end_location = str(part.end)
gene_name = feature.qualifiers["gene"][0]
fw.write("<\t%s\t%s\t%s\n"%(start_location,end_location,gene_name))
elif feature.type == "CDS":
for part in feature.location.parts:
if part.strand == -1:
start_location = str(part.start)
end_location = str(part.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"exon"))
else:
start_location = str(part.start)
end_location = str(part.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"exon"))
elif feature.type == "tRNA" or feature.type == "rRNA":
for part in feature.location.parts:
start_location = str(feature.location.start)
end_location = str(feature.location.end)
fw.write("%s\t%s\t%s\n"%(start_location,end_location,"utr"))
else:
print("%s %s"%(feature.type,"is not needed"))
print("The process is done!")
fw.close()
# 使用方法
python get_mVISTA_annotation_file_from_genbank_1.py -i NC_031163.gb
image.png
这里只上传参考序列的注释文件和全部上传注释文件是否有区别自己还没有搞清楚,这里暂时只选择上传参考序列的注释文件。其他参数选择默认,具体参数是什么功能自己还需要仔细查看帮助文档。
image.png
结果以邮件的形式发送邮箱
image.png
选择输出的长度 点击此处下载结果
image.png
输出结果为pdf文件自己调节细节
欢迎大家关注我的公众号 小明的数据分析笔记本
网友评论