fasta文件的读取是所有数据分析的第一步。fasta文件是包含一行含有">"的序列名和一行包含其对应的序列的文件,经常需要将序列名和序列拆分处理,因此,需要将它们存成方便读取的"字典"是最好的选择。
例如:根据特定序列ID从fasta文件提取相关序列, ID是基因名
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# 需要先conda activate python3
"""
作者:徐诗芬
功能:将序列读取成字典,根据特定序列ID提取字典里的序列
日期:2021.1.9
"""
import sys
import time
def usage():
print('Usage: python seq_abstract.py [fasta_file] [index_file] [out_indexfile] [out_remainfile]')
def fasta2dict(inf):
# 按行读取序列
# 输入fasta文件,返回名称,序列
global name
dict = {}
for line in inf:
line = line.strip()
if line.startswith('>'):
name = line[1:]
dict[name] = ''
else:
dict[name] += line
return dict
def main():
inf = open(sys.argv[1],'r')
index = open(sys.argv[2], 'r')
outf1 = open(sys.argv[3], 'w')
outf2 = open(sys.argv[4], 'w')
dict = fasta2dict(inf)
key_list = list(dict.keys())
index_list = []
for line in index:
line = line.strip()
index_list.append(line)
for header in key_list:
for flag in index_list:
if flag in header:
outf1.write(header + '\n')
outf1.write(dict[header] + '\n')
else:
outf2.write(header + '\n')
outf2.write(dict[header] + '\n')
inf.close()
index.close()
outf1.close()
outf2.close()
try:
main()
except IndexError:
usage()
end_time = time.process_time()
print('Running time: {:.2f} Seconds'.format(end_time))
网友评论