Load the human.fasta file into a dictionary, with each sequence ID as the key and the sequence as the corresponding value.
#!/usr/bin/env python3
import sys
fh = open(sys.argv[1], "rt")
-----------------------------------
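faDict = {}
for line in fh:
    if line.startswith(">"):
        # Header line: used as the key as-is (still includes ">" and the trailing newline)
        ID = line
        faDict[ID] = ""
    else:
        # Sequence line: += on a str builds a new string on every iteration
        faDict[ID] += line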
------------------------------------
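'''
The code above is extremely slow: for a single human genome file (3 GB+) it needs well over an hour (to be honest, I never let it finish and just hit Ctrl+C). My senior labmate and I spent a long time tracking down the cause: the way the strings are concatenated.
Ways to concatenate strings in Python:
1. + : strings are immutable, so each concatenation builds a new string object and copies the existing contents into it. With only a few concatenations the cost is negligible; with millions of them it becomes very slow.
2. join : computes the total length first and allocates the result once, instead of allocating a new string for every piece.
3. %s
4. format
'''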
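To make the difference concrete, here is a minimal sketch that compares the two accumulation strategies on synthetic data. The names `concat_plus`, `concat_join`, and `time_both` are hypothetical helpers for illustration and are not part of the original script. Actual timings depend on the Python version and the machine, so none are quoted here.

```python
import timeit


def concat_plus(lines):
    """Accumulate into a dict value with += (the slow pattern from the first snippet)."""
    d = {"seq": ""}
    for line in lines:
        d["seq"] += line  # may copy the whole accumulated string each time
    return d["seq"]


def concat_join(lines):
    """Collect the pieces in a list and join them once at the end."""
    d = {"seq": []}
    for line in lines:
        d["seq"].append(line)
    return "".join(d["seq"])


def time_both(n_lines=5000):
    """Return the wall-clock seconds for one run of each strategy (illustrative only)."""
    lines = ["ACGT" * 15 + "\n"] * n_lines
    return {
        "plus": timeit.timeit(lambda: concat_plus(lines), number=1),
        "join": timeit.timeit(lambda: concat_join(lines), number=1),
    }
```

Calling `time_both()` with increasing `n_lines` should show how the gap between the two entries changes as the input grows.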
# Solution 1
------------------------------------
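faDict = {}
for line in fh:
    if line.startswith(">"):
        # Header line: start a new list of sequence lines for this ID
        ID = line
        faDict[ID] = []
    else:
        # Sequence line: appending to a list does not copy the earlier lines
        faDict[ID].append(line)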
-------------------------------------
Time taken: 18 s
# Usage
seq = "".join([chunk.strip("\n") for chunk in faDict[ID]])
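Putting the pieces together, the list-then-join approach can be wrapped in a single function that joins each record once, when the next header arrives. This is only a sketch under assumptions: the name `read_fasta` is hypothetical, and unlike the snippets above it strips the leading `>` and the trailing newline from the ID, which may or may not match what the original script intends.

```python
import io


def read_fasta(handle):
    """Read FASTA records from an open text handle into {id: sequence}."""
    fa_dict = {}
    seq_id = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record: join its lines once
            if seq_id is not None:
                fa_dict[seq_id] = "".join(chunks)
            seq_id = line[1:]  # assumption: drop the ">" from the key
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    # Flush the last record, which has no following header
    if seq_id is not None:
        fa_dict[seq_id] = "".join(chunks)
    return fa_dict


# Small in-memory example instead of a real genome file
demo = io.StringIO(">chr1 test\nACGT\nTTAA\n>chr2\nGG\n")
records = read_fasta(demo)
```

For a real file, the same function could be called as `read_fasta(open(sys.argv[1], "rt"))`, keeping in mind that the whole genome is still held in memory.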