题目来自生信技能树论坛
这个题目不难,但是我想说明的是
大的数据集和小的数据集的脚本很多时候是不一样的
比如这道题,如果使用的是小的数据集
>chr1
ATCGTCGaaAATGAANccNNttGTA
AGGTCTNAAccAAttGggG
>chr2
ATCGAATGATCGANNNGccTA
AGGTCTNAAAAGG
>chr3
ATCGTCGANNNGTAATggGA
AGGTCTNAAAAGG
>chr4
ATCGTCaaaGANNAATGANGgggTA
我可以用构建字典的方式,将想要查找的染色体和所有该染色体所有碱基的坐标以及对应碱基构建字典
import sys
args=sys.argv
filename=args[1]
import collections
chr_num=args[2]
base_num=args[3]
aDict=collections.OrderedDict()
num=0
with open(filename) as fh:
chrome=">"+chr_num
for line in fh:
line=line.strip()
if line == chrome :
aDict[chrome]=collections.OrderedDict()
num=1
continue
if num:
if line.startswith(">"):
break
for i in line:
aDict[chrome][num]=i
num+=1
for k,v in aDict[chrome].items():
if k == int(base_num):
print(aDict[chrome][k])
这样,如果想得到chr2的坐标为8,9,10的碱基
python3 chr_base.py test.fa chr2 8
G
python3 chr_base.py test.fa chr2 9
A
python3 chr_base.py test.fa chr2 10
T
但是,如果对于hg38这种大的数据集,这种方法显然是不合适的
构建字典的过程会消耗大量的内存
所以尽量避免使用字典
import sys
args=sys.argv
filename=args[1]
import collections
chr_num=args[2]
base_num=args[3]
up_stream=int(base_num)-1
down_stream=int(base_num)+1
num=0
stop=0
with open(filename) as fh:
chrome=">"+chr_num
for line in fh:
line=line.strip()
if line == chrome :
num=1
continue
if num:
if line.startswith(">"):
break
for i in line:
if num == up_stream:
print(i)
stop=stop+1
if num == int(base_num):
print(i)
stop=stop+1
if num == down_stream:
print(i)
stop=stop+1
num+=1
if stop == 3:
break
这里需要注意的是输入的坐标位置一定要转成int
还有就是,我设置stop这个值是为了提高速度,也就是得到了结果后,后面的就不要遍历了。
time python3 chr_base_bigdata.py hg38.fa chr5 8397384
G
A
C
real 0m21.503s
user 0m21.008s
sys 0m0.488s
如果不加stop
import sys
args=sys.argv
filename=args[1]
import collections
chr_num=args[2]
base_num=args[3]
up_stream=int(base_num)-1
down_stream=int(base_num)+1
num=0
#stop=0
with open(filename) as fh:
chrome=">"+chr_num
for line in fh:
line=line.strip()
if line == chrome :
num=1
continue
if num:
if line.startswith(">"):
break
for i in line:
if num == up_stream:
print(i)
# stop=stop+1
if num == int(base_num):
print(i)
# stop=stop+1
if num == down_stream:
print(i)
# stop=stop+1
num+=1
# if stop == 3:
# break
time python3 chr_base_bigdata.py hg38.fa chr5 8397384
G
A
C
real 1m47.082s
user 1m46.152s
sys 0m0.692s
很明显,一个21s,另外一个1min46s
网友评论