File format:
407 206 399 474 380 505 378 262 16 307 463 239 137 518 114 470
ENST00000456328.2_1 26.0522146463374 12.8134728632941 53.9639191995227 17.6675974730022 23.31847
ENST00000450305.2_1 0 0.71185960351634 0 0 0 4.07723611272235 0.95740454723925
ENST00000488147.1_1 373.714527340564 453.454567439909 381.068290962784 539.450642842335 261.9898
ENST00000473358.1_1 0 0 0.830214141531119 1.17783983153348 1.37167476868719 0 0
ENST00000469289.1_1 0 0 0 0 0 0 0 0.994054947983683 0 0 0
ENST00000417324.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000461467.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000606857.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000642116.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000492842.2_2 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000641515.2_2 0 0 0.830214141531119 0 0 0 0 0 0 0 0
ENST00000335137.4_2 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000466430.5_1 6.28846560428834 7.83045563867974 6.64171313224895 14.1340779784018 1.371674
ENST00000477740.5_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000471248.1_1 0.898352229184049 2.13557881054902 3.32085656612448 0 0 2.03861805636117
ENST00000610542.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000453576.2_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000495576.1_1 0 0 0 0 0 0 0 0 0 0 0 0 0
ENST00000442987.3_1 8.08517006265644 21.3557881054902 16.6042828306224 9.42271865226786 15.08842
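This is a standard gene expression matrix in delimited text form (the fields are tab-separated, which the regular expression below relies on). The file is 352 MB and has 208,938 lines.
Next, without using third-party libraries such as pandas, I will strip the version suffix from the gene names in the first column. To check whether PyPy speeds up for loops, I deliberately use a for loop rather than a list comprehension.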
import re

def remove_dot(content):
    # Strip the ".<digits>_<digits>" suffix that is followed by a tab.
    pattern = re.compile(r"\.\d+_\d+(?=\t)")
    tmp = pattern.sub("", content)
    return tmp

contents = []
with open("mRNA_normlized_by_deseq_quan.txt", 'r') as f:
    for line in f:
        if line.startswith("407"):
            # Header line (sample IDs): keep unchanged.
            contents.append(line)
        else:
            contents.append(remove_dot(line))

with open("test.txt", 'w+') as l:
    l.writelines(contents)
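To make the effect of the pattern concrete, here is a minimal sketch that applies the same regular expression to a single line. The sample line is made up for illustration, modeled on the rows shown earlier; only the pattern itself comes from the script.

```python
import re

# Same pattern as in the script: a dot, digits, an underscore and digits,
# matched only when a tab follows (the lookahead leaves the tab in place).
pattern = re.compile(r"\.\d+_\d+(?=\t)")

# Hypothetical sample line, modeled on the rows shown earlier.
sample = "ENST00000456328.2_1\t26.05\t12.81\t0\n"

cleaned = pattern.sub("", sample)
print(repr(cleaned))
```

The `.2_1` suffix is dropped, while the numeric columns are left alone because they contain no underscore.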
With the code above saved as process.py, run it directly with python3:
time python3 process.py
real 0m7.672s
user 0m4.222s
sys 0m0.961s
The data processing finished in roughly 7.7 seconds.
Now run the same script with the latest PyPy 3.6:
time pypy3 process.py
real 0m15.602s
user 0m7.736s
sys 0m1.262s
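PyPy3 is actually about 8 seconds slower than python3 here?
Looking at the sys time, PyPy is about 0.3 seconds slower, which I attribute to the start-up cost of JIT compilation and warm-up. But even with I/O set aside there is still a gap of more than 3 seconds in user time (7.7 s versus 4.2 s), which is very strange. One possible explanation is that most of the work here happens inside the regex engine and file I/O rather than in the Python-level loop, which leaves the JIT little to speed up. Either way, PyPy is not as magical here as its official site suggests.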
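One way to check where the time actually goes is to time the read, substitute and write phases separately, then run the same file under both interpreters. The sketch below is not the script measured above: the helper names `make_sample` and `timed_phases` and the small synthetic input are assumptions for illustration, and the header check is omitted because the pattern does not match a purely numeric header line.

```python
import os
import re
import tempfile
import time

PATTERN = re.compile(r"\.\d+_\d+(?=\t)")


def make_sample(path, rows=20000, cols=16):
    """Write a small synthetic tab-separated matrix (hypothetical data)."""
    with open(path, "w") as f:
        f.write("\t".join(str(407 + i) for i in range(cols)) + "\n")
        for i in range(rows):
            values = "\t".join("0" for _ in range(cols))
            f.write("ENST%011d.1_1\t%s\n" % (i, values))


def timed_phases(src, dst):
    """Return the seconds spent reading, substituting and writing."""
    t0 = time.perf_counter()
    with open(src) as f:
        lines = f.readlines()
    t1 = time.perf_counter()
    out = []
    for line in lines:  # explicit loop, as in the original script
        out.append(PATTERN.sub("", line))
    t2 = time.perf_counter()
    with open(dst, "w") as f:
        f.writelines(out)
    t3 = time.perf_counter()
    return {"read": t1 - t0, "substitute": t2 - t1, "write": t3 - t2}


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.txt")
    dst = os.path.join(tmp, "out.txt")
    make_sample(src)
    phases = timed_phases(src, dst)
    with open(dst) as f:
        first_rows = [next(f) for _ in range(2)]

for name, seconds in phases.items():
    print("%-10s %.4f s" % (name, seconds))
```

Running this with both `python3` and `pypy3` would show which phase accounts for the gap. A synthetic file this small may not reproduce the behavior seen on the 352 MB input, so the real file is the better test.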
Next, let's bring in Julia, which is also JIT-compiled, to handle the same task:
function remove_dot(content)
    # Strip the ".<digits>_<digits>" suffix that is followed by a tab.
    result = replace(content, r"\.\d+_\d+(?=\t)" => "")
    result
end

function process_file(fs)
    contents = String[]
    open(fs, "r") do IOstream
        # eachline drops the trailing newline from each line.
        for line in eachline(IOstream)
            if occursin(r"^407", line)
                # Header line (sample IDs): keep unchanged.
                push!(contents, line)
            else
                push!(contents, remove_dot(line))
            end
        end
    end
    contents
end

function write2fs(lst, fs)
    open(fs, "w") do io
        for line in lst
            # Add back the newline that eachline removed.
            write(io, line, "\n")
        end
    end
end

write2fs(process_file("mRNA_normlized_by_deseq_quan.txt"), "test.txt")
Let's time it:
time julia process.jl
real 0m3.775s
user 0m1.905s
sys 0m1.939s
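It looks like Julia's loop is the faster one: with the same regular expression and the same algorithm, Julia is about twice as fast as python3 (3.8 s versus 7.7 s of real time).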
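The comparison can be put into numbers with a little arithmetic. The sketch below only re-uses the timings reported in the `time` outputs; the dictionary layout and variable names are assumptions for illustration.

```python
# Timings in seconds, copied from the `time` outputs reported in this post.
timings = {
    "python3": {"real": 7.672, "user": 4.222, "sys": 0.961},
    "pypy3": {"real": 15.602, "user": 7.736, "sys": 1.262},
    "julia": {"real": 3.775, "user": 1.905, "sys": 1.939},
}

baseline = timings["python3"]["real"]
ratios = {}
for name, t in timings.items():
    cpu = t["user"] + t["sys"]           # CPU time charged to the process
    ratios[name] = t["real"] / baseline  # wall-clock time relative to python3
    print("%-8s real=%.3f cpu=%.3f real/python3=%.2f"
          % (name, t["real"], cpu, ratios[name]))
```

On these figures, Julia's wall-clock time is about half of python3's, and PyPy's is about twice python3's. Each figure is a single reported measurement, so the ratios are only indicative.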