计算蛋白理化性质一般在网站,protparam
此网站只可以一次输入一条序列,结果需要挨个复制。当序列条数很多时也是着实费力。
此外直接用Bioperl写脚本也可以,Biopython 不确定是否可以。
直接使用 request 应该也可以
goal :再看一下class 的使用,学一下html语法,format 格式化
issue : 结果数据保存在pre标签内,单项数据不是在标签内,
使用 /following::text()[1] 报错 It should be an element.
solution : 直接获取 标签内所有内容,split 从列表中获取信息
issue : 运行速度很慢,不过应该比手动准确,
solution :可能是服务器在国外,不必等到页面加载完全进行下一步操作,
issue: 对类的概念了解不是太清楚,将参数信息合并到字典里,每条序列保存为json格式
solution :只能写脚本转换为自己需要的格式。
issue: json格式有误,
solutionon: 直接写为tab分隔
test: 使用58条序列测试
result: 用时 2681 s 约44mins , 太慢了。。。。不过还好啦。浏览器最小化后,就可以做其他事了,
此脚本只能在win 下使用,并需正确安装webdriver驱动
获得数据
#!/usr/bin/env python
# coding: utf-8
#usage: python scrpit inputfile outfilename
from selenium import webdriver
from Bio import SeqIO
import re,time,json,sys
st = time.time()
input_file = sys.argv[1]
out_file = sys.argv[2]
#import os
#os.chdir(r"C:\Users\Acer\Desktop\codee\python\expasy")
expasy = webdriver.Chrome()
expasy.get("https://web.expasy.org/protparam/")
class expasy_cal():
'''get physical and chemical parameters for a given protein sequence file
based on web https://web.expasy.org/protparam/'''
def inputseq(seq):
"""input the protein sequence"""
time.sleep(0.3)
while True:
if expasy.find_element_by_xpath('//*[@id="sib_body"]/form/textarea').is_displayed():
expasy.find_element_by_xpath('//*[@id="sib_body"]/form/p[1]/input[1]').click() #获取新网页
expasy.find_element_by_xpath('//*[@id="sib_body"]/form/textarea').send_keys(seq)
expasy.find_element_by_xpath('//*[@id="sib_body"]/form/p[1]/input[2]').click()
break
else:
print("input box is not displayed")
def compute():
"""get the parameters showed below"""
#inbox.send_keys(seq)
time.sleep(0.3) #等待页面加载的时间
while True:
if expasy.find_element_by_xpath('//*[@id="sib_body"]/h2').is_displayed():
pd={}
parameters = expasy.find_element_by_xpath('//*[@id="sib_body"]/pre[2]').text.split("\n\n") #分割不同参数
aaa='\n'.join(parameters)
bbb=re.split("[:\n]",aaa) #将参数值 与 值分割
pd["number_of_amine_acid"] = bbb[1].strip()
pd["molecular_weight"] = bbb[3].strip()
pd["theoretical_pi"] = bbb[5].strip()
pd["instability_index"] = re.findall("[\d.]+",bbb[66])[0] #filter 结果怎么是个对象
pd["aliphatic_index"] = bbb[70].strip()
pd["gravy"] = bbb[72].strip()
return pd
break
else:
print("loading")
with open(out_file,"w",encoding='utf-8') as f:
f.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(
'seq_id',
'number_of_amine_acid',
'molecular_weight',
'theoretical_pi',
'instability_index',
'aliphatic_index',
'gravy'))
pros = SeqIO.parse(input_file,"fasta")
i=0
for pro in pros:
print("="*10,"seq",i+1,"->",pro.id,"on the way","="*10)
expasy_cal.inputseq(seq = pro.seq)
cccc = expasy_cal.compute()
number_of_amine_acid = cccc['number_of_amine_acid']
molecular_weight = cccc['molecular_weight']
theoretical_pi = cccc['theoretical_pi']
instability_index = cccc['instability_index']
aliphatic_index = cccc['aliphatic_index']
gravy = cccc['gravy']
f.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(
pro.id,
number_of_amine_acid,
molecular_weight,
theoretical_pi,
instability_index,
aliphatic_index,
gravy))
i+=1
#single_id_pd = {pro.id:cccc} #单条序列计算结果封装到字典 好像这种不是json格式
#print(single_id_pd)
#json.dump(single_id_pd,f,indent = 4,ensure_ascii=False) #是否要等到全部完成才写入?
#f.write(json.dumps(single_id_pd,indent = 4,ensure_ascii=False)+"\n")
#expasy.get("https://web.expasy.org/protparam/")
expasy.back() #好像back也需要重新加载页面 也是慢
expasy.close()
et = time.time()
print("process finished")
print("taking time",et-st,"s")
结果
seq_id number_of_amine_acid molecular_weight theoretical_pi instability_index aliphatic_index gravy
PgPUB56 458 50457.08 8.02 49.64 101.27 -0.206
PgPUB54 688 75204.03 8.46 50.28 108.21 -0.023
json格式 转换为二维表格格式
存储的格式可能错误,直接存储为tab分格式的吧
网友评论