How to use async
Asynchronous crawling relies on two Python libraries:
- asyncio
- aiohttp
asyncio joined the standard library in Python 3.4, and Python 3.5 added the async and await keywords as syntactic sugar, which makes asynchronous code much easier to write.
Why use async in a crawler?
A crawler, as the name suggests, fetches web page content, but every fetch first has to wait for the site to respond, and a slow response can seriously block all the requests that follow, so a lot of time is wasted just waiting. That is one reason to use async. In addition, async lets you bundle many tasks and run them concurrently in a single thread, which speeds things up dramatically.
How to use asyncio
Liao Xuefeng's Python 3 tutorial has a detailed introduction.
# import the asyncio library first
import asyncio

async def hello():
    print("Hello world!")
    r = await asyncio.sleep(1)
    print("Hello again!")
Note: putting the async keyword before def turns the function into a coroutine (an asynchronous function). However, await must be followed by another asynchronous operation, which can be a generator-based coroutine or another async function, as the small sketch below shows.
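As a minimal sketch (the names fetch_number and show_number are made up for illustration), one coroutine can simply await another:

```python
import asyncio

async def fetch_number():
    # simulate a slow I/O operation
    await asyncio.sleep(1)
    return 42

async def show_number():
    # await another coroutine; execution pauses here until
    # fetch_number() finishes, without blocking the event loop
    value = await fetch_number()
    print("got", value)
```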
To actually run the coroutine, it still has to be placed on an asyncio event loop:
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(hello()) for i in range(3)]
loop.run_until_complete(asyncio.gather(*tasks))
There is another way to write this that is equivalent:
loop = asyncio.get_event_loop()
tasks = [hello() for i in range(3)]
loop.run_until_complete(asyncio.wait(tasks))
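On Python 3.7 and later the same thing can be written more compactly with asyncio.run, which creates and closes the event loop for you; a small sketch reusing the hello() coroutine from above:

```python
import asyncio

async def main():
    # run three hello() coroutines concurrently and wait for all of them
    await asyncio.gather(*(hello() for _ in range(3)))

asyncio.run(main())
```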
Two things to note here:
- For a single task there is no point in going asynchronous: even if it blocks, there is nothing else that could run in the meantime, so you end up waiting anyway.
- Every function in the chain needs to be built as an asynchronous function; if any of them blocks synchronously, it drags the speed down (see the sketch after this list).
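A minimal sketch of that second point (bad_hello and good_hello are made-up names for illustration): a synchronous time.sleep inside the chain stalls the whole event loop, while asyncio.sleep lets other tasks run during the wait:

```python
import asyncio
import time

async def bad_hello():
    print("Hello world!")
    time.sleep(1)           # blocking call: freezes the whole event loop
    print("Hello again!")

async def good_hello():
    print("Hello world!")
    await asyncio.sleep(1)  # non-blocking: other tasks run during the wait
    print("Hello again!")

# three bad_hello() tasks take about 3 seconds in total,
# three good_hello() tasks finish in about 1 second
```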
Massive concurrency
Async tasks run concurrently, but starting too many of them at once will raise errors, so you need to cap how many sub-tasks may run at the same time. You can create a semaphore with
sem = asyncio.Semaphore(50)
to limit the concurrency to at most 50. The sem object then has to be passed into the asynchronous function, and the work wrapped in an async with sem:
block, as the minimal sketch below shows.
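Here is a minimal sketch of that pattern, assuming a list of URLs to download; the names fetch_page and main are made up and are not part of the script below:

```python
import asyncio
import aiohttp

async def fetch_page(sem, session, url):
    # at most 50 coroutines can be inside this block at the same time
    async with sem:
        async with session.get(url) as response:
            return await response.read()

async def main(urls):
    sem = asyncio.Semaphore(50)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(
            *(fetch_page(sem, session, url) for url in urls))
        print("downloaded", len(pages), "pages")
```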
A crawling example
The code is as follows:
import ssl
import argparse
import asyncio
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup
import aiohttp

parser = argparse.ArgumentParser(
    description='YOU CAN USE IT TO DOWNLOAD ALL SUBSTRATES OF ONE PROTEIN FAMILY')
parser.add_argument('--fm', '-f', help='please input your family number!')
parser.add_argument('--out', '-o', help='please input your outputdir!')
args = parser.parse_args()

# one-letter -> three-letter amino acid codes
protein_letters_1to3 = {
    'A': 'Ala', 'C': 'Cys', 'D': 'Asp',
    'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His',
    'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met',
    'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg',
    'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp',
    'Y': 'Tyr',
}

table = []


def protein_letters_3to1(x):
    # invert the table above: three-letter code -> one-letter code
    new_dct = dict((v, k) for k, v in protein_letters_1to3.items())
    if x in new_dct:
        return new_dct[x]
    else:
        return " "


def get_urls(fm):
    # fetch the family summary page synchronously and collect
    # the links to the individual peptidase pages
    url = "https://www.ebi.ac.uk/merops/cgi-bin/famsum?family={}".format(fm)
    result = requests.get(url).content
    root = "https://www.ebi.ac.uk"
    soup = BeautifulSoup(result, 'lxml')
    tab = soup.select('td[align="center"] > a')
    urls = {heihei.get_text(): root + heihei.get("href") for heihei in tab}
    return urls


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.read()


async def parse_substrate(html):
    # map each substrate sequence (columns 6-13, converted to
    # one-letter codes) to the substrate name in the first column
    sub_dct = {}
    soup = BeautifulSoup(html, 'lxml')
    tab = soup.select('tr')
    for tr in tab:
        if tr.select('td'):
            sub_dct["".join([protein_letters_3to1(haha.get_text()) for haha in tr.select(
                'td')[6:14]])] = tr.select('td')[0].get_text()
    return sub_dct

# async def parse_Enzyme(html):
#     soup = BeautifulSoup(html, 'lxml')
#     tab = soup.find("table", {'summary': 'Activity'})
#     if tab:
#         Catalytictype = tab.select("tr")[1].select("td")[1].get_text()
#         return Catalytictype


async def download_substrate(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        return await parse_substrate(html)

# async def download_Enzyme(url):
#     async with aiohttp.ClientSession() as session:
#         html = await fetch(session, url)
#         return await parse_Enzyme(html)


async def parse_Merops(sem, name, url):
    # cat = await download_Enzyme(url)
    # the semaphore caps the number of simultaneous downloads at 50
    async with sem:
        fina = await download_substrate(url.replace("pepsum", "substrates"))
        print("downloading database in url :{}".format(
            url.replace("pepsum", "substrates")))
        if fina:
            pachong = [(name, val, key) for key, val in fina.items()]
            table.extend(pachong)


sem = asyncio.Semaphore(50)
loop = asyncio.get_event_loop()
tasks = [parse_Merops(sem, name, url)
         for name, url in get_urls(args.fm).items()]
loop.run_until_complete(asyncio.wait(tasks))

df = pd.DataFrame(
    table, columns=['merops_id', 'substrate', 'sequence'])
df.to_csv(os.path.join(args.out, "result.csv"), index=False)
print("All done! please check your result file.")