This package is an unofficial Python port of puppeteer and offers much the same functionality.
https://github.com/miyakogi/pyppeteer
Installation
pip install pyppeteer
Downloading and using Chromium
The default download host is DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com', which is unreachable from mainland China without circumventing the firewall; below is a workaround that needs no proxy.
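Before resorting to the manual copy described next, it may be worth checking whether your pyppeteer release lets you point the downloader at a mirror: some versions read a PYPPETEER_DOWNLOAD_HOST environment variable in chromium_downloader.py. This is an assumption, so verify against your installed copy; a minimal sketch:

import os
# Assumption: only some pyppeteer releases honor this variable; the mirror
# URL is a placeholder, substitute one reachable from your network.
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://your-mirror.example.com'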
I installed puppeteer with npm, which downloads a copy of Chromium. On my machine npm placed it under F:\program_nodejs\testpuppeteer\node_modules\puppeteer\.local-chromium\win64-588429\chrome-win32.
By default, pyppeteer stores Chromium under pyppeteer_home = C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\.
Reading the pyppeteer source shows that the directory structure it expects there is local-chromium / REVISION / 'chrome-win32' / 'chrome.exe', where REVISION corresponds to the 588429 revision in the npm path above. Create that directory tree and copy the npm-downloaded Chromium into it, as sketched below.
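Laid out as a tree, the target location looks like this (using the default pyppeteer_home above):

C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\
└── local-chromium
    └── 588429
        └── chrome-win32
            └── chrome.exe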
Note that the Chromium revision must also be specified at runtime: os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
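As a quick sanity check, here is a minimal sketch (my own, not from the original post) that asks pyppeteer where it expects the executable. Note the environment variable must be set before pyppeteer is imported, because the revision is read when the module is imported:

import os
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'  # set BEFORE importing pyppeteer

from pyppeteer import chromium_downloader

# Path pyppeteer will launch, derived from pyppeteer_home and the revision
print(chromium_downloader.chromium_executable())
# True once chrome.exe has been copied into place
print(chromium_downloader.check_chromium())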
Scraping a page
import asyncio
import pyppeteer
import os

os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
pyppeteer.DEBUG = True

async def main():
    print("in main ")
    print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    content = await page.content()
    cookies = await page.cookies()
    # await page.screenshot({'path': 'example.png'})
    await browser.close()
    return {'content': content, 'cookies': cookies}

loop = asyncio.get_event_loop()
task = asyncio.ensure_future(main())
loop.run_until_complete(task)
print(task.result())
Note the use of the asyncio package and the way the page content is retrieved.
Other APIs are documented at https://miyakogi.github.io/pyppeteer/reference.html
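For instance, a short sketch of my own (same setup as above) combining waitForSelector and evaluate; the '#su' selector is just Baidu's search button and stands in for whatever element you care about:

import asyncio
import os
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'

import pyppeteer

async def demo():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    await page.waitForSelector('#su')                     # wait until the element exists
    title = await page.evaluate('() => document.title')   # run JS in the page context
    await browser.close()
    return title

print(asyncio.get_event_loop().run_until_complete(demo()))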
Integration with Scrapy
Add a downloader middleware:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
import pyppeteer
import asyncio
import os
from scrapy.http import HtmlResponse

pyppeteer.DEBUG = False

class FundscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        print("Init downloaderMiddleware use pyppeteer.")
        os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
        # pyppeteer.DEBUG = False
        print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
        # Start one shared browser and page up front by driving the asyncio loop
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.getbrowser())
        loop.run_until_complete(task)
        # self.browser = task.result()
        print(self.browser)
        print(self.page)
        # self.page = await browser.newPage()

    async def getbrowser(self):
        self.browser = await pyppeteer.launch()
        self.page = await self.browser.newPage()
        # return await pyppeteer.launch()

    async def getnewpage(self):
        return await self.browser.newPage()

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.usePypuppeteer(request))
        loop.run_until_complete(task)
        # return task.result()
        return HtmlResponse(url=request.url, body=task.result(), encoding="utf-8", request=request)

    async def usePypuppeteer(self, request):
        print(request.url)
        # page = await self.browser.newPage()
        await self.page.goto(request.url)
        content = await self.page.content()
        return content

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
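To actually route requests through it, the middleware has to be enabled in settings.py; a minimal sketch, assuming the project module is named fundscrapy (adjust the dotted path to your own project):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # hypothetical module path; point it at wherever the class above lives
    'fundscrapy.middlewares.FundscrapyDownloaderMiddleware': 543,
}

One design caveat: process_request drives the asyncio loop with run_until_complete, which blocks Scrapy's Twisted reactor thread for the duration of each page load, so requests are effectively serialized through the single shared page.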