pyppeteer: a Python scraping power tool that sidesteps JS encryption entirely

Author: 14e61d025165 | Published: 2019-06-10 15:27

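    pyppeteer is an unofficial Python port of puppeteer, the Node.js library for driving headless Chrome/Chromium. Headless browsers are widely used for automated testing, and they are also a good approach to scraping.

    The biggest advantage of puppeteer (and other headless browsers) is that it sidesteps JS encryption entirely: a real browser runs the site's own scripts, so you never have to reverse-engineer them. For applications that require login, you can also simulate clicks and then save the cookies. Front-end encryption is often the hardest part of scraping to crack. puppeteer has drawbacks too. The biggest is that it is much slower than scraping the API directly, and even headless Chromium uses a considerable amount of memory. Starting, stopping, and maintaining a browser is an extra burden as well.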

    In this article we write a simple demo that scrapes Pinduoduo's search page. The final result looks like this:

    We save the raw data of every API request:

    [image]

    A sample JSON file looks like this:

    [image]

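    Development environment

    • Python 3.6+

    Python 3.7 is preferable, because asyncio gained the convenient asyncio.run() function in 3.7.

    • Install pyppeteer

    If the installation runs into problems, consult the official documentation.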

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">python3 -m pip install pyppeteer
    </pre>

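    • Install Chromium

    Network conditions in mainland China can make the Chromium build that pyppeteer downloads for itself extremely slow to fetch, so we install Chromium manually and then point the program at it through executablePath.

    Download: www.chromium.org/getting-inv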
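    As a rough illustration of pointing the program at a manually installed browser, the sketch below looks for a Chromium executable before launch. The environment variable name CHROMIUM_PATH and the list of candidate executable names are assumptions made for this example; they are not part of pyppeteer.

```python
import os
import shutil
from typing import Optional

# Hypothetical environment variable name, chosen for this sketch.
ENV_VAR = "CHROMIUM_PATH"
# Common executable names; which ones exist depends on the system.
CANDIDATE_NAMES = ("chromium", "chromium-browser", "google-chrome", "chrome")


def find_chromium(env=None) -> Optional[str]:
    """Return a path to a Chromium executable, or None if none is found."""
    env = os.environ if env is None else env
    explicit = env.get(ENV_VAR)
    if explicit:
        # An explicitly configured path wins over anything on PATH.
        return explicit
    for name in CANDIDATE_NAMES:
        found = shutil.which(name)
        if found:
            return found
    return None
```

    The returned path, when it is not None, can be passed as the 'executablePath' option of launch().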

    hello world

    The pyppeteer hello world program visits example.com and takes a screenshot:

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import asyncio
    from pyppeteer import launch


    async def main():
        browser = await launch({
            # The path differs on Windows and Linux; replace it with the
            # location of your own Chromium executable
            'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',
        })
        page = await browser.newPage()
        await page.goto('http://example.com')
        await page.screenshot({'path': 'example.png'})
        await browser.close()


    asyncio.get_event_loop().run_until_complete(main())
    </pre>
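    The hello world above drives the coroutine with get_event_loop().run_until_complete(), which also works on Python 3.6. A minimal sketch of a helper that prefers asyncio.run() where it exists might look like this; the names run and demo are made up for the example, and demo merely stands in for the pyppeteer main() coroutine.

```python
import asyncio
import sys


def run(coro):
    """Run a coroutine to completion on whichever API this Python offers."""
    if sys.version_info >= (3, 7):
        return asyncio.run(coro)
    # Python 3.6 fallback: create and close the event loop by hand.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


async def demo():
    # Stand-in for the pyppeteer main() coroutine.
    await asyncio.sleep(0)
    return "done"


if __name__ == "__main__":
    print(run(demo()))
```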

    Key pyppeteer APIs

    pyppeteer.launch

    Launches the browser. You can pass a dict to configure a number of options, for example:

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">browser = await pyppeteer.launch({
        'headless': False,  # turn off headless mode
        'devtools': True,  # open Chromium's devtools
        'executablePath': '/path/to/your/Chromium.app/Contents/MacOS/Chromium',
        'args': [
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-gpu',
        ],
        'dumpio': True,
    })
    </pre>

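    The full list of available args is here: peter.sh/experiments…

    What dumpio does: it pipes the browser process's stderr and stdout into the main program. In other words, when it is set to True, Chromium's console output is printed by the main program.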
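    One way to keep these options manageable is to build the dict in a small function and switch the debugging-related ones together. The sketch below is only an illustration: build_launch_options and its debug flag are hypothetical names, while the option keys are the ones shown above.

```python
def build_launch_options(executable_path=None, debug=False):
    """Return an options dict intended for pyppeteer.launch()."""
    options = {
        # Show a real window with devtools only while debugging.
        'headless': not debug,
        'devtools': debug,
        # Forward Chromium's stdout/stderr only while debugging.
        'dumpio': debug,
        'args': ['--mute-audio', '--disable-gpu'],
    }
    if executable_path:
        # Leave the key out entirely when no custom browser is given.
        options['executablePath'] = executable_path
    return options
```

    It would be used as `browser = await pyppeteer.launch(build_launch_options(path, debug=True))`.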

    Injecting JS scripts

    You can do this with page.evaluate, for example:

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">await page.evaluate("""
    () => {
        Object.defineProperties(navigator, {
            webdriver: {
                get: () => false
            }
        })
    }
    """)
    </pre>

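    As we will see, this step is critical. As a matter of policy (not quite the right word, but that is the idea), puppeteer sets window.navigator.webdriver to true, telling the website that the browser is driven by a webdriver. Sites with good anti-scraping measures use this value to decide whether the visitor is a scraper.

    Calling page.evaluate is equivalent to typing that JS snippet into the devtools console.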
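    Because page.evaluate runs in the document that is currently loaded, the override is gone after the next navigation, and scripts that ran while the page was loading have already seen the original value. pyppeteer also provides page.evaluateOnNewDocument, which registers a script to run in every new document before the page's own scripts. A minimal sketch, assuming `page` is a pyppeteer Page and the helper is called before page.goto(); hide_webdriver and FakePage are names invented for this example, and FakePage exists only so the helper can be tried without a browser.

```python
import asyncio

# The same override as above, as a function for the page to run.
HIDE_WEBDRIVER_JS = """
() => {
    Object.defineProperty(navigator, 'webdriver', {
        get: () => false
    })
}
"""


async def hide_webdriver(page):
    """Register the override so that it runs in every new document."""
    await page.evaluateOnNewDocument(HIDE_WEBDRIVER_JS)


class FakePage:
    """Stand-in used only to exercise the helper without a browser."""

    def __init__(self):
        self.scripts = []

    async def evaluateOnNewDocument(self, source):
        self.scripts.append(source)


if __name__ == "__main__":
    fake = FakePage()
    asyncio.run(hide_webdriver(fake))
    print(len(fake.scripts))
```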

    You can also load a JS file:

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">await page.addScriptTag(path=path_to_your_js_file)
    </pre>

    Injected JS scripts can do many useful things, such as scrolling the page automatically.

    Intercepting requests and responses

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">await page.setRequestInterception(True)
    page.on('request', intercept_request)
    page.on('response', intercept_response)
    </pre>

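    intercept_request and intercept_response are two registered callbacks: the first runs before the browser sends a request, and the second runs when a response arrives.

    For example, this is how to block images, media resources, and websocket requests: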

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">async def intercept_request(req):
        """Filter requests."""
        if req.resourceType in ['image', 'media', 'eventsource', 'websocket']:
            await req.abort()
        else:
            await req.continue_()
    </pre>
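    Once interception is switched on, every request has to be either aborted or continued, otherwise it stalls. The sketch below separates the decision from the handler so that the rule is easy to change and to check. should_abort and try_handler are names made up for this example, and FakeRequest is a stand-in that only mimics the three members of a pyppeteer request used here.

```python
import asyncio

BLOCKED_TYPES = {'image', 'media', 'eventsource', 'websocket'}


def should_abort(resource_type):
    """Decide whether a request of this resource type is dropped."""
    return resource_type in BLOCKED_TYPES


async def intercept_request(req):
    # Both branches settle the request, so none is left hanging.
    if should_abort(req.resourceType):
        await req.abort()
    else:
        await req.continue_()


class FakeRequest:
    """Stand-in for a pyppeteer request, used only to try the handler."""

    def __init__(self, resource_type):
        self.resourceType = resource_type
        self.outcome = None

    async def abort(self):
        self.outcome = 'aborted'

    async def continue_(self):
        self.outcome = 'continued'


def try_handler(resource_type):
    """Run the handler on a fake request and report what happened to it."""
    req = FakeRequest(resource_type)
    asyncio.run(intercept_request(req))
    return req.outcome
```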

    Then, each time a response arrives, print its content (only fetch and xhr responses are printed here):

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">async def intercept_response(res):
        resourceType = res.request.resourceType
        if resourceType in ['xhr', 'fetch']:
            resp = await res.text()
            print(resp)
    </pre>
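    Printing is fine for a first look, but the bodies are usually more useful once parsed. The following sketch collects xhr and fetch responses whose body is valid JSON. make_collector and demo are hypothetical names, and the Fake classes exist only so the handler can be tried without a browser; with pyppeteer the handler would be registered through page.on('response', ...).

```python
import asyncio
import json


def make_collector(store):
    """Return a 'response' handler that appends (url, parsed JSON) to store."""
    async def intercept_response(res):
        if res.request.resourceType not in ('xhr', 'fetch'):
            return
        text = await res.text()
        try:
            store.append((res.url, json.loads(text)))
        except ValueError:
            # Not every xhr/fetch body is JSON; skip those that are not.
            pass
    return intercept_response


class FakeRequest:
    def __init__(self, resource_type):
        self.resourceType = resource_type


class FakeResponse:
    """Stand-in for a pyppeteer response, used only to try the handler."""

    def __init__(self, url, resource_type, body):
        self.url = url
        self.request = FakeRequest(resource_type)
        self._body = body

    async def text(self):
        return self._body


def demo():
    store = []
    handler = make_collector(store)
    for res in (
        FakeResponse('http://example.com/api', 'xhr', '{"items": [1, 2]}'),
        FakeResponse('http://example.com/page', 'document', '<html></html>'),
        FakeResponse('http://example.com/ping', 'fetch', 'not json'),
    ):
        asyncio.run(handler(res))
    return store
```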

    The pyppeteer documentation lists all the possible resourceType values:

    [image]

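    Pinduoduo search scraper

    Auto-scrolling the page

    Pinduoduo's search page scrolls infinitely. We want to scroll it automatically, yet still be able to stop early: scrolling forever is pointless, since we may not need that much data.

    JS script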

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">async () => {
        await new Promise((resolve, reject) => {
            // Time between scrolls, in milliseconds
            const interval = 1000;
            // Maximum height to scroll to, so that infinitely scrolling pages can terminate
            const maxScrollHeight = null;
            // Maximum number of scrolls
            const maxScrollTimes = null;
            let currentScrollTimes = 0;
            // Previous scrollHeight, used to tell whether the last scroll loaded anything new
            let scrollHeight = 0;
            // maxTries: a scroll may appear to fail merely because the network is slow
            const maxTries = 5;
            let tried = 0;
            const timer = setInterval(() => {
                if (document.body.scrollHeight === scrollHeight) {
                    // The page did not grow. On a slow network this can happen
                    // before the end is reached, so give up only after
                    // maxTries consecutive failures.
                    tried += 1;
                    if (tried >= maxTries) {
                        console.log("reached the end, now finished!");
                        clearInterval(timer);
                        resolve();
                        return;
                    }
                } else {
                    // The page grew, so reset the failure counter
                    tried = 0;
                }
                scrollHeight = document.body.scrollHeight;
                window.scrollTo(0, scrollHeight);
                window.scrollBy(0, -10);
                currentScrollTimes += 1;
                // Stop once maxScrollTimes is reached, if it is set
                if (maxScrollTimes && currentScrollTimes >= maxScrollTimes) {
                    clearInterval(timer);
                    resolve();
                    return;
                }
                // Stop once maxScrollHeight is reached, if it is set
                if (maxScrollHeight && scrollHeight >= maxScrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, interval);
        });
    }
    </pre>

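    The script has several important parameters:

    • interval: time between scrolls, in milliseconds
    • maxScrollHeight: maximum height the page is allowed to scroll to
    • maxScrollTimes: maximum number of scrolls (recommended, since it gives better control over how much data is scraped)
    • maxTries: maximum number of retries when a scroll fails, for example when the network prevents new content from loading within interval ms

    Replace these with the values you need. You can also open Chrome's developer tools and run this JS script there.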
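    Rather than editing the JS by hand each time, the parameters can be filled in from Python. The sketch below uses a condensed variant of the scroll script with placeholder tokens. The template, the placeholder names, and render_scroll_script are all assumptions made for this example, not the article's original script.

```python
import json

# Condensed variant of the scroll script; __NAME__ tokens are placeholders.
SCROLL_TEMPLATE = """
async () => {
    await new Promise((resolve) => {
        const interval = __INTERVAL__;
        const maxScrollHeight = __MAX_HEIGHT__;
        const maxScrollTimes = __MAX_TIMES__;
        const maxTries = __MAX_TRIES__;
        let scrollHeight = 0, times = 0, tried = 0;
        const timer = setInterval(() => {
            const height = document.body.scrollHeight;
            tried = (height === scrollHeight) ? tried + 1 : 0;
            scrollHeight = height;
            window.scrollTo(0, height);
            times += 1;
            const done = tried >= maxTries
                || (maxScrollTimes !== null && times >= maxScrollTimes)
                || (maxScrollHeight !== null && height >= maxScrollHeight);
            if (done) { clearInterval(timer); resolve(); }
        }, interval);
    });
}
"""


def render_scroll_script(interval=1000, max_scroll_height=None,
                         max_scroll_times=None, max_tries=5):
    """Fill in the placeholders; json.dumps turns None into JS null."""
    values = {
        '__INTERVAL__': interval,
        '__MAX_HEIGHT__': max_scroll_height,
        '__MAX_TIMES__': max_scroll_times,
        '__MAX_TRIES__': max_tries,
    }
    script = SCROLL_TEMPLATE
    for placeholder, value in values.items():
        script = script.replace(placeholder, json.dumps(value))
    return script
```

    The rendered string would then be passed to the page, for instance `await page.evaluate(render_scroll_script(max_scroll_times=20))`.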

    Full code

    The code is only a little over 70 lines and fairly rough, so adapt it to your own needs.

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import os
    import time
    import json
    from urllib.parse import urlsplit
    import asyncio
    import pyppeteer
    # from scripts import scripts  # the author's local module, not shown in this article

    BASE_DIR = os.path.dirname(os.path.abspath(__file__))


    async def intercept_request(req):
        """Filter requests."""
        if req.resourceType in ['image', 'media', 'eventsource', 'websocket']:
            await req.abort()
        else:
            await req.continue_()


    async def intercept_response(res):
        resourceType = res.request.resourceType
        if resourceType in ['xhr', 'fetch']:
            resp = await res.text()
            url = res.url
            tokens = urlsplit(url)
            # Save each response under data/<host>/<path>/<timestamp>.json
            folder = BASE_DIR + '/' + 'data/' + tokens.netloc + tokens.path + "/"
            os.makedirs(folder, exist_ok=True)
            filename = os.path.join(folder, str(int(time.time())) + '.json')
            with open(filename, 'w', encoding='utf-8') as f:
                f.write(resp)


    async def main():
        browser = await pyppeteer.launch({
            # 'headless': False,
            # 'devtools': True
            'executablePath': '/Users/changjiang/apps/Chromium.app/Contents/MacOS/Chromium',
            'args': [
                '--disable-extensions',
                '--hide-scrollbars',
                '--disable-bundled-ppapi-flash',
                '--mute-audio',
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-gpu',
            ],
            'dumpio': True,
        })
        page = await browser.newPage()
        await page.setRequestInterception(True)
        page.on('request', intercept_request)
        page.on('response', intercept_response)
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                                '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299')
        await page.setViewport({'width': 1080, 'height': 960})
        await page.goto('http://yangkeduo.com')
        await page.evaluate("""
        () => {
            Object.defineProperties(navigator, {
                webdriver: {
                    get: () => false
                }
            })
        }
        """)
        # Replace this string with the auto-scroll JS script from the previous section
        await page.evaluate("your auto-scroll JS script")
        await browser.close()


    if __name__ == '__main__':
        asyncio.run(main())
    </pre>
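    The code above names each file after the current time in whole seconds, so two responses from the same URL path that arrive within the same second overwrite each other. One possible way around that, sketched here under the assumption that a simple running counter is acceptable, is to append a sequence number to the file name. response_path is a hypothetical helper, not part of the original code.

```python
import itertools
import os
from urllib.parse import urlsplit

_counter = itertools.count()  # running sequence number for saved responses


def response_path(base_dir, url, timestamp):
    """Build <base_dir>/data/<host>/<path>/<timestamp>-<n>.json."""
    parts = urlsplit(url)
    # Strip slashes so os.path.join does not treat the URL path as absolute.
    folder = os.path.join(base_dir, 'data', parts.netloc, parts.path.strip('/'))
    filename = '{}-{}.json'.format(int(timestamp), next(_counter))
    return os.path.join(folder, filename)
```

    Inside intercept_response the result would replace the hand-built folder and filename, with os.makedirs(os.path.dirname(path), exist_ok=True) called before the file is opened.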


        Article link: https://www.haomeiwen.com/subject/tzmzxctx.html