NetEase Open Courses (网易公开课) can be used to practice Python crawlers too? Outrageous~

Author: 梦想橡皮擦 | Published: 2022-02-27 21:22

This post is the fourth installment in our study of coroutines. Building on the earlier posts, it adds an asynchronous HTTP library: aiohttp.

To show how aiohttp works together with asyncio, this post presents the code as side-by-side comparisons.

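Asynchronous coroutines are mainly used to make I/O-bound work more efficient, so the site scraped in this post is again one dominated by file downloads (MP3 files this time).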

Getting to know aiohttp, with NetEase Open Courses as the example

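aiohttp is an asynchronous HTTP client/server framework built on the asyncio module. The "120 Crawler Examples" column (《爬虫 120 例》) mainly uses its client side to speed up crawling.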

Next, we compare the library with the requests module.

Fetching NetEase Open Courses 20 times synchronously with requests

import requests
import time


def get_html():
    res = requests.get("https://open.163.com/")
    print(len(res.text))


start_time = time.perf_counter()
for i in range(20):
    get_html()

print("Time taken by synchronous requests:", time.perf_counter() - start_time)
# Time taken by synchronous requests: 4.193098181

Fetching NetEase Open Courses 20 times asynchronously with aiohttp + asyncio

import time

import asyncio
import aiohttp


async def get_html():
    async with aiohttp.request('GET', "https://open.163.com/") as res:
        return await res.text()


async def main():
    tasks = [asyncio.ensure_future(get_html()) for i in range(20)]

    dones, pendings = await asyncio.wait(tasks)
    for task in dones:
        print(len(task.result()))


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("Time taken by asynchronous aiohttp:", time.perf_counter() - start_time)
    # Time taken by asynchronous aiohttp: 0.275251032

The conclusion: 20 requests took about 4.2 s with the requests module and about 0.28 s with aiohttp, roughly a 15x difference in this run.
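The speedup comes from overlapping the waiting time of many requests, not from any single request being faster. The following is a minimal sketch of that effect using only the standard library. It makes no network calls: `fake_fetch` and `DELAY` are hypothetical stand-ins for a request and its latency, so the timings it prints are illustrative only.

```python
import asyncio
import time

DELAY = 0.05  # assumed latency of one simulated request, in seconds


async def fake_fetch(i):
    # Hypothetical stand-in for a network request: it only waits.
    await asyncio.sleep(DELAY)
    return i


async def run_sequential(n):
    # Await each request before starting the next, like a synchronous loop.
    return [await fake_fetch(i) for i in range(n)]


async def run_concurrent(n):
    # Start all requests at once; gather returns results in call order.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def timed(coro_func, n):
    # Run one of the coroutines above and measure its wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_func(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(run_sequential, 20)
    _, con_time = timed(run_concurrent, 20)
    print("sequential:", round(seq_time, 3), "s")
    print("concurrent:", round(con_time, 3), "s")
```

The sequential version takes about 20 times DELAY, while the concurrent version takes about one DELAY plus scheduling overhead.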

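For systematic study of aiohttp, the official documentation is very clear: https://docs.aiohttp.org/en/stable/. Note that aiohttp is not a built-in module and must be installed first (for example with pip install aiohttp).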

This column uses aiohttp only as a client, so only that part is covered here.

Requesting a site and returning its data

import aiohttp
import asyncio

async def main():

    async with aiohttp.ClientSession() as session:
        async with session.get("http://httpbin.org/get") as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())

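The main() function involves two objects. The first is ClientSession; the second, which is not labeled explicitly, is ClientResponse (the resp variable). They correspond to the request side and the response side respectively.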

aiohttp is easiest to learn by comparison with requests. For example, a ClientSession object offers one method per HTTP verb: get, post, put, delete, head, options, and patch, of which get and post are used most.

If you do not need to keep session state across requests, you can send a request and get the response directly with aiohttp.request, as in the code below.

import aiohttp
import asyncio

async def main():
    async with aiohttp.request("GET", "http://httpbin.org/get") as resp:
        html = await resp.text(encoding="utf-8")
        print(html)


asyncio.run(main())

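The benefit of ClientSession is that you do not create a new session for every request: the session object created once, along with its connection pool, can carry out all of them.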

The code from the start of this post can therefore be modified as follows, although the timing barely changes.

import time

import asyncio
import aiohttp


async def get_html(client):
    async with client.get("https://open.163.com/") as resp:
        return await resp.text()


async def main():
    async with aiohttp.ClientSession() as client:
        tasks = [asyncio.ensure_future(get_html(client)) for i in range(20)]

        dones, pendings = await asyncio.wait(tasks)
        for task in dones:
            print(len(task.result()))


if __name__ == '__main__':
    start_time = time.perf_counter()
    asyncio.run(main())
    print("Time taken by asynchronous aiohttp:", time.perf_counter() - start_time)

To fetch binary data such as images, change await resp.text() in the code above to await resp.read().
If the target returns JSON, use await resp.json().

Request parameters in aiohttp

The parameters are much the same across request methods, so the notes below all use GET requests.

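params: builds the URL query string; accepted forms are [("var1",1),("var2",2)], {"var1": 1,"var2": 2}, and the string var1=1&var2=2;
headers: request headers;
cookies: cookies sent with the request;
data: used for POST requests, in the form {"var1": 1,"var2": 2};
timeout: timeout settings;
proxy: proxy settings.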
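The three forms accepted by params all describe the same query string. The sketch below illustrates that equivalence with the standard library's urlencode. It is not how aiohttp builds URLs internally, and `build_url` is a hypothetical helper written only for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://httpbin.org/get"  # same test endpoint as in the earlier examples


def build_url(base, params):
    # A string is treated as an already-encoded query string;
    # dicts and lists of pairs are encoded with urlencode.
    query = params if isinstance(params, str) else urlencode(params)
    return f"{base}?{query}"


as_pairs = build_url(BASE_URL, [("var1", 1), ("var2", 2)])
as_dict = build_url(BASE_URL, {"var1": 1, "var2": 2})
as_string = build_url(BASE_URL, "var1=1&var2=2")

print(as_pairs)
print(as_pairs == as_dict == as_string)
```

With aiohttp itself, any of the three values would simply be passed as the params argument of session.get.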

That concludes the introduction. Next comes the actual coding.

Writing the bensound crawler

The target site this time is https://www.bensound.com/royalty-free-music.
The page contains a large number of MP3 files, which this post collects.

[Figure: Python coroutines lesson 4; target data: MP3; target site: bensound.com]
Analysis shows that an MP3 download address looks like this:

https://www.bensound.com/bensound-music/bensound-allthat.mp3

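This address can be assembled from data on the list page. The developer tools show each track's cover image address (below), and a bit of Python string handling turns it into the link above.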

https://www.bensound.com/bensound-img/allthat.jpg

The conversion code is as follows:

img_url = "https://www.bensound.com/bensound-img/allthat.jpg"
name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
print(mp3_url)
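The same conversion can also be written with urllib.parse and pathlib, which keeps working if the image address ever carries a query string. This is only an alternative sketch; `cover_to_mp3_url` is a hypothetical name, and the URL pattern is the one observed above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def cover_to_mp3_url(img_url):
    # File name of the URL path without its extension, e.g. "allthat".
    name = PurePosixPath(urlparse(img_url).path).stem
    return f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"


print(cover_to_mp3_url("https://www.bensound.com/bensound-img/allthat.jpg"))
```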

With the conversion code written, first measure how long it takes to fetch 20 list pages with aiohttp.

import time

import asyncio
import aiohttp

from bs4 import BeautifulSoup
import lxml  # not used directly; BeautifulSoup's 'lxml' parser needs it installed


async def get_html(client, url):
    # Fetch one list page and return the MP3 URLs derived from its cover images
    print("Fetching", url)
    async with client.get(url) as resp:
        html = await resp.text()
        soup = BeautifulSoup(html, 'lxml')
        divs = soup.find_all(attrs={'class': 'img_mini'})
        mp3_urls = [get_mp3_url("https://www.bensound.com/" + div.a.img["src"]) for div in divs]
        return mp3_urls


def get_mp3_url(img_url):
    # Cover image file name without its extension, e.g. "allthat"
    name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

    mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
    return mp3_url


async def main(urls):
    async with aiohttp.ClientSession() as client:
        tasks = [asyncio.ensure_future(get_html(client, url)) for url in urls]

        # asyncio.wait returns sets, so results are not in page order
        dones, pendings = await asyncio.wait(tasks)
        print("Async run finished, printing results:")
        for task in dones:
            print(task.result())


if __name__ == '__main__':
    url_format = "https://www.bensound.com/royalty-free-music/{}"
    urls = [url_format.format(i) for i in range(1, 21)]
    start_time = time.perf_counter()
    asyncio.run(main(urls))
    print("Time taken by asynchronous aiohttp:", time.perf_counter() - start_time)

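When run, the code above prints each list page URL as it is fetched, followed by the MP3 addresses extracted from each page.
The remaining code is straightforward and largely mirrors the previous post.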

import os
import time

import asyncio
import aiohttp

from bs4 import BeautifulSoup
import lxml  # not used directly; BeautifulSoup's 'lxml' parser needs it installed


async def get_html(client, url):
    print("Fetching", url)
    # Per-request timeout of 5 seconds, overriding the session-wide timeout
    async with client.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        html = await resp.text()
        soup = BeautifulSoup(html, 'lxml')
        divs = soup.find_all(attrs={'class': 'img_mini'})
        mp3_urls = [get_mp3_url("https://www.bensound.com/" + div.a.img["src"]) for div in divs]
        return mp3_urls


def get_mp3_url(img_url):
    # Cover image file name without its extension, e.g. "allthat"
    name = img_url[img_url.rfind("/") + 1:img_url.rfind(".")]

    mp3_url = f"https://www.bensound.com/bensound-music/bensound-{name}.mp3"
    return mp3_url


async def get_mp3_file(client, url):
    print("Downloading MP3 file", url)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36",
        "Referer": "https://www.bensound.com/royalty-free-music"
    }
    # Text between the last "-" and the extension, e.g. "allthat"
    mp3_file_name = url[url.rfind('-') + 1:url.rfind('.')]
    print(mp3_file_name)
    async with client.get(url, headers=headers) as resp:
        content = await resp.read()
        with open(f'./mp3/{mp3_file_name}.mp3', "wb") as f:
            f.write(content)
        return (url, "success")


async def main(urls):
    os.makedirs("./mp3", exist_ok=True)  # make sure the output directory exists
    timeout = aiohttp.ClientTimeout(total=600)  # total timeout per request: 600 seconds
    connector = aiohttp.TCPConnector(limit=50)  # at most 50 simultaneous connections
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as client:
        tasks = [asyncio.ensure_future(get_html(client, url)) for url in urls]

        dones, pendings = await asyncio.wait(tasks)
        print("Async run finished, printing results:")
        all_mp3 = []
        for task in dones:
            all_mp3.extend(task.result())

        total = len(all_mp3)
        print("Found", total, "MP3 files in total")
        print("_" * 100)
        print("Preparing to download MP3 files")

        # Download 10 files per batch (ceiling division gives the batch count)
        total_page = total // 10 if total % 10 == 0 else total // 10 + 1

        for page in range(0, total_page):
            print("Downloading batch {} of MP3 files".format(page + 1))
            start_page = page * 10
            end_page = (page + 1) * 10
            print("URLs to download:")
            print(all_mp3[start_page:end_page])
            mp3_download_tasks = [asyncio.ensure_future(get_mp3_file(client, url)) for url in
                                  all_mp3[start_page:end_page]]
            mp3_dones, mp3_pendings = await asyncio.wait(mp3_download_tasks)
            for task in mp3_dones:
                print(task.result())


if __name__ == '__main__':
    url_format = "https://www.bensound.com/royalty-free-music/{}"
    urls = [url_format.format(i) for i in range(1, 5)]
    start_time = time.perf_counter()
    asyncio.run(main(urls))
    print("Time taken by asynchronous aiohttp:", time.perf_counter() - start_time)

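A screenshot of the run follows. Because MP3 files are fairly large, the crawl is limited to list pages 1 through 4 (range(1, 5)).

[Screenshot: Python coroutines lesson 4; target data: MP3; target site: bensound.com]
The code above also configures ClientSession globally, as follows.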

timeout = aiohttp.ClientTimeout(total=600)  # total timeout per request: 600 seconds
connector = aiohttp.TCPConnector(limit=50)  # at most 50 simultaneous connections

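These parameters are set because some servers limit the number of parallel TCP connections a single IP may open. aiohttp allows 100 simultaneous connections by default, and the limit can be adjusted manually.
The timeout is set because aiohttp's default total timeout is 300 s (5 minutes). If a request, including reading the response body, takes longer than that, aiohttp aborts it with a timeout error, which large downloads can easily hit.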
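The fixed batches of 10 in main() wait for the slowest file in each batch before starting the next. One alternative, swapped in here purely for illustration, is to cap concurrency with asyncio.Semaphore so a new download starts as soon as any slot frees up. The sketch uses only the standard library: `fake_download`, `download_all`, and `MAX_CONCURRENT` are hypothetical names, and the download is simulated with a short sleep.

```python
import asyncio

MAX_CONCURRENT = 10  # assumed cap, mirroring the batch size of 10 above


async def fake_download(url, semaphore, state):
    # Hypothetical stand-in for get_mp3_file: waits instead of downloading.
    async with semaphore:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)
        state["running"] -= 1
    return (url, "success")


async def download_all(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"running": 0, "peak": 0}  # tracks how many downloads overlap
    # gather keeps results in the same order as urls.
    results = await asyncio.gather(
        *(fake_download(url, semaphore, state) for url in urls)
    )
    return results, state["peak"]


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}.mp3" for i in range(35)]
    results, peak = asyncio.run(download_all(demo_urls))
    print(len(results), "downloads finished, peak concurrency:", peak)
```

In the real crawler the sleep would be replaced by the client.get call, and the TCPConnector limit would still bound the number of open connections underneath.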

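Closing notes

For the complete code, see the pinned comment in the comments section.

Today is day 244 / 365 of continuous writing.
Follows, likes, comments, and bookmarks are appreciated.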
