6. Web Scraping: Scraping the Maoyan Movies Top 100 with Requests and Regular Expressions

Author: 王阿根 | Published 2019-02-13 18:25


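Workflow:

Fetch a single page: use requests to request the page and return its HTML source.
Parse with regular expressions: based on an analysis of the HTML, extract the film title, cast, release date, score, image link, and so on.
Save to a file: store the results in a file, with each film stored as one JSON string.
Loop: iterate over the remaining pages.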

HTML analysis

top100第一页.png (Top 100, first page)

top100第二页.png (Top 100, second page)

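The figures above show the first 20 entries of the Maoyan Top 100 board. Watch how the request parameter offset changes: it is 0 when requesting the first page and 10 when requesting the second. You can check the remaining 8 pages to see how offset keeps changing.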
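The offset pattern can be sketched as a small helper that builds all ten page URLs. This is only an illustration: the names BASE_URL and page_urls are hypothetical, and it assumes the offset keeps growing by 10 per page as observed on the first two pages.

```python
BASE_URL = "https://maoyan.com/board/4"

def page_urls(pages=10, page_size=10):
    """Build the URL of every board page from its offset (0, 10, 20, ...)."""
    return [f"{BASE_URL}?offset={page * page_size}" for page in range(pages)]

urls = page_urls()
print(urls[0])   # https://maoyan.com/board/4?offset=0
print(urls[-1])  # https://maoyan.com/board/4?offset=90
```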

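Next, let's look at the page source, taking the top-ranked film, 霸王别姬 (Farewell My Concubine), as an example, as shown in the figure: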

1.png
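Each film's information is wrapped in a dd element. The title is a hyperlink, the poster is an img, the cast has class='star', the score is the concatenation of two parts, class='integer' and class='fraction', and the rank has class='board-index'. The other films have the same structure as the first one, so they are not described separately.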
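To make that structure concrete, here is a rough sketch that pulls a few fields out of one dd block. The markup in SAMPLE_DD is hypothetical: it is reconstructed from the description above with illustrative values, and the real page may differ in attributes and layout.

```python
import re

# Hypothetical markup with illustrative values, modelled on the description above.
SAMPLE_DD = """
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/0" title="霸王别姬">霸王别姬</a></p>
  <p class="star">主演：张国荣,张丰毅,巩俐</p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
"""

# Each field is located by its class name and captured with a lazy group.
rank = re.search(r'board-index.*?>(\d+)</i>', SAMPLE_DD, re.S).group(1)
title = re.search(r'name"><a.*?>(.*?)</a>', SAMPLE_DD, re.S).group(1)
integer = re.search(r'integer">(.*?)</i>', SAMPLE_DD, re.S).group(1)
fraction = re.search(r'fraction">(.*?)</i>', SAMPLE_DD, re.S).group(1)

# The score is stored in two pieces, so they are joined back together.
score = integer + fraction
print(rank, title, score)  # 1 霸王别姬 9.6
```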

Now let's start scraping the data.
First, import the packages:

import requests
import re
import json
# Written for Python 3, where str is Unicode and no sys.setdefaultencoding() call is needed.

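Write a function that uses a regular expression to extract the rank, film title, cast, release date, score, image link and so on, builds a dictionary from each match, and yields it, which makes the function a generator: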

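def yie(html):
    # re.S lets "." match newlines, since each <dd> block spans several lines
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)

    for item in items:
        yield {
            "index": item[0],
            "image": item[1],
            "title": item[2],
            "actor": item[3].strip()[3:],  # drop the 3-character label before the names
            "time": item[4].strip()[5:],   # drop the 5-character label before the date
            "score": item[5] + item[6],    # integer part + fractional part
        }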
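The fixed slices [3:] and [5:] presumably remove the labels 主演： and 上映时间： (3 and 5 characters long); that reading is an assumption based on the slice lengths. A slightly more tolerant alternative, sketched here with a hypothetical helper named strip_label, splits on the full-width colon instead of counting characters:

```python
def strip_label(text):
    """Drop a leading 'label：' prefix (full-width colon) if one is present."""
    text = text.strip()
    _, sep, value = text.partition("：")
    # If there is no colon, partition() leaves sep empty, so keep the text as-is.
    return value if sep else text

print(strip_label("\n  主演：张国荣,张丰毅,巩俐\n "))  # 张国荣,张丰毅,巩俐
print(strip_label("上映时间：1993-01-01"))            # 1993-01-01
print(strip_label("no label here"))                   # no label here
```

With this helper, a change in the label wording would not silently cut off the first characters of a name or date.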

The regular expression in yie breaks down as follows:

First, match the rank; the group (\d+) captures it:

<dd>.*?board-index.*?>(\d+)</i>

Next, match the image; the group (.*?) captures the image URL:

.*?data-src="(.*?)"

Next, match the film title (name); the group (.*?) captures the title:

.*?name"><a.*?>(.*?)</a>

Next, match the cast (star); the group (.*?) captures the cast:

.*?star">(.*?)</p>

Next, match the release date (releasetime); the group (.*?) captures the release date:

.*?releasetime">(.*?)</p>

Next, match the left half of the score; the group (.*?) captures it:

.*?integer">(.*?)</i>

Finally, match the right half of the score; the group (.*?) captures it:

.*?fraction">(.*?)</i>
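Two details make these fragments work: the lazy quantifier .*? stops at the first possible closing tag, and the re.S flag lets . cross line breaks. The toy string below is made up purely to show the difference:

```python
import re

# Two made-up <p> elements whose content spans several lines.
html = '<p class="star">\n  A\n</p><p class="star">\n  B\n</p>'

lazy = re.findall(r'star">(.*?)</p>', html, re.S)   # one match per element
greedy = re.findall(r'star">(.*)</p>', html, re.S)  # swallows both elements
no_dotall = re.findall(r'star">(.*?)</p>', html)    # "." cannot cross "\n"

print([s.strip() for s in lazy])  # ['A', 'B']
print(len(greedy))                # 1
print(no_dotall)                  # []
```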

The following example fetches a single page and stores the results in result.txt:

html = requests.get("https://maoyan.com/board/4")
print(html.status_code)

# Open the file once; the with block closes it automatically (Python 3)
with open('result.txt', 'a', encoding='utf-8') as f:
    for item in yie(html.text):
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
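A slightly more defensive fetch might look like the sketch below. The function name get_one_page and the User-Agent value are assumptions for illustration. Some sites reject the default client User-Agent, but the original does not say whether Maoyan does, so treat the header as optional.

```python
import requests

# Assumed header value: any ordinary browser User-Agent string would do here.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def get_one_page(url, timeout=10):
    """Return the page HTML, or None if the request fails or is not a 200."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
    except requests.RequestException:
        # Covers connection errors, timeouts, and similar request failures.
        return None
    if response.status_code != 200:
        return None
    return response.text
```

Calling get_one_page("https://maoyan.com/board/4") would then return either the HTML text to pass to yie, or None when the page could not be fetched.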

Stored result:

1.png
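Because each line of result.txt is one JSON string, the file can be read back line by line. The helper below is a minimal sketch with a hypothetical name, parse_lines, shown on an in-memory sample rather than on the real file:

```python
import json

def parse_lines(lines):
    """Turn an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# In-memory stand-in for the lines of result.txt.
sample = ['{"index": "1", "title": "霸王别姬"}\n', '\n']
movies = parse_lines(sample)
print(movies[0]["title"])  # 霸王别姬
```

For the real file, the same function can be given the open file object, for example inside with open('result.txt', encoding='utf-8') as f.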

Next, loop over all 10 pages. As observed earlier, the offset parameter starts at 0 and increases by 10 each time (0, 10, 20, ...):

# Open the file once; the with block closes it automatically (Python 3)
with open('result.txt', 'a', encoding='utf-8') as f:
    for i in range(10):
        url = "https://maoyan.com/board/4?offset=" + str(i * 10)
        html = requests.get(url)
        print(url)
        for item in yie(html.text):
            print(item)
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
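One possible refinement of the loop, sketched under assumptions: pause briefly between requests and skip pages that fail to load. The function crawl and its parameters fetch, parse, and delay are hypothetical names. fetch stands for any function that returns HTML or None, and parse for a function such as yie.

```python
import time

def crawl(fetch, parse, pages=10, delay=1.0):
    """Fetch each board page in turn and return every parsed item in order."""
    results = []
    for page in range(pages):
        url = "https://maoyan.com/board/4?offset=" + str(page * 10)
        html = fetch(url)
        if html is not None:  # skip pages that could not be fetched
            results.extend(parse(html))
        time.sleep(delay)     # wait between requests to go easy on the server
    return results
```

Passing the fetching and parsing functions in as arguments also makes the loop easy to try out with stand-in functions before pointing it at the live site.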

Part of the result of scraping all ten pages:

2.png