Python + PyQt5: A Visual Scraper Tool for American TV Shows


Author: 14e61d025165 | Source: published 2019-04-18 14:24, read 0 times

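    The final season of Game of Thrones has finally started airing. Last week I wrote a simple visual scraper for American TV shows (link: https://www.cnblogs.com/weijiutao/p/10614694.html ). To my surprise, some readers actually used it, and one of them made a suggestion: the links it scraped were download links, so an episode had to be downloaded before it could be watched, and a version that returns links for watching online would be more useful. Hence this post.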

    Without further ado, here is the result of running it:

    [Image: screenshot of the running tool]

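    It is not much different from the previous version. What has changed is the site being scraped and the scraping logic, which had to be redone for that site.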

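    Note: because the scraping flow here is largely the same as in the previous post, this post only gives a brief explanation. For the detailed walkthrough, see the link above.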


    The full code is as follows:

    <pre>
import urllib.request
from urllib import parse
from lxml import etree
import math
import ssl
from PyQt5.QtWidgets import QApplication, QWidget, QLineEdit, QTextEdit, QVBoxLayout, QPushButton, QMessageBox
import sys

# Disable HTTPS certificate verification (insecure; acceptable only for a quick demo)
ssl._create_default_https_context = ssl._create_unverified_context


class TextEditMeiJu(QWidget):
    def __init__(self, parent=None):
        super(TextEditMeiJu, self).__init__(parent)
        # Window title
        self.setWindowTitle('爱美剧')
        # Initial window size
        self.resize(500, 600)
        # Single-line text box for the show name
        self.textLineEdit = QLineEdit()
        # Confirm button
        self.btnButton = QPushButton('OK')
        # Multi-line text box for the results
        self.textEdit = QTextEdit()
        # Create a vertical layout
        layout = QVBoxLayout()
        # Add the widgets to the vertical layout
        layout.addWidget(self.textLineEdit)
        layout.addWidget(self.btnButton)
        layout.addWidget(self.textEdit)
        # Apply the layout
        self.setLayout(layout)
        # Connect the button's clicked signal to its slot
        self.btnButton.clicked.connect(self.buttonClick)

    # Handle a click on the confirm button
    def buttonClick(self):
        # Ask for confirmation before scraping starts
        start = QMessageBox.information(
            self, 'Notice', 'Start scraping "' + self.textLineEdit.text() + '"?',
            QMessageBox.Ok | QMessageBox.No, QMessageBox.Ok
        )
        # Confirmed: start scraping from page 1
        if start == QMessageBox.Ok:
            self.page = 1
            self.loadSearchPage(self.textLineEdit.text(), self.page)
        # Cancelled: do nothing
        else:
            pass

    # Load the search result page for the entered show name
    def loadSearchPage(self, name, page):
        # Percent-encode the UTF-8 bytes of the show name
        name = parse.quote(name.encode('utf-8'))
        # URL of the search request
        url = "https://www.imeiju.cc/search.php?page=" + str(page) + "&searchword=" + name + "&searchtype="
        # Request headers
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}
        # Build the request
        request = urllib.request.Request(url, headers=headers)
        # Fetch the HTML document
        html = urllib.request.urlopen(request).read()
        # Parse the HTML document
        text = etree.HTML(html)
        # Use XPath to get the total number of results
        numberTotal = text.xpath('//span[@class="text-color"][2]/text()')
        # Guard against a page that has no result counter at all
        if not numberTotal:
            self.infoSearchNull()
            return
        # Strip the quotation marks around the total count
        numberTotal = numberTotal[0][1:][:-1]
        # The site shows 10 results per page, so divide by 10 and round up
        pageTotal = math.ceil(int(numberTotal) / 10)
        # The search returned results
        if pageTotal != 0:
            self.loadDetailPage(pageTotal, text, headers)
        # The search returned nothing
        else:
            self.infoSearchNull()

    # Load the detail page of each season listed on the search page
    def loadDetailPage(self, pageTotal, text, headers):
        # Get each season's entry (title and link)
        node_list = text.xpath('//div[@class="hy-video-details active clearfix"]//div[@class="head"]//a')
        items = {}
        items['name'] = self.textLineEdit.text()
        # Loop over the seasons
        for node in node_list:
            # Extract the title and the link
            title = node.xpath('text()')[0]
            link = node.xpath('@href')[0]
            items["title"] = title
            # Follow the season link to its detail page
            requestDetail = urllib.request.Request("https://www.imeiju.cc" + link, headers=headers)
            htmlDetail = urllib.request.urlopen(requestDetail).read()
            textDetail = etree.HTML(htmlDetail)
            # Only the first playback source panel is used
            node_listDetail = textDetail.xpath('//div[@class="panel clearfix"][1]//ul/li/a/@href')
            self.writeDetailPage(items, node_listDetail)
        # Tell the user whether scraping is finished or more pages remain
        if self.page == int(pageTotal):
            self.infoSearchDone()
        else:
            self.infoSearchContinue(pageTotal)

    # Show the data in the GUI
    def writeDetailPage(self, items, node_listDetail):
        for index, nodeLink in enumerate(node_listDetail):
            items["link"] = nodeLink
            # Append one entry per episode to the result box
            self.textEdit.append(
                "<div>"
                "<font color='black' size='3'>" + items['name'] + "</font>" + "\n"
                "<font color='red' size='3'>" + items['title'] + "</font>" + "\n"
                "<font color='orange' size='3'>Episode " + str(index + 1) + "</font>" + "\n"
                "<font color='green' size='3'>Playback link:</font>" + "\n"
                "<font color='blue' size='3'>https://www.imeiju.cc" + items['link'] + "</font>"
                "<p></p>"
                "</div>"
            )

    # Message shown when the search returns nothing
    def infoSearchNull(self):
        QMessageBox.information(
            self, 'Notice', 'No results found. Please enter a different search term.',
            QMessageBox.Ok, QMessageBox.Ok
        )

    # Message shown when scraping is complete
    def infoSearchDone(self):
        QMessageBox.information(
            self, 'Notice', 'Finished scraping "' + self.textLineEdit.text() + '".',
            QMessageBox.Ok, QMessageBox.Ok
        )

    # Ask whether to continue when there are more result pages
    def infoSearchContinue(self, pageTotal):
        end = QMessageBox.information(
            self, 'Notice', 'Finished page ' + str(self.page) + ' of "' + self.textLineEdit.text() + '". ' + str(
                int(pageTotal) - self.page) + ' page(s) remaining. Continue scraping?',
            QMessageBox.Ok | QMessageBox.No, QMessageBox.No
        )
        if end == QMessageBox.Ok:
            self.page += 1
            self.loadSearchPage(self.textLineEdit.text(), self.page)
        else:
            pass


if __name__ == '__main__':
    app = QApplication(sys.argv)
    win = TextEditMeiJu()
    win.show()
    sys.exit(app.exec_())
    </pre>

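    If you can run Python locally, you can copy and paste the code above and run it directly, provided you have first installed the packages it depends on (PyQt5 and lxml) with pip.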
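    Before launching the tool, it can help to confirm that the two third-party packages imported by the script, PyQt5 and lxml, are actually installed. The sketch below is only one possible way to check. The helper name `missing_packages` is made up for this illustration and is not part of the original script.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names in module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # PyQt5 and lxml are the third-party modules imported by the script above
    absent = missing_packages(["PyQt5", "lxml"])
    if absent:
        print("Missing packages, install them with pip: " + ", ".join(absent))
    else:
        print("All required packages are available.")
```

    If the check reports a missing name, installing the package of the same name with pip and running the check again should be enough.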

    The site we scrape this time is 爱美剧, https://www.imeiju.cc/ . The steps are almost the same as in the previous post, so here is just a brief outline of the flow:

    First, search for the show we want to watch using the search box at the top right of the site:

    [Image: the search box on the site]

    This takes us to the list of results for that show:

    [Image: the search result list]

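    As with 美剧天堂, the URL shown in the browser's address bar is still not the one we need. Once again, clicking a pagination link at the bottom of the page reveals the real URL: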

    https://www.imeiju.cc/search.php?page=1&searchword=%E6%9D%83%E5%8A%9B%E7%9A%84%E6%B8%B8%E6%88%8F&searchtype=

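    With the request parameters page and searchword from the URL above, we can start scraping the data. After that, we use XPath to locate elements on the page and extract the links to follow, then open each of those links to get the playback links for the show we want to watch.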
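    The request parameters described above can be sketched in isolation as follows. This is a minimal illustration, not the code the tool itself uses: the helper names `build_search_url` and `total_pages` are hypothetical, and `urllib.parse.urlencode` is used here in place of the manual string concatenation in the full script. The figure of 10 results per page is the one the script assumes for this site.

```python
import math
from urllib import parse

SEARCH_URL = "https://www.imeiju.cc/search.php"  # endpoint taken from the URL above
RESULTS_PER_PAGE = 10  # the script assumes the site lists 10 results per page


def build_search_url(name, page=1):
    """Build the search URL for a show name and a 1-based page number."""
    # urlencode percent-encodes the UTF-8 bytes of the show name
    query = parse.urlencode({"page": page, "searchword": name, "searchtype": ""})
    return SEARCH_URL + "?" + query


def total_pages(result_count):
    """Number of result pages needed for result_count search hits."""
    return math.ceil(result_count / RESULTS_PER_PAGE)


if __name__ == "__main__":
    print(build_search_url("权力的游戏", 1))
    print(total_pages(23))
```

    One difference worth knowing: `urlencode` encodes a space as `+`, while `parse.quote` in the full script encodes it as `%20`. For a name without spaces, such as the one above, both produce the same URL.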

    Note what happens when we open the page for the season we want, for example Game of Thrones Season 4 from the list above:

    [Image: the detail page for Game of Thrones Season 4]

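    The page offers both online playback and downloads, and this time we choose online playback. Online playback in turn has several player sources. I only took the first one, 百度云播, which works without any problems. If you want the links from the other sources as well, feel free to copy the code above and modify it. The code is commented in detail, so it should be easy to follow.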
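    As a starting point for collecting every playback source instead of only the first, here is a rough sketch. It rests on several assumptions: the sample markup is made up (only the class name "panel clearfix" and the ul/li/a nesting come from the XPath in the script, while the href values are invented), and the standard library's `xml.etree.ElementTree` stands in for lxml so the sketch runs without extra packages. Real pages are usually not well-formed XML, so in the actual tool the equivalent change would be to drop the `[1]` from the lxml XPath and loop over the matching panels.

```python
import xml.etree.ElementTree as ET

BASE_URL = "https://www.imeiju.cc"

# Hypothetical, simplified markup: one "panel clearfix" div per playback source
SAMPLE_HTML = """
<div>
  <div class="panel clearfix">
    <ul>
      <li><a href="/play/1-0-0.html">Episode 1</a></li>
      <li><a href="/play/1-0-1.html">Episode 2</a></li>
    </ul>
  </div>
  <div class="panel clearfix">
    <ul>
      <li><a href="/play/1-1-0.html">Episode 1</a></li>
    </ul>
  </div>
</div>
"""


def links_per_source(markup):
    """Return one list of absolute episode links per playback-source panel."""
    root = ET.fromstring(markup.strip())
    sources = []
    # Visit every panel instead of only the first one
    for panel in root.findall(".//div[@class='panel clearfix']"):
        links = [BASE_URL + a.get("href") for a in panel.findall(".//ul/li/a")]
        sources.append(links)
    return sources


if __name__ == "__main__":
    for number, links in enumerate(links_per_source(SAMPLE_HTML), start=1):
        print("Source", number, links)
```

    Keeping one list per source makes it easy to label each group of links in the result box, for example by source number.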

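    I do not work with Python professionally and only know a little of it, so if there are problems with the code above, corrections and criticism are very welcome. Thank you in advance!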

    Even a good memory is no match for written notes, so I am recording this here. I hope it is useful to you too!


    Article link: https://www.haomeiwen.com/subject/iuttgqtx.html