python爬虫项目实战:
爬取糗事百科用户的所有信息,包括用户名、性别、年龄、内容等等。
10个步骤实现项目功能,下面开始实例讲解:
1.导入模块
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import re
import urllib.request
from bs4 import BeautifulSoup
</pre>
2.添加头文件,防止爬取过程被拒绝链接
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> def qiuShi(url,page):
################### 模拟成高仿度浏览器的行为 ##############
设置多个头文件参数,模拟成高仿度浏览器去爬取网页
heads ={
'Connection':'keep-alive',
'Accept-Language':'zh-CN,zh;q=0.9',
'Accept':'text/html,application/xhtml+xml,application/xml;
q=0.9,image/webp,image/apng,/;q=0.8',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
headall = []
for key,value in heads.items():
items = (key,value)
将多个头文件参数一个一个添加到headall列表中
headall.append(items)
print(headall)
print('测试1--')
创建opener对象
opener = urllib.request.build_opener()
添加头文件到opener对象
opener.addheaders = headall
将opener对象设置成全局模式
urllib.request.install_opener(opener)
爬取网页并读取数据到data
data = opener.open(url).read().decode()
data1 = urllib.request.urlopen(url).read().decode('utf-8')
print(data1)
print('测试2--')
################## end ########################################
</pre>
3.创建soup解析器对象
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> soup = BeautifulSoup(data,'lxml')
x = 0
</pre>
4.开始使用BeautifulSoup4解析器提取用户名信息
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> ############### 获取用户名 ########################
name = []
使用bs4解析器提取用户名
unames = soup.find_all('h2')
print('测试3--',unames)
for uname in unames:
print(uname.get_text(),'第',page,'-',str(x)+'用户名:',end='')
将用户名一个一个添加到name列表中
name.append(uname.get_text())
print(name)
print('测试4--')
#################end#############################
</pre>
5.提取发表的内容信息
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> ############## 发表的内容 #########################
cont = []
data4 = soup.find_all('div',class_='content')
print(data4)
记住二次筛选一点要转换成字符串形式,否则报错
data4 = str(data4)
使用bs4解析器提取内容
soup3 = BeautifulSoup(data4,'lxml')
contents = soup3.find_all('span')
for content in contents:
print('第',x,'篇糗事的内容:',content.get_text())
将内容一个一个添加到cont列表中
cont.append(content.get_text())
print(cont)
print('测试5--')
##############end####################################
</pre>
6.提取搞笑指数
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> #################搞笑指数##########################
happy = []
获取搞笑指数
第一次筛选
data2 = soup.find_all('span',class_="stats-vote")
获取搞笑指数
第二次筛选
data2 = str(data2) # 将列表转换成字符串形式才可以使用
print(data2)
print('测试6--')
soup1 = BeautifulSoup(data2,'lxml')
happynumbers = soup1.find_all('i',class_="number")
for happynumber in happynumbers:
print(happynumber.get_text())
将将搞笑数一个一个添加到happy列表中
happy.append(happynumber.get_text())
print(happy)
print('测试7--')
##################end#############################
</pre>
7.提取评论数
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> ############## 评论数 ############################
comm = []
data3 = soup.find_all('a',class_='qiushi_comments')
data3 = str(data3)
print(data3)
soup2 = BeautifulSoup(data3,'lxml')
comments = soup2.find_all('i',class_="number")
for comment in comments:
print(comment.get_text())
将评论数一个一个添加到comm列表中
comm.append(comment.get_text())
############end#####################################
</pre>
8.使用正则表达式提取性别和年龄
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> ######## 获取性别和年龄 ##########################
使用正则表达式匹配性别和年龄
pattern1 = '<div class="articleGender (w?)Icon">(d?)</div>'
sexages = re.compile(pattern1).findall(data)
print(sexages)
</pre>
9.设置用户所有信息输出的格局设置
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> ################## 批量输出用户的所以个人信息 #################
print()
for sexage in sexages:
sa = sexage
print(''17, '== 第', page, '页-第', str(x+1) + '个用户 == ',''17)
输出用户名
print('【用户名】:',name[x],end='')
输出性别和年龄
print('【性别】:',sa[0],' 【年龄】:',sa[1])
输出内容
print('【内容】:',cont[x])
输出搞笑数和评论数
print('【搞笑指数】:',happy[x],' 【评论数】:',comm[x])
print(''25,' 三八分割线 ',''25)
x += 1
###################end##########################
</pre>
10.设置循环遍历爬取13页的用户信息
<pre style="-webkit-tap-highlight-color: transparent; box-sizing: border-box; font-family: Consolas, Menlo, Courier, monospace; font-size: 16px; white-space: pre-wrap; position: relative; line-height: 1.5; color: rgb(153, 153, 153); margin: 1em 0px; padding: 12px 10px; background: rgb(244, 245, 246); border: 1px solid rgb(232, 232, 232); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> for i in range(1,14):
糗事百科的网址
url = 'https://www.qiushibaike.com/8hr/page/'+str(i)+'/'
qiuShi(url,i)
</pre>
运行结果,部分截图:
python爬虫项目实战:爬取用户的所有信息,如性别、年龄等以上的运行结果是每时都在更新的,所以读者在运行时,结果不一样是正常的。
今天的项目实战就到这里了,喜欢的朋友可以关注、转发一下喔,希望今天的内容对大家有所帮助。
加小编Python学习群:832339352即可获取数十套Python视频学习资料+书籍!
网友评论