date: 2018-01-05 22:00:00
status: public
title: 'Scraping Qiushibaike Content and Images, with a Simple Reader'
tags: Python 3.6, MySQL, Tkinter, urllib, bs4, md5, random
Some thoughts and the approach
To be honest, I am not a big fan of Qiushibaike myself, but as I said before, everyone scrapes it, so why overthink it: just scrape.
Before writing anything I looked around online for references. Some were fairly old, written for Qiushibaike before its redesign; the better ones offered a command-line interface where you press Enter to keep reading jokes. But they all seemed to ignore, intentionally or not, the posts that come with images, so I decided to build something like a small Qiushibaike client that also supports viewing the images.
I also tidied up my own spider boilerplate along the way: a configuration file, random User-Agent headers, unit-test code and so on, so that future spiders can start from copy and paste.
The code is split into three parts: the UI, a wrapped database interface, and the spider itself. The UI code is simple: sketch a rough layout first, then write the code against it; whenever a function slips your mind, a quick look at the manual settles it.
The database interface consists of (a short usage sketch follows the list):
- DBconnect(): connects to the database with parameters read from the configuration file, and creates the qiushibaike table;
- DBupdate(url, md5, author, fun, comment, content, img_urls=None): inserts one record (one joke) into the qiushibaike table;
- DBquery(): returns the unread (isread = 0) record with the largest id;
- DBTotal(): returns the total number of records in the qiushibaike table;
- DuplicationCheck(md5): checks whether a record with the given md5 already exists; the md5 is a record's data fingerprint;
- DBdrop(): drops the qiushibaike table;
- DBclose(): closes the connection;
- DBtest(): unit-test code.
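Just to make the call order concrete, here is a minimal sketch of the reading side of this interface; it only assumes what is defined in db.py further down, and the real calls live in ui.py:

import db

db.DBconnect()            # reads the connection settings from configure.py and creates the table
print(db.DBTotal())       # how many jokes are stored
record = db.DBquery()     # the unread record with the largest id; DBquery also marks it as read
if record:
    print(record[0]['author'], record[0]['content'])
db.DBclose()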
The full database code is listed below; before that, a few notes on the spider code:
- Qiushibaike's front-end markup is quite clean (compared with Baidu, at least); a quick look is enough to locate the structures we need. The different tab pages also share the same structure, with only a few small differences explained later or in the code, so the spider can crawl any Qiushibaike tab. The figure below shows where the structures we need are located:
- No proxy is used while crawling; there is just a one-second delay between requests plus a random User-Agent header. Sometimes, however, a returned page cannot be parsed correctly (see the guard around soup.select('div.col1') in qiushibaike.py). I have not found the cause: in the failing cases I checked, the response does contain content, but the bs4 CSS selector cannot find the expected element, and simply trying again succeeds. My guess is that it is a parser issue; maybe a different parser or a different CSS selector expression would avoid it. My current workaround is simply to fetch the current page again.
- Every Qiushibaike post has its own URL. I do not scrape the comments, but I do use the md5 of that URL as the record's data fingerprint: before a joke is written to the database the fingerprint is checked first, and if it already exists the joke is skipped, which handles deduplication (a small sketch of this follows the list). I did not bother with a Bloom filter or redis because the data volume is tiny: crawling every page yields only a few thousand rows, and an index on the md5 column is plenty fast. One more thing: I originally wanted to use MongoDB, but it apparently has no 32-bit installer, so I gave up on that.
- Another thing worth mentioning is how to grab the "next page" link. The 热门 (hot) tab always has 13 pages, and on page 13 the usual "下一页" (next page) link turns into "更多" (more); the 24小时 (24 hours) tab behaves the same. The 热图 (trending images) tab also has 13 pages, but its last page has neither a "next page" nor a "more" link, and the 文字 (text) tab behaves the same. The 穿越 (throwback) tab has a variable number of pages and may even be empty, meaning there were no jokes that day. The 糗图 (pictures) tab has 35 pages, and its last page also has neither a "next page" nor a "more" link; the 新鲜 (fresh) tab behaves the same.
- The page count itself is easy to handle: ignore how many pages there are and just keep following the "next page" link.
- Telling "next page", "more" and nothing at all apart needs a bit more logic; see the code for the details, and the short sketch after this list.
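Here is the fingerprint idea from above in a few lines; the URL is made up purely for illustration, and the real version sits inside OnepageSpider in qiushibaike.py:

import hashlib
import db

url = 'https://www.qiushibaike.com/article/123456789'   # made-up post URL
# the md5 hex digest of the post URL is the record's data fingerprint
url_md5 = hashlib.md5(url.encode('utf-8')).hexdigest()

db.DBconnect()
if not db.DuplicationCheck(url_md5):
    # insert only when this fingerprint is not in the table yet;
    # the unique index on url_md5 keeps the lookup fast at this scale
    db.DBupdate(url, url_md5, 'someone', 1, 0, 'content goes here', None)
db.DBclose()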
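And the next-page check in isolation, run against a made-up miniature of the pagination markup (the full logic is at the end of OnepageSpider):

from bs4 import BeautifulSoup

Domain = 'https://www.qiushibaike.com'

# a made-up miniature of the pagination markup, just to exercise the check
html = '''
<div class="col1">
  <ul class="pagination">
    <li><a href="/8hr/page/1/"><span>1</span></a></li>
    <li><a href="/8hr/page/2/"><span>下一页</span></a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
last = soup.select('div.col1 ul.pagination li')[-1].a
# follow the link only when the last pagination entry really is a "下一页" (next page) link;
# on the last page it is either a "更多" (more) link or missing entirely
if last and last.span.get_text().strip() == '下一页':
    next_url = Domain + last.get('href')
else:
    next_url = None
print(next_url)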
Screenshots
What it looks like while scraping:
qiushi-spider.jpg
Data in the table; you can see that some rows have an image and some do not:
qiushi-db.jpg
A joke without an image:
qiushi-nopic.jpg
A joke with an image:
qiushi-pic.jpg
Source code
configure.py
# DB
DB_HOST = '192.168.153.131'
DB_PORT = 3306
DB_DBNAME = 'spider'
DB_USER = 'root'
DB_PASSWORD = '123123'
DB_CHARSET = 'utf8mb4'
# User-Agents
FakeUserAgents = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
"Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
"Opera/9.27 (Windows NT 5.2; U; zh-cn)",
"Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13",
"Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 ",
"Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 ",
"Mozilla/5.0 (Linux; U; Android 3.2; ja-jp; F-01D Build/F0001) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13 ",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; da-dk) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5 ",
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-US) AppleWebKit/530.9 (KHTML, like Gecko) Chrome/ Safari/530.9 ",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/27.0.1453.93 Chrome/27.0.1453.93 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
"Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"
]
db.py
import pymysql.cursors
import configure
conn = None
# Simple wrappers for DML and DQL execution
def __DMLExecutionMod(sql):
global conn
try:
with conn.cursor() as cursor:
cursor.execute(sql)
conn.commit()
except Exception as e:
conn.rollback()
print ("DB Exception: %s", e)
def __DQLExecutionMod(sql):
global conn
try:
with conn.cursor() as cursor:
cursor.execute(sql)
res = cursor.fetchall()
conn.commit()
    except Exception as e:
        conn.rollback()
        print("DB Exception: %s" % e)
        res = ()
    return res
# Connect
def DBconnect():
global conn
config = {
'host':configure.DB_HOST,
'port':configure.DB_PORT,
'user':configure.DB_USER,
'password':configure.DB_PASSWORD,
'db':configure.DB_DBNAME,
'charset':configure.DB_CHARSET,
'cursorclass':pymysql.cursors.DictCursor,
}
    if conn is None:
conn = pymysql.connect(**config)
# init table
sql = "CREATE TABLE IF NOT EXISTS `qiushibaike` (\
`id` int(11) NOT NULL AUTO_INCREMENT,\
`isread` int(11) NULL DEFAULT 0,\
`url` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL COMMENT 'url_md5 = md5(url)',\
`url_md5` binary(64) NOT NULL,\
`author` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
`fun` int(255) NULL DEFAULT NULL,\
`comment` int(255) NULL DEFAULT NULL,\
`content` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
`img_url` varchar(500) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL,\
PRIMARY KEY (`id`) USING BTREE,\
UNIQUE INDEX `idx_id`(`id`) USING BTREE,\
UNIQUE INDEX `idx_url_md5`(`url_md5`) USING BTREE\
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Compact;\
"
__DMLExecutionMod(sql)
# Add ONE record into the table
def DBupdate(url, md5, author, fun, comment, content, img_urls=None):
global conn
    if img_urls is None:
img_urls = 'null'
else:
img_urls = "'" + img_urls + "'"
sql = "INSERT INTO `qiushibaike`\
(`url`, `url_md5`, `author`, `fun`, `comment`, `content`, `img_url`)\
VALUES\
('{0:s}', HEX('{1:s}'), '{2:s}', {3:d}, {4:d}, '{5:s}', \
        {6:s});".format(url, md5, author, fun, comment, content, img_urls)
__DMLExecutionMod(sql)
return True
# Retrieve the newest unread record and mark it as read
def DBquery():
global conn
sql = "SELECT `id`, `url`, `author`, `fun`, `comment`, `content`, `img_url`\
FROM `qiushibaike` WHERE isread = 0 \
        ORDER BY `id` DESC LIMIT 1;"
res = __DQLExecutionMod(sql)
sql = "UPDATE `qiushibaike` SET isread = 1 WHERE id = {0:d};".format(res[0]['id'])
__DMLExecutionMod(sql)
return res
# Get the total number of records
def DBTotal():
    global conn
sql = "SELECT count(*) as `total` FROM `qiushibaike`;"
res = __DQLExecutionMod(sql)
return res[0]['total']
# duplication check
def DuplicationCheck(md5):
global conn
sql = "SELECT count(*) AS `num` FROM `qiushibaike` WHERE url_md5 = HEX('{0:s}');".format(md5)
res = __DQLExecutionMod(sql)
if res[0]['num']:
return True
else:
return False
# Drop this table
def DBdrop():
global conn
__DMLExecutionMod("DROP TABLE `qiushibaike`;")
return True
# close
def DBclose():
global conn
if conn is not None:
conn.close()
def DBtest():
DBconnect()
assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372140', 'ethan', 12, 13, 'aaaa', None), 'update fail - 1'
assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372141', 'ethan', 12, 14, 'aaaa', 'http://a;http://b;'), 'update fail - 2'
assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372142', 'ethan', 12, 15, 'aaaa', None), 'update fail - 3'
res = DBquery()
assert 1 == len(res), 'query fail - 11'
assert 15 == res[0]['comment'], 'query fail - 12'
res = DBquery()
assert 1 == len(res), 'query fail - 21'
assert 14 == res[0]['comment'], 'query fail - 22'
assert 3 == DBTotal(), 'query fail - 31'
assert True == DuplicationCheck('ed646a3334ca891fd3467db131372142'), 'duplicate fail - 1'
assert False == DuplicationCheck('11111111111111111111111111111111'), 'duplicate fail - 2'
assert True == DBdrop(), 'drop fail'
DBclose()
# test
if __name__ == '__main__':
DBtest()
ui.py
import tkinter as tk
import tkinter.messagebox
import webbrowser
from tkinter import END
from PIL import Image, ImageTk
import urllib.request
import db as datasourse
import qiushibaike as qb
def init_ui():
root = tk.Tk()
root.title('糗事百科私人阅读器')
width = 600
height = 440
screenwidth = root.winfo_screenwidth()
screenheight = root.winfo_screenheight()
size = '%dx%d+%d+%d' % (width, height, (screenwidth - width)/3, (screenheight - height)/3)
root.geometry(size)
    # Author, funny-count and comment-count fields
lf_content = tk.LabelFrame(root, width=580, height=350)
lf_content.grid(row=0, column=0, sticky='w',padx=10, pady=10, columnspan=3)
lstr_author = tk.StringVar()
lstr_author.set("作者: ")
lstr_fun_comment = tk.StringVar()
lstr_fun_comment.set("0 好笑 0 评论")
lstr_url = tk.StringVar()
lstr_url.set("源地址:")
lstr_url_val = tk.StringVar()
href = ""
label_author = tk.Label(lf_content,
textvariable = lstr_author,
width= 24,
height = 1,
font = ('Microsoft YaHei', 12),
anchor='w'
)
label_author.place(x=5, y=2)
label_fun_comment = tk.Label(lf_content,
textvariable = lstr_fun_comment,
width= 24,
height = 1,
font = ('Microsoft YaHei', 8),
anchor='w'
)
label_fun_comment.place(x=5, y=30)
label_url = tk.Label(lf_content,
textvariable = lstr_url,
width= 48,
height = 1,
font = ('Microsoft YaHei', 10),
anchor='w'
)
label_url.place(x=5, y=52)
    # Turn the URL into a clickable hyperlink
def callback(event):
global href
webbrowser.open_new(href)
label_url_val = tk.Label(lf_content,
textvariable = lstr_url_val,
fg='blue',
cursor='hand2',
width= 48,
height = 1,
font = ('Microsoft YaHei', 10),
anchor='w'
)
label_url_val.place(x=55, y=52)
label_url_val.bind("<Button-1>", callback)
    # Text widget for the joke content
textbox = tk.Text(lf_content,
width=62,
height=12,
relief='solid',
font = ('Microsoft YaHei', 12),
#state = 'disabled'
)
textbox.place(x=5,y=80)
    # Run one round of scraping
def button_spider_click():
count = qb.OneCircleSpider()
        tk.messagebox.showinfo(title='HI', message='本次新抓取了{0:d}条记录。'.format(count))
    # Fetch one record and render it
    def button_luck_click():
        if 0 == datasourse.DBTotal():
            tk.messagebox.showinfo(title='HI', message='你已经看完了所有的百科,再抓一些吧!')
            return
        # parse the record and fill the widgets
        record = datasourse.DBquery()[0]
lstr_author.set("作者: {0:s}".format(record['author']))
lstr_fun_comment.set("{0:d} 好笑 {0:d} 评论".format(record['fun'], record['comment']))
lstr_url_val.set(record['url'])
global href
href = record['url']
        # A disabled Text widget cannot be modified,
        # so switch it to normal first and switch back after inserting the content
textbox.configure(state='normal')
existed_text = textbox.get("1.0", END).strip()
if existed_text:
textbox.delete("1.0", END)
textbox.insert('insert', record['content'])
textbox.configure(state='disabled')
        # Always disable the image button first;
        # if the record has an image, download it and enable the button
button_img.configure(state='disabled')
if record['img_url']:
urllib.request.urlretrieve(record['img_url'],filename='test.jpg')
button_img.configure(state='normal')
def button_img_click():
        # Open a new window the same size as the image
img_window = tk.Toplevel(root)
img_window.title("图片查看")
image = Image.open("test.jpg")
        # Why +4? To leave a symmetric border around the image
img_window_size = '%dx%d+%d+%d' % (image.width + 4, image.height + 4, (screenwidth - image.width)/2, (screenheight - image.height)/2)
img_window.geometry(img_window_size)
img = ImageTk.PhotoImage(image)
canvas = tk.Canvas(img_window, width = image.width ,height = image.height, bg = 'grey')
        # The first two arguments of create_image() are the coordinates of the image's center
canvas.create_image(image.width//2, image.height//2, image=img)
canvas.place(x=0,y=0)
img_window.mainloop()
    # The three buttons
button_spider = tk.Button(root,
text='抓取更多',
width=10,
height=2,
font = ('Microsoft YaHei', 12),
command=button_spider_click
)
button_spider.grid(row=1, column=0, sticky='we',padx=10)
button_img = tk.Button(root,
text='显示图片',
width=10,
height=2,
font = ('Microsoft YaHei', 12),
state = 'disabled',
command=button_img_click
)
button_img.grid(row=1, column=1, sticky='we',padx=10)
button_luck = tk.Button(root,
text='手气不错',
width=10,
height=2,
font = ('Microsoft YaHei', 12),
command=button_luck_click
)
button_luck.grid(row=1, column=2, sticky='we',padx=10)
root.mainloop()
if __name__ == '__main__':
datasourse.DBconnect()
init_ui()
datasourse.DBclose()
qiushibaike.py
# Standard Lib
import urllib
import hashlib
import time
from urllib import request
from urllib import error
from bs4 import BeautifulSoup
from random import choice
# User Lib
import db
import ui
import configure
# Any of these tab URLs can be crawled, since they all share the same page structure
# In order: 热门 (hot), 24小时 (24 hours), 热图 (trending images), 文字 (text), 穿越 (throwback), 糗图 (pictures), 新鲜 (fresh)
TargetURLs = ['https://www.qiushibaike.com/',
'https://www.qiushibaike.com/imgrank/',
'https://www.qiushibaike.com/hot/',
'https://www.qiushibaike.com/text/',
'https://www.qiushibaike.com/history/',
'https://www.qiushibaike.com/pic/',
'https://www.qiushibaike.com/textnew/'
]
Domain = 'https://www.qiushibaike.com'
def OnepageSpider(myTargetURL=None):
    # A default of choice(TargetURLs) would be evaluated only once, at definition time,
    # so pick a random tab on each call instead
    if myTargetURL is None:
        myTargetURL = choice(TargetURLs)
print ("Start to spider: {0:s}".format(myTargetURL))
try:
        # Build the request
req = request.Request(myTargetURL)
req.add_header("User-Agent",choice(configure.FakeUserAgents))
response = request.urlopen(req)
if response.getcode() != 200:
print ("HTTP Request Code: {0:d}".format(response.getcode()))
return myTargetURL, 0
html = response.read()
    except error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return myTargetURL, 0
    # Parse with bs4
    soup = BeautifulSoup(html, 'lxml')
    # The select below occasionally fails but succeeds on a retry, so guard it here.
    # I have not found the root cause; it does not seem to be network flakiness,
    # and it may be related to the CSS selector expression or the parser.
if soup.select('div.col1'):
results = soup.select('div.col1')[0].select("div.article")
else:
print ("SOMETHING IS WRONG, TRY AGAIN LATER.")
return myTargetURL, 0
    # Parse the data and write it to the DB
count = 0
for res in results:
        # First build the post URL and check whether it is already in the database
url = Domain + res.find_all('a', class_='contentHerf')[0].get('href')
# md5
m = hashlib.md5()
m.update(url.encode('utf-8'))
url_md5 = m.hexdigest()
if db.DuplicationCheck(url_md5):
continue
        # Not in the database yet, so parse the remaining fields
author = res.find('h2').get_text().strip()
stat = res.find_all('i', class_='number')
        # If the comment count is 0 the counter is simply not shown;
        # I have not seen a post with a funny count of 0, but handle it the same way
        if len(stat) == 0:
            fun, comment = 0, 0
elif len(stat) == 1:
fun = stat[0].get_text()
comment = 0
else:
fun = stat[0].get_text()
comment = stat[1].get_text()
content = res.select("div.content span")[0].get_text().strip()
if res.select("div.thumb"):
img_urls = "https:" + res.select("div.thumb img")[0].get('src')
else:
img_urls = None
if True == db.DBupdate(url, url_md5, author, int(fun), int(comment), content, img_urls):
count += 1
    # Find the URL of the next page and return it
next = soup.select('div.col1 ul.pagination li')[-1].a
    # Written this way because on some tabs the last page ends with a "更多" (more) link,
    # while on others it ends with nothing; an extra check handles both cases
if next and next.span.get_text().strip() == '下一页':
next_url = Domain + next.get('href')
else:
next_url = None
return next_url, count
# Pick one of the tab URLs and crawl one full round,
# i.e. keep following "next page" until there is none; usually 13 pages, one tab has more
def OneCircleSpider():
total = 0
next_url, num = OnepageSpider()
print ("Spider One Page. Add {0:d} record(s)".format(num))
total += num
while next_url:
next_url, num = OnepageSpider(next_url)
total += num
print ("Spider One Page. Add {0:d} record(s)".format(num))
time.sleep(1)
print ("Add {0:d} record(s) in this circle".format(total))
return total
def main():
db.DBconnect()
ui.init_ui()
db.DBclose()
if __name__ == '__main__':
main()