Scraping Qiushibaike: content and images, with a reader

Author: 小温侯 | Published 2018-07-21 23:17

    date: 2018-01-05 22:00:00
    status: public
    title: 'Scraping Qiushibaike: content and images, with a reader'
    tags: Python 3.6, MySQL, Tkinter, urllib, bs4, md5, random


    Some ideas and the approach

    To be honest, I'm not a big fan of Qiushibaike, but as I said before: everyone else is scraping it, so never mind all that, let's just scrape it.

    Before writing anything I looked around online for existing material. Some of it was dated, written for the site before its redesign; the better examples built a command-line interface where you press Enter to read one story after another. All of them, though, seemed to ignore stories with images, deliberately or not, so I decided to build something like a small Qiushibaike client that supports reading images as well.

    At the same time I tidied up my own spider scaffolding a bit more, adding a configuration file, random User-Agent headers, unit-test code and so on, so that future spiders can start from copy-paste.

    The code has three parts: the UI, a wrapped database interface, and the spider itself. The UI code is straightforward: sketch a rough layout first, then write the code against it, looking up any functions you don't remember in the manual; it goes quickly.

    The database interface consists of:

    • DBconnect(): connects to the database, reading its parameters from the configuration file, and creates the qiushibaike table if it doesn't exist;
    • DBupdate(url, md5, author, fun, comment, content, img_urls=None): inserts one record (one story) into the qiushibaike table;
    • DBquery(): returns the unread record (isread = 0) with the largest id;
    • DBTotal(): returns the total number of records in the qiushibaike table;
    • DuplicationCheck(md5): checks whether a record with the given md5 is already in the database; the md5 acts as a record's data fingerprint;
    • DBdrop(): drops the qiushibaike table;
    • DBclose(): closes the connection;
    • DBtest(): unit-test code.
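
    A minimal sketch of how these functions are meant to chain together (the URL and values here are made up for illustration):

    import db
    
    db.DBconnect()                    # connect; creates the table if needed
    
    fingerprint = 'ed646a3334ca891fd3467db131372140'   # md5 of the story URL
    if not db.DuplicationCheck(fingerprint):
        db.DBupdate('https://www.qiushibaike.com/article/1', fingerprint,
                    'some_author', 12, 3, 'story text', None)
    
    print(db.DBTotal())               # total number of stored records
    story = db.DBquery()              # newest unread record, marked as read
    db.DBclose()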

    The full db.py source is further down; first, a few notes on the spider code:

    • Qiushibaike's front-end markup is quite clean (at least compared with baidu); a quick look reveals the structure we need. The different tab pages all share the same structure (with one small difference, explained below and in the code), so the spider can crawl the content of any Qiushibaike tab. The image below shows where the structure we use lives:
    qiushi-struct.jpg
    • No proxy is used while crawling; there is just a 1 s delay between fetches, plus a random User-Agent header on each request. Even so, the returned page sometimes cannot be parsed correctly; see qiushibaike.py line 48. I haven't found the cause. In the failing cases the returned page does have content, but bs4's CSS selector comes back empty, and simply repeating the request succeeds, so it may be a parser issue; I don't know whether switching parsers or rewriting the CSS selector would fix it. My current workaround is to re-crawl the current page.
    • Every Qiushibaike story has a unique URL. Its comments are not crawled, but the md5 of that URL serves as the record's data fingerprint: before a story is written to the database, its fingerprint is checked against existing records, and duplicates are skipped (see the sketch after this list). There's no BloomFilter or redis here because the data volume is tiny: crawling every page yields only a few thousand records, and an index on the md5 column is plenty fast. One more thing: I originally wanted to use mongodb, but it apparently has no 32-bit installer, so I gave up on that.
    • The "next page" link is also worth a mention. The 热门 (hot) tab is fixed at 13 pages, and on page 13 the usual 下一页 ("next page") label turns into 更多 ("more"); 24小时 (24 hours) behaves the same. The 热图 (trending images) tab is also 13 pages, but its last page has neither a 下一页 nor a 更多 label; 文字 (text) behaves the same. The 穿越 (throwback) tab has a variable page count and may even be empty if that day has no stories. The 糗图 (images) tab is 35 pages, and its last page also lacks both labels; 新鲜 (fresh) behaves the same.
      • The page count itself is easy to handle: ignore how many pages there are and just follow the 下一页 link.
      • Telling 下一页 from 更多 from nothing takes a bit more logic; see the code for the details.
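
    To make the dedup concrete, here is the fingerprint check condensed out of OnepageSpider below; already_stored() is just an illustrative helper, not part of the actual source:

    import hashlib
    
    import db
    
    def already_stored(url):
        # The md5 of the story's permalink is its data fingerprint
        url_md5 = hashlib.md5(url.encode('utf-8')).hexdigest()
        return url_md5, db.DuplicationCheck(url_md5)
    
    url_md5, seen = already_stored('https://www.qiushibaike.com/article/1')
    if not seen:
        pass  # parse the rest of the story and db.DBupdate(...) it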

    Screenshots

    The spider at work:

    qiushi-spider.jpg

    Data in the table; note that some records have an image and some don't:

    qiushi-db.jpg

    A story without an image:

    qiushi-nopic.jpg

    A story with an image:

    qiushi-pic.jpg

    Source code

    configure.py

    # DB
    DB_HOST = '192.168.153.131'
    DB_PORT = 3306
    DB_DBNAME = 'spider'
    DB_USER = 'root'
    DB_PASSWORD = '123123'
    DB_CHARSET = 'utf8mb4'
    
    # User-Agents
    FakeUserAgents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
        "Opera/9.27 (Windows NT 5.2; U; zh-cn)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13",
        "Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 ",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 ",
        "Mozilla/5.0 (Linux; U; Android 3.2; ja-jp; F-01D Build/F0001) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13 ",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; da-dk) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5 ",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-US) AppleWebKit/530.9 (KHTML, like Gecko) Chrome/ Safari/530.9 ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/27.0.1453.93 Chrome/27.0.1453.93 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"
    ]
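
    The spider consumes these settings directly; the User-Agent list, for example, is attached to each request like this (the same pattern appears in qiushibaike.py below):

    from random import choice
    from urllib import request
    
    import configure
    
    req = request.Request('https://www.qiushibaike.com/')
    req.add_header('User-Agent', choice(configure.FakeUserAgents))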
    

    db.py

    import pymysql.cursors
    import configure
    
    conn = None
    
    # Thin wrappers around DML (write) and DQL (read) execution.
    # Both take an optional args tuple so that pymysql escapes the
    # values itself instead of us formatting them into the SQL string.
    def __DMLExecutionMod(sql, args=None):
        global conn
    
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, args)
            conn.commit()
        except Exception as e:
            conn.rollback()
            print ("DB Exception: %s" % e)
    
    def __DQLExecutionMod(sql, args=None):
        global conn
    
        res = ()
        try:
            with conn.cursor() as cursor:
                cursor.execute(sql, args)
                res = cursor.fetchall()
            conn.commit()
        except Exception as e:
            conn.rollback()
            print ("DB Exception: %s" % e)
        
        return res
    
    # Connect
    def DBconnect():
        global conn
    
        config = {
            'host':configure.DB_HOST,
            'port':configure.DB_PORT,
            'user':configure.DB_USER,
            'password':configure.DB_PASSWORD,
            'db':configure.DB_DBNAME,
            'charset':configure.DB_CHARSET,
            'cursorclass':pymysql.cursors.DictCursor,
            }
    
        if conn is None:
            conn = pymysql.connect(**config)
    
        # init table
        sql = "CREATE TABLE IF NOT EXISTS `qiushibaike`  (\
                `id` int(11) NOT NULL AUTO_INCREMENT,\
                `isread` int(11) NULL DEFAULT 0,\
                `url` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL COMMENT 'url_md5 = md5(url)',\
                `url_md5` binary(64) NOT NULL,\
                `author` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
                `fun` int(255) NULL DEFAULT NULL,\
                `comment` int(255) NULL DEFAULT NULL,\
                `content` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
                `img_url` varchar(500) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL,\
                PRIMARY KEY (`id`) USING BTREE,\
                UNIQUE INDEX `idx_id`(`id`) USING BTREE,\
                UNIQUE INDEX `idx_url_md5`(`url_md5`) USING BTREE\
                ) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Compact;\
            "   
    
        __DMLExecutionMod(sql)
    
    # Add ONE record into the table
    def DBupdate(url, md5, author, fun, comment, content, img_urls=None):
        global conn
    
        # Parameterized execution: pymysql escapes any quotes inside the
        # scraped content and maps img_urls=None to SQL NULL by itself.
        sql = "INSERT INTO `qiushibaike`\
                (`url`, `url_md5`, `author`, `fun`, `comment`, `content`, `img_url`)\
                VALUES (%s, HEX(%s), %s, %s, %s, %s, %s);"
    
        __DMLExecutionMod(sql, (url, md5, author, fun, comment, content, img_urls))
    
        return True
    
    # Retrieve the newest unread record and mark it as read
    def DBquery():
        global conn
    
        sql = "SELECT `id`, `url`, `author`, `fun`, `comment`, `content`, `img_url`\
                FROM `qiushibaike` WHERE isread = 0 \
                ORDER BY `id` DESC LIMIT 1;"
    
        res = __DQLExecutionMod(sql)
        if not res:
            # nothing unread (or the table is empty)
            return res
    
        __DMLExecutionMod("UPDATE `qiushibaike` SET isread = 1 WHERE id = %s;", (res[0]['id'],))
    
        return res
    
    # Total number of records in the table
    def DBTotal():
        global conn
        sql = "SELECT count(*) as `total` FROM `qiushibaike`;"
    
        res = __DQLExecutionMod(sql)
    
        return res[0]['total']
    
    # duplication check
    def DuplicationCheck(md5):
        global conn
        sql = "SELECT count(*) AS `num` FROM `qiushibaike` WHERE url_md5 = HEX(%s);"
    
        res = __DQLExecutionMod(sql, (md5,))
    
        return bool(res[0]['num'])
    
    # Drop this table
    def DBdrop():
        global conn
        __DMLExecutionMod("DROP TABLE `qiushibaike`;")
    
        return True
    
    # close
    def DBclose():
        global conn
        if conn is not None:
            conn.close()
    
    def DBtest():
        DBconnect()
    
        assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372140', 'ethan', 12, 13, 'aaaa', None), 'update fail - 1'
        assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372141', 'ethan', 12, 14, 'aaaa', 'http://a;http://b;'), 'update fail - 2'
        assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372142', 'ethan', 12, 15, 'aaaa', None), 'update fail - 3'
    
        res = DBquery()
        assert 1 == len(res), 'query fail - 11'
        assert 15 == res[0]['comment'], 'query fail - 12'
    
        res = DBquery()
        assert 1 == len(res), 'query fail - 21'
        assert 14 == res[0]['comment'], 'query fail - 22'
    
        assert 3 == DBTotal(), 'query fail - 31'
    
        assert True == DuplicationCheck('ed646a3334ca891fd3467db131372142'), 'duplicate fail - 1'
        assert False == DuplicationCheck('11111111111111111111111111111111'), 'duplicate fail - 2'
    
        assert True == DBdrop(), 'drop fail'
        DBclose()
    
    # test
    if __name__ == '__main__':
        DBtest()
    

    ui.py

    import tkinter as tk
    import tkinter.messagebox
    import webbrowser
    from tkinter import END
    from PIL import Image, ImageTk
    import urllib.request
    
    import db as datasourse
    import qiushibaike as qb
    
    def init_ui():
        root = tk.Tk()
        root.title('Qiushibaike personal reader')
        width = 600
        height = 440
        screenwidth = root.winfo_screenwidth()  
        screenheight = root.winfo_screenheight()  
        size = '%dx%d+%d+%d' % (width, height, (screenwidth - width)/3, (screenheight - height)/3)
        root.geometry(size)
    
        # Author, fun count, comment count and source-URL fields
        lf_content = tk.LabelFrame(root, width=580, height=350)  
        lf_content.grid(row=0, column=0, sticky='w',padx=10, pady=10, columnspan=3)
    
        lstr_author = tk.StringVar()
        lstr_author.set("Author: ")
        lstr_fun_comment = tk.StringVar()
        lstr_fun_comment.set("0 funny, 0 comments")
        lstr_url = tk.StringVar()
        lstr_url.set("Source: ")
        lstr_url_val = tk.StringVar()
        href = ""
    
        label_author = tk.Label(lf_content,
            textvariable = lstr_author,
            width= 24, 
            height = 1,
            font = ('Microsoft YaHei', 12),
            anchor='w'
            )
        label_author.place(x=5, y=2)
    
        label_fun_comment = tk.Label(lf_content,
            textvariable = lstr_fun_comment,
            width= 24, 
            height = 1,
            font = ('Microsoft YaHei', 8),
            anchor='w'
            )
        label_fun_comment.place(x=5, y=30)
    
        label_url = tk.Label(lf_content,
            textvariable = lstr_url,
            width= 48, 
            height = 1,
            font = ('Microsoft YaHei', 10),
            anchor='w'
            )
        label_url.place(x=5, y=52)
    
        # Make the URL a clickable hyperlink
        def callback(event):
            # href lives in init_ui's scope, so nonlocal (not global) binds to it
            nonlocal href
            webbrowser.open_new(href)
    
        label_url_val = tk.Label(lf_content,
            textvariable = lstr_url_val,
            fg='blue',
            cursor='hand2',
            width= 48, 
            height = 1,
            font = ('Microsoft YaHei', 10),
            anchor='w'
            )
        label_url_val.place(x=55, y=52)
        label_url_val.bind("<Button-1>", callback)
    
        # Text widget for the story body
        textbox = tk.Text(lf_content, 
            width=62,
            height=12,
            relief='solid',
            font = ('Microsoft YaHei', 12),
            #state = 'disabled'
        )
        textbox.place(x=5,y=80)     
    
        # Run one crawl pass
        def button_spider_click():
            count = qb.OneCircleSpider()
            tk.messagebox.showinfo(title='HI', message='Fetched {0:d} new record(s) this time.'.format(count))
    
        # Fetch one record and render it
        def button_luck_click():
            records = datasourse.DBquery()
            if not records:
                # covers both an empty table and everything already read
                tk.messagebox.showinfo(title='HI', message="You have read every story; go fetch some more!")
                return
    
            # Render the record
            record = records[0]
            lstr_author.set("Author: {0:s}".format(record['author']))
            lstr_fun_comment.set("{0:d} funny, {1:d} comments".format(record['fun'], record['comment']))
            lstr_url_val.set(record['url'])
            nonlocal href
            href = record['url']
    
            # A disabled Text widget rejects edits, so switch it to
            # normal, replace the content, then disable it again
            textbox.configure(state='normal')
            existed_text = textbox.get("1.0", END).strip()
            if existed_text:
                textbox.delete("1.0", END)
            textbox.insert('insert', record['content'])
            textbox.configure(state='disabled')
    
            # Disable the image button first; if the record has an
            # image, download it and enable the button again
            button_img.configure(state='disabled')
            if record['img_url']:
                urllib.request.urlretrieve(record['img_url'],filename='test.jpg')
                button_img.configure(state='normal')
    
        def button_img_click():
            # Open a new window sized to match the image
            img_window = tk.Toplevel(root)
            img_window.title("Image viewer")
            image = Image.open("test.jpg")
            # Why +4? To keep the border symmetric
            img_window_size = '%dx%d+%d+%d' % (image.width + 4, image.height + 4, (screenwidth - image.width)/2, (screenheight - image.height)/2)
            img_window.geometry(img_window_size)
                            
            img = ImageTk.PhotoImage(image)
            canvas = tk.Canvas(img_window, width = image.width ,height = image.height, bg = 'grey')
            # The first two arguments of create_image() are the coordinates
            # of the image's **center**, not its top-left corner
            canvas.create_image(image.width//2, image.height//2, image=img)
            canvas.place(x=0,y=0)
    
            img_window.mainloop()
    
        # The three buttons
        button_spider = tk.Button(root,
            text='Fetch more',
            width=10,
            height=2,
            font = ('Microsoft YaHei', 12),
            command=button_spider_click
            )
        button_spider.grid(row=1, column=0, sticky='we',padx=10)
    
        button_img = tk.Button(root,
            text='Show image',
            width=10,
            height=2,
            font = ('Microsoft YaHei', 12),
            state = 'disabled',
            command=button_img_click
            )
        button_img.grid(row=1, column=1, sticky='we',padx=10)
    
        button_luck = tk.Button(root,
            text="I'm feeling lucky",
            width=10,
            height=2,
            font = ('Microsoft YaHei', 12),
            command=button_luck_click
            )
        button_luck.grid(row=1, column=2, sticky='we',padx=10)
    
        root.mainloop()
    
    if __name__ == '__main__':
        datasourse.DBconnect()
        init_ui()
        datasourse.DBclose()
    

    qiushibaike.py

    # Standard Lib
    import hashlib
    import time
    from urllib import request
    from urllib import error
    from random import choice
    
    # Third-party Lib
    from bs4 import BeautifulSoup
    
    # User Lib
    import db
    import configure
    
    # Any of these tab URLs can be crawled; they all share one structure.
    # In order: 热门 (hot), 24小时 (24 hours), 热图 (trending images),
    # 文字 (text), 穿越 (throwback), 糗图 (images), 新鲜 (fresh)
    TargetURLs = ['https://www.qiushibaike.com/',
                'https://www.qiushibaike.com/imgrank/',
                'https://www.qiushibaike.com/hot/',
                'https://www.qiushibaike.com/text/',
                'https://www.qiushibaike.com/history/',
                'https://www.qiushibaike.com/pic/',
                'https://www.qiushibaike.com/textnew/'
            ]
    
    Domain = 'https://www.qiushibaike.com'
    
    def OnepageSpider(myTargetURL=None):
        # A default argument of choice(TargetURLs) would be evaluated only
        # once, at definition time, so pick the random tab inside the body.
        if myTargetURL is None:
            myTargetURL = choice(TargetURLs)
    
        print ("Start to spider: {0:s}".format(myTargetURL))
        try:
            # Build the request with a random User-Agent
            req = request.Request(myTargetURL)
            req.add_header("User-Agent",choice(configure.FakeUserAgents))
            response = request.urlopen(req)
            if response.getcode() != 200:
                print ("HTTP Request Code: {0:d}".format(response.getcode()))
                return myTargetURL, 0
            html = response.read()
        except error.URLError as e:
            if hasattr(e,"code"):
                print(e.code)
            if hasattr(e,"reason"):
                print(e.reason)
            # Returning the same URL with count 0 lets the caller retry it
            return myTargetURL, 0
    
        # Parse with bs4
        soup = BeautifulSoup(html, 'lxml')
    
        # This occasionally fails yet succeeds on a retry, hence the check.
        # I haven't found the cause; it doesn't look like a network hiccup
        # and may be related to the CSS selector expression.
        if soup.select('div.col1'): 
            results = soup.select('div.col1')[0].select("div.article")
        else:
            print ("SOMETHING IS WRONG, TRY AGAIN LATER.")
            return myTargetURL, 0
    
        # Parse each story and write it to the DB
        count = 0
        for res in results:
            # First build the story URL and check whether it is already stored
            url = Domain + res.find_all('a', class_='contentHerf')[0].get('href')
            # md5 fingerprint of the URL
            m = hashlib.md5()
            m.update(url.encode('utf-8'))
            url_md5 = m.hexdigest()
            
            if db.DuplicationCheck(url_md5):
                continue
    
            # Not in the DB yet, so parse the remaining fields
            author = res.find('h2').get_text().strip()
            
            stat = res.find_all('i', class_='number')
            
            # A comment count of 0 is simply not displayed.
            # I haven't seen a post whose fun count is 0, but handle it anyway.
            if len(stat) == 0:
                fun = comment = 0
            elif len(stat) == 1:
                fun = stat[0].get_text()
                comment = 0
            else:
                fun = stat[0].get_text()
                comment = stat[1].get_text()
    
            content = res.select("div.content span")[0].get_text().strip()
    
            if res.select("div.thumb"):
                img_urls = "https:" + res.select("div.thumb img")[0].get('src')
            else:
                img_urls = None
    
    
            if True == db.DBupdate(url, url_md5, author, int(fun), int(comment), content, img_urls):
                count += 1
    
        
        # Find the "next page" link and return its URL
        next = soup.select('div.col1 ul.pagination li')[-1].a
        # On some tabs the last page ends with a 更多 ("more") label and on
        # others with nothing at all, so an extra check is needed to
        # support crawling every tab
        if next and next.span and next.span.get_text().strip() == '下一页':
            next_url = Domain + next.get('href')
        else:
            next_url = None
    
        return next_url, count
    
    # Crawl one full round starting from a random tab URL, following
    # "next page" until there is none; usually 13 pages, more on the image tab
    def OneCircleSpider():
        total = 0
    
        next_url, num = OnepageSpider()
        print ("Spider One Page. Add {0:d} record(s)".format(num))
        total += num
        
        while next_url:
            next_url, num = OnepageSpider(next_url)
            total += num
            print ("Spider One Page. Add {0:d} record(s)".format(num))
            time.sleep(1)
        
        print ("Add {0:d} record(s) in this circle".format(total))
        
        return total
    
    def main():
        # Imported here rather than at module level because ui.py also
        # imports this module; a top-level import would be circular.
        import ui
    
        db.DBconnect()
        ui.init_ui()
        db.DBclose()
        
    if __name__ == '__main__':
        main()
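
    To try it out: create the `spider` database in MySQL, point configure.py at your instance, run python db.py once to exercise the DB layer (it runs DBtest()), then python qiushibaike.py to launch the reader.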
    
