Scraping Qiushibaike (Qiubai) | Test Assignment

Author: Mrchw | Published: 2017-05-23 23:41 | Read 27 times

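This scraper targets the text-only section of Qiushibaike. The markup there is fairly uniform, so there is no need to check for images or videos. Only the standard library is used, and the data is extracted with a regular expression.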

Setting the request headers

user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}
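As a minimal sketch of how these headers get attached to a request, the snippet below uses Python 3's urllib.request (the article itself uses Python 2's urllib2). Without an explicit header, urllib identifies itself as "Python-urllib/x.y". That the site rejects that default is an assumption here, not something the article states. The helper name build_request is hypothetical.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)
HEADERS = {"User-Agent": USER_AGENT}


def build_request(url):
    """Return a Request object carrying the browser-like User-Agent.

    Hypothetical helper for illustration; not part of the original article.
    """
    return urllib.request.Request(url, headers=HEADERS)


# Building the Request does not contact the server; only urlopen() would.
request = build_request("http://www.qiushibaike.com/text/page/1/")
```

Note that urllib stores header names in capitalized form, so the value is read back with request.get_header("User-agent").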

Pagination

for page in range(1,5):

    url = 'http://www.qiushibaike.com/text/page/'+str(page)+'/?s=4984889'
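Because the upper bound of range is exclusive, range(1,5) visits pages 1 through 4. A small sketch of the same URL construction, wrapped in a hypothetical page_urls generator with an inclusive upper bound, might look like this. The s=4984889 query parameter is copied from the article as-is, since its meaning is not explained there.

```python
# URL template taken from the article; the "s" parameter is kept unchanged.
BASE_URL = "http://www.qiushibaike.com/text/page/{page}/?s=4984889"


def page_urls(first, last):
    """Yield listing URLs for pages first..last, both inclusive.

    Hypothetical helper for illustration.
    """
    for page in range(first, last + 1):
        yield BASE_URL.format(page=page)


# Same four pages as range(1, 5) in the article.
urls = list(page_urls(1, 4))
```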

Regular expression

Use non-greedy (.*?) capture groups to extract the data

pattern = re.compile('<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>.*?<div class="content">.*?<span>(.*?)</span>'+
                             '.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<i class="number">(.*?)</i>',re.S)

The pattern captures six fields. Writing regex like this is tedious and error-prone, but it makes for good practice.
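To see what the six groups capture, the pattern can be tried on a small HTML fragment. The fragment below is made up for illustration: it only imitates the structure the pattern expects and is not copied from the real page. The re.S flag matters because it lets "." match newlines, so a single match can span several lines of markup.

```python
import re

pattern = re.compile(
    '<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>'
    '.*?<div class="content">.*?<span>(.*?)</span>'
    '.*?<span class="stats-vote"><i class="number">(.*?)</i>'
    '.*?<i class="number">(.*?)</i>',
    re.S,
)

# Hypothetical sample markup shaped to fit the pattern (not real site data).
sample_html = """
<h2>Alice</h2>
<div class="articleGender manIcon">25</div>
<div class="content">
<span>First line<br/>second line</span>
</div>
<span class="stats-vote"><i class="number">120</i> likes</span>
<span class="stats-comments"><i class="number">8</i> comments</span>
"""

# findall returns one tuple per post, with one element per capture group:
# (author, gender, age, content, likes, comments)
items = pattern.findall(sample_html)
author, gender, age, content, likes, comments = items[0]
```

The content group keeps the raw <br/> tag, which is why the captured text needs cleaning afterwards.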

Full code

# -*- coding:utf-8 -*-
# Python 2 script (uses urllib2 and the print statement)
import urllib2
import re

user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
headers = {'User-Agent': user_agent}

for page in range(1,5):
    url = 'http://www.qiushibaike.com/text/page/'+str(page)+'/?s=4984889'
    try:
        # Fetch the page source
        request = urllib2.Request(url,headers = headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        # Regex match: author, gender, age, content, likes, comments
        pattern = re.compile('<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>.*?<div class="content">.*?<span>(.*?)</span>'+
                             '.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<i class="number">(.*?)</i>',re.S)
        items = re.findall(pattern,content)
        for item in items:
            print u"Page %s\nAuthor: %s\tGender: %s\tAge: %s\nContent: %s\nLikes: %s\tComments: %s" % (page,item[0],item[1],item[2],item[3],item[4],item[5])
    except urllib2.URLError, e:
        # HTTP errors carry a status code; other URL errors only a reason
        if hasattr(e,"code"):
            print e.code
        if hasattr(e,"reason"):
            print e.reason
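The script above targets Python 2. In Python 3, urllib2 was split into urllib.request and urllib.error, so a rough port could look like the sketch below. The function names (fetch_page, parse_items, format_item, crawl) and the 10-second timeout are assumptions added for illustration. The URL, headers and pattern are the ones from the article. Splitting fetching from parsing also lets the regex be exercised without touching the network.

```python
import re
import urllib.error
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}
PATTERN = re.compile(
    '<h2>(.*?)</h2>.*?<div class="articleGender (.*?)Icon">(.*?)</div>'
    '.*?<div class="content">.*?<span>(.*?)</span>'
    '.*?<span class="stats-vote"><i class="number">(.*?)</i>'
    '.*?<i class="number">(.*?)</i>',
    re.S,
)


def fetch_page(page):
    """Download one listing page and return it as text."""
    url = "http://www.qiushibaike.com/text/page/%d/?s=4984889" % page
    request = urllib.request.Request(url, headers=HEADERS)
    # timeout=10 is an assumption; the original script sets no timeout.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def parse_items(html):
    """Return a list of (author, gender, age, content, likes, comments)."""
    return PATTERN.findall(html)


def format_item(page, item):
    """Render one parsed post in the same layout the article prints."""
    return ("Page %s\nAuthor: %s\tGender: %s\tAge: %s\n"
            "Content: %s\nLikes: %s\tComments: %s") % ((page,) + tuple(item))


def crawl(pages):
    """Fetch and print every post on the given pages, skipping failures."""
    for page in pages:
        try:
            html = fetch_page(page)
        except urllib.error.HTTPError as e:
            # HTTPError is a subclass of URLError, so it is caught first.
            print(e.code, e.reason)
            continue
        except urllib.error.URLError as e:
            print(e.reason)
            continue
        for item in parse_items(html):
            print(format_item(page, item))
```

Calling crawl(range(1, 5)) would walk the same four pages as the original loop. Nothing is fetched until crawl is called.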

Output

Another drawback of regex is that the captured text easily picks up stray <br/> tags, so it still needs cleaning.
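One way to do that cleaning, sketched under the assumption that the stray markup is mostly <br/> tags plus the occasional HTML entity, is a small post-processing function. The name clean_text is hypothetical, and the exact rules would need adjusting to whatever the real pages contain.

```python
import html
import re


def clean_text(raw):
    """Turn a captured HTML fragment into plain text.

    Hypothetical helper: converts <br> variants to newlines, drops any
    other tags, decodes entities such as &amp;, and trims whitespace.
    """
    text = re.sub(r"<br\s*/?>", "\n", raw, flags=re.I)
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)
    return text.strip()


cleaned = clean_text("\n\nFirst line<br/>second line &amp; more\n\n")
```

Applying clean_text to each captured field before printing keeps the regex itself unchanged.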

Article link: https://www.haomeiwen.com/subject/cbqnxxtx.html