
Scraping Qiushibaike with Scrapy and Storing the Results in MySQL

Author: 我叫GTD | Published 2018-02-14 14:38

We want to scrape the text posts from Qiushibaike. The pages are fairly simple and not hard to crawl, but we first need to confirm whether the content is loaded dynamically.
Enter this command in a terminal:

scrapy shell https://www.qiushibaike.com/text/

But it only returned: Connection was closed cleanly.
I suspected the request headers were the problem, so I wrote a short piece of code to test it (actually exported from Postman).


[Screenshot: the handy tool Postman]
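The test script itself only appears in the screenshot. As a minimal sketch, a similar check with the requests library might look like the following (the header values here are illustrative, not the exact Postman export):

import requests

# Browser-like headers; illustrative values, not the exact set exported from Postman
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://www.qiushibaike.com/',
}

resp = requests.get('https://www.qiushibaike.com/text/', headers=headers)
print(resp.status_code)        # 200 means the server accepts these headers
print('article' in resp.text)  # rough check: the post markup is already present in the raw HTML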

The code worked right away and the test passed: the jokes are loaded statically.
The author and the joke text are easy to locate:


[Screenshot: page analysis]
# Hotness (not visible in the screenshot)
hot_num = int(each.css('div.stats>span.stats-vote>i.number::text').extract_first())
# Author
content['author'] = each.css('div.author.clearfix>a:nth-child(2)>h2::text').extract_first().strip()
# Joke text
content['funny_text'] = each.css('div.content>span::text').extract_first().strip()
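These selectors can be sanity-checked without a full crawl by feeding HTML fetched with requests into a Scrapy Selector; a quick, hypothetical check might look like this:

import requests
from scrapy import Selector

# Reuse a browser-like User-Agent so the site does not close the connection
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
resp = requests.get('https://www.qiushibaike.com/text/', headers=headers)

sel = Selector(text=resp.text)
for each in sel.css('div.article'):
    print(each.css('div.stats>span.stats-vote>i.number::text').extract_first(),
          each.css('div.content>span::text').extract_first())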

Build the Item from the three fields we want to scrape:

import scrapy

class FunnyItem(scrapy.Item):
    author = scrapy.Field()
    funny_text = scrapy.Field()
    hot_num = scrapy.Field()

There are also special cases, and handling special cases is exactly how you improve at writing crawlers!
The special cases here are anonymous users and the "view full text" link:


[Screenshot: special cases. The anonymous post is clearly laid out differently from a non-anonymous one, and the "view full text" link is also visible.]

Clicking "view full text" has the same effect as clicking the text itself: it opens the full-text page, whose URL can be found in the first line of the screenshot.
Main code:

import scrapy
from scrapy import Request
from ..items import FunnyItem

class Funny66Spider(scrapy.Spider):
    name = 'funny66'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    # Custom request headers
    headers = {
        'If-None-Match': "1ab399a69d7f90aff3a3ead3bafb0eba47dc4607",
        'DNT': "1",
        'Accept-Encoding': "gzip, deflate, br",
        'Accept-Language': "zh-CN,zh;q=0.9",
        'Upgrade-Insecure-Requests': "1",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        'Referer': "https://www.qiushibaike.com/",
        'Cookie': "_ga=GA1.2.614688305.1518532238; _gid=GA1.2.1133015715.1518532238; __cur_art_index=1900; _xsrf=2|9fe9f0b4|e3aec0cb1f03a169d8ca96e869adb6ea|1518577666; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1518532238,1518573213,1518577671; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1518577787",
        'Connection': "keep-alive",
        'Cache-Control': "no-cache",
    }

    def parse(self, response):
        for each in response.css('div.article'):
            content = FunnyItem()
            hot_num = int(each.css('div.stats>span.stats-vote>i.number::text').extract_first())
            if hot_num >= 666:  # filter: only keep posts with a hotness of at least 666
                content['hot_num'] = hot_num
                # Handle anonymous users separately
                if each.css('div.author.clearfix>a'):
                    content['author'] = each.css('div.author.clearfix>a:nth-child(2)>h2::text').extract_first().strip()
                else:
                    content['author'] = '匿名用户'
                # Check whether the full text is already shown on the list page
                if each.css('div.content>span.contentForAll'):
                    # Truncated post: follow the full-text page and carry the other fields along in meta
                    url = each.css('a.contentHerf::attr(href)').extract_first()
                    url = response.urljoin(url)
                    meta = {
                        'author': content['author'],
                        'hot_num': content['hot_num'],
                    }
                    yield Request(url, callback=self.parse_all, meta=meta, headers=self.headers)
                else:
                    content['funny_text'] = each.css('div.content>span::text').extract_first().strip()
                    yield content
        # Next page
        if response.css('ul.pagination>li:last-child span.next'):
            next_url = response.css('ul.pagination>li:last-child>a::attr(href)').extract_first()
            next_url = response.urljoin(next_url)
            yield Request(next_url, callback=self.parse, headers=self.headers)
    # Fetch the full text of long posts
    def parse_all(self, response):
        haha = FunnyItem()
        haha['funny_text'] = 'Long Content: ' + str(response.css('div.content::text').extract_first().strip())
        haha['author'] = response.meta['author']
        haha['hot_num'] = response.meta['hot_num']
        yield haha

The main part is done; now let's look at how to get the data into the database.
First, build the database:


[Screenshot: building the database]
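The screenshot shows the table being built in Workbench. As a sketch, a database and table matching the pipeline below could also be created from Python; the column types here are my own assumptions, and only the table name content and the column order (author, funny_text, hot_num) come from the INSERT statement used later:

import MySQLdb

# One-off setup script; adjust host/user/password to your own installation
conn = MySQLdb.connect(host='localhost', port=3306, user='root', passwd='xxxxxxxxxxxxx', charset='utf8')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS funny DEFAULT CHARACTER SET utf8')
cur.execute('''
    CREATE TABLE IF NOT EXISTS funny.content (
        author     VARCHAR(100),   -- post author (or 匿名用户)
        funny_text TEXT,           -- joke text
        hot_num    INT             -- hotness (vote count)
    )
''')
conn.commit()
conn.close()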

Personally I find Workbench much easier to use than the command-line client. I did run into one problem earlier: the MySQL service refused to start. Using MySQL Installer to uninstall the service, reboot, and reinstall it solved the problem.
Next, write pipelines.py:

import MySQLdb

class MySQLPipeline(object):
    def open_spider(self, spider):
        db = spider.settings.get('MYSQL_DB_NAME', 'funny')
        host = spider.settings.get('MYSQL_HOST', 'localhost')
        port = spider.settings.get('MYSQL_PORT', 3306)
        user = spider.settings.get('MYSQL_USER', 'root')
        passwd = spider.settings.get('MYSQL_PASSWORD', 'xxxxxxxxxxxxx')  # set your own password
        self.db_conn = MySQLdb.connect(host=host, port=port, db=db, user=user, passwd=passwd, charset='utf8')
        self.db_cur = self.db_conn.cursor()

    def close_spider(self, spider):
        self.db_conn.commit()
        self.db_conn.close()

    def process_item(self, item, spider):
        self.insert_db(item)
        return item

    def insert_db(self, item):
        values = (
            item['author'],
            item['funny_text'],
            item['hot_num'],
        )
        sql = 'INSERT INTO content VALUES(%s,%s,%s)'
        self.db_cur.execute(sql, values)

Finally, configure settings.py:

import random

BOT_NAME = 'funny'

SPIDER_MODULES = ['funny.spiders']
NEWSPIDER_MODULE = 'funny.spiders'
ROBOTSTXT_OBEY = False

MYSQL_DB_NAME = 'funny'
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'lovewangqian'

ITEM_PIPELINES = {
    'funny.pipelines.MySQLPipeline': 401,
}

FEED_EXPORT_FIELDS = ['author', 'funny_text', 'hot_num']
# Set the download delay according to your conscience (go easy on the server)
DOWNLOAD_DELAY = random.random() + random.random()
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
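
With everything configured, the spider can be run from the project root (using the spider name defined above), and the filtered posts go straight into the content table:

scrapy crawl funny66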

All right, that's it.

A note to myself:

A few days ago I was studying Linux basics, and today I picked a site at random to practice on so I don't get rusty. This time I also brought in a database, so once I finish with Linux I should really dig into databases properly.

And for everyone else:
if not 'single dog':
    print('Happy Valentine\'s Day, everyone!')
else:
    print('Keep munching that dog food and soldier on next year')

Good luck and good fortune, and may wealth come your way soon!
