Scrapy Learning Notes (3) - Looping Crawls and Database Operations

Author: leeyis | Published 2018-03-07 09:31

    Preface

    System environment: CentOS 7

    This article assumes you have already installed virtualenv and activated the virtual environment ENV1; if not, see: Creating a Python sandbox (virtual) environment with virtualenv. In the previous article (Scrapy Learning Notes (2) - Running the first spider in a virtual environment with PyCharm), we used Scrapy's command-line tool to create a project and a spider, wrote the code in PyCharm, ran the spider in the virtual environment to scrape the article and author information from http://quotes.toscrape.com/, and saved the results to a txt file. That spider could only scrape a single page; today we build on it.

    Goal

    Follow the next (next page) links to crawl the article and author information from http://quotes.toscrape.com/ page by page, and save the results to a MySQL database.

    Walkthrough

    1. Since we will use Python to work with the MySQL database, we first need to install the relevant Python module; this article uses MySQLdb (provided by the MySQL-python package).

    # sudo yum install mysql-devel
    # pip install MySQL-python
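
    To confirm the driver installed correctly, here is a quick, optional sanity check (not part of the original steps) that imports MySQLdb and connects with the same credentials the pipeline below uses:

    # verify_mysql.py - optional check; reuses the pipeline's connection details
    import MySQLdb

    print MySQLdb.__version__  # driver version (Python 2 print statement)
    conn = MySQLdb.connect(host="192.168.0.107", user="root", passwd="123456",
                           db="scrapy", charset="utf8")
    print conn.get_server_info()  # MySQL server version
    conn.close()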

    2. Create the target table quotes in the database; the CREATE TABLE statement is as follows:

    CREATE TABLE `quotes` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `article` varchar(500) DEFAULT NULL,
      `author` varchar(50) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

    3. The complete items.py is as follows:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy

    class QuotesItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        article = scrapy.Field()
        author = scrapy.Field()

    4. Modify quotes_spider.py as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import QuotesItem
    from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse
    from scrapy.http import Request

    class QuotesSpiderSpider(scrapy.Spider):
        name = "quotes_spider"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ['http://quotes.toscrape.com']

        def parse(self, response):
            articles = response.xpath("//div[@class='quote']")
            next_page = response.xpath("//li[@class='next']/a/@href").extract_first()
            for article in articles:
                item = QuotesItem()
                content = article.xpath("span[@class='text']/text()").extract_first()
                author = article.xpath("span/small[@class='author']/text()").extract_first()
                item['article'] = content.encode('utf-8')
                item['author'] = author.encode('utf-8')
                yield item  # yield returns the item without stopping the generator
            if next_page:  # follow the "next" link if one exists
                url = urljoin(self.start_urls[0], next_page)  # build the absolute url
                yield Request(url, callback=self.parse)
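
    For reference (not in the original tutorial): Scrapy 1.4 and later provide response.follow(), which resolves the relative href itself, so the urljoin/Request pair above can be written more compactly. A minimal sketch, assuming such a Scrapy version:

    # inside parse(), equivalent pagination with response.follow (Scrapy >= 1.4),
    # which resolves the relative href against the current page URL
    if next_page:
        yield response.follow(next_page, callback=self.parse)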

    5. Modify pipelines.py to save the scraped data to the database:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    from twisted.enterprise import adbapi
    import MySQLdb
    import MySQLdb.cursors

    class QuotesPipeline(object):
        def __init__(self):
            db_args = dict(
                host="192.168.0.107",  # database host ip
                db="scrapy",           # database name
                user="root",           # user name
                passwd="123456",       # password
                charset='utf8',        # connection character set
                cursorclass=MySQLdb.cursors.DictCursor,  # return rows as dicts
                use_unicode=True,
            )
            self.dbpool = adbapi.ConnectionPool('MySQLdb', **db_args)

        def process_item(self, item, spider):
            self.dbpool.runInteraction(self.insert_into_quotes, item)
            return item

        def insert_into_quotes(self, conn, item):
            # runInteraction passes a transaction object that behaves like a cursor
            conn.execute(
                '''
                INSERT INTO quotes(article,author)
                VALUES(%s,%s)
                ''',
                (item['article'], item['author'])
            )
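
    One optional refinement (not in the original code): runInteraction returns a Twisted Deferred, so the pipeline can attach an errback to log failed inserts instead of dropping them silently. A minimal sketch of how process_item could be extended:

        def process_item(self, item, spider):
            d = self.dbpool.runInteraction(self.insert_into_quotes, item)
            d.addErrback(self.handle_error, item, spider)  # log database errors
            return item

        def handle_error(self, failure, item, spider):
            # failure is a twisted Failure wrapping the exception raised in the insert
            spider.logger.error("MySQL insert failed for %r: %s", item, failure)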

    6. In settings.py, enable the pipeline by adding it to ITEM_PIPELINES; the rest of the file stays as generated:

    # -*- coding: utf-8 -*-

    # Scrapy settings for quotes project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #    http://doc.scrapy.org/en/latest/topics/settings.html
    #    http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #    http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'quotes'

    SPIDER_MODULES = ['quotes.spiders']
    NEWSPIDER_MODULE = 'quotes.spiders'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
      'quotes.pipelines.QuotesPipeline': 300,
    }
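
    As a further optional refinement (not part of this tutorial), the hard-coded connection details in QuotesPipeline could live in settings.py and be read through the pipeline's from_crawler hook. A minimal sketch, assuming setting names such as MYSQL_HOST that you define yourself:

    # settings.py (hypothetical extra settings)
    MYSQL_HOST = "192.168.0.107"
    MYSQL_DB = "scrapy"
    MYSQL_USER = "root"
    MYSQL_PASSWD = "123456"

    # pipelines.py (sketch of the settings-driven variant)
    from twisted.enterprise import adbapi
    import MySQLdb.cursors

    class QuotesPipeline(object):
        def __init__(self, db_args):
            self.dbpool = adbapi.ConnectionPool('MySQLdb', **db_args)

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this with the running crawler, giving access to settings
            s = crawler.settings
            return cls(dict(
                host=s.get('MYSQL_HOST'),
                db=s.get('MYSQL_DB'),
                user=s.get('MYSQL_USER'),
                passwd=s.get('MYSQL_PASSWD'),
                charset='utf8',
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True,
            ))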

    7. Run the spider:

    (ENV1) [eason@localhost quotes]$ scrapy crawl quotes_spider

    8. Check the results. Done!
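
    One way to spot-check the data from Python (a sketch reusing the same connection details; quotes.toscrape.com spreads its quotes over ten pages, so expect roughly a hundred rows):

    # check_quotes.py - count the rows and print one sample record
    import MySQLdb

    conn = MySQLdb.connect(host="192.168.0.107", user="root", passwd="123456",
                           db="scrapy", charset="utf8")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM quotes")
    print "rows:", cur.fetchone()[0]
    cur.execute("SELECT article, author FROM quotes LIMIT 1")
    print cur.fetchone()
    conn.close()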

    More original articles are available on the 金笔头博客 blog.
