Scrapy Framework: A CrawlSpider Crawler Example

Author: carpe_diem_c | Source: published 2017-02-09 22:23, read 3598 times

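Scrapy's spiders fall into two main classes, Spider and CrawlSpider.
This example implements the crawler with CrawlSpider.

CrawlSpider is a subclass of Spider. A plain Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule objects) that give it a convenient mechanism for following links. That makes it the better fit when the job is to extract links from crawled pages and keep crawling them.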

Command to create the project:

    scrapy startproject tencent

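Generate the spider from the crawl template. The spider is named tencent1, matching the spider file later in this post; genspider does not accept a spider name identical to the project name.

    scrapy genspider -t crawl tencent1 hr.tencent.com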

CrawlSpider inherits from Spider. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:

LinkExtractors

class scrapy.linkextractors.LinkExtractor

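The purpose of a link extractor is simple: to extract links.
Each LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects.
A link extractor is instantiated once, and its extract_links method is then called many times, once for each response that links should be extracted from.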

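        Main parameters:
        
            allow: URLs that match this regular expression (or list of regular expressions) are extracted. If it is empty, every URL matches.
            
            deny: URLs that match this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
            
            allow_domains: the domains whose links will be extracted.
            
            deny_domains: the domains whose links will never be extracted.
            
            restrict_xpaths: XPath expressions that select the regions of the page to take links from. It works together with allow to filter links.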

rules

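rules contains one or more Rule objects, and each Rule defines a specific behavior for crawling the site. If several rules match the same link, the first one is used, following the order in which the rules are defined in this collection.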
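The "first matching rule wins" behavior can be pictured with a small standalone model. This is a simplified sketch in plain Python, not Scrapy's own code: the rule names, the patterns and the function assign_links are hypothetical, and real rules use LinkExtractor objects rather than bare regular expressions.

```python
import re

# Hypothetical rules in priority order: (rule name, URL pattern).
RULES = [
    ("detail", re.compile(r"position_detail\.php")),
    ("listing", re.compile(r"position")),
]


def assign_links(urls, rules=RULES):
    """Map each URL to the first rule, in definition order, that matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:
        for url in urls:
            # A URL claimed by an earlier rule is skipped by later ones.
            if url not in seen and pattern.search(url):
                seen.add(url)
                assigned[url] = name
    return assigned


result = assign_links([
    "http://hr.tencent.com/position.php?start=10",
    "http://hr.tencent.com/position_detail.php?id=1",
])
print(result)
```

Both patterns match the detail URL, but it is assigned to "detail" because that rule is listed first. Swapping the two entries in RULES would hand every URL to "listing".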

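Parameters:

    link_extractor: a Link Extractor object that defines which links to extract.

    callback: the callback function, given as the value of this parameter, that is called for each link obtained from link_extractor. The callback receives a response as its first argument.
    
    Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if parse is overridden the crawl spider will stop working.
    
    follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
    
    process_links: specifies which function of the spider is called with the list of links obtained from link_extractor. It is mainly used for filtering.
    
    process_request: specifies which function of the spider is called for every request extracted by this rule. (Used to filter requests.)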

The example code follows:

items file

    import scrapy
    
    class TencentItem(scrapy.Item):
        # Position name
        name = scrapy.Field()
        # Link to the detail page
        positionlink = scrapy.Field()
        # Position category
        positiontype = scrapy.Field()
        # Number of openings
        peoplenum = scrapy.Field()
        # Work location
        worklocation = scrapy.Field()
        # Publication date
        publish = scrapy.Field()

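pipelines file

    import json

    class TencentPipeline(object):
    
        def __init__(self):
            # Open the output file once. The explicit encoding (Python 3)
            # lets the Chinese text be written as ordinary str.
            self.file = open("tencent.json", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
            text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.file.write(text)
            return item

        def close_spider(self, spider):
            # Called once when the spider finishes
            self.file.close()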
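One thing to be aware of: because every item is written followed by ",\n", the resulting file is a sequence of JSON objects separated by commas, not a single valid JSON document. A possible alternative, sketched below as an assumption and not the author's approach, is to write JSON Lines (one object per line), which can be read back line by line. The class name JsonLinesPipeline and the file name tencent.jl are hypothetical.

```python
import json


class JsonLinesPipeline:
    """Hypothetical variant of the pipeline above: one JSON object per line."""

    def __init__(self, path="tencent.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Each line is a complete JSON document on its own.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


def read_items(path):
    """Load the items back, one json.loads call per line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]
```

To try it in the project, the class would need its own entry in ITEM_PIPELINES, in the same way TencentPipeline is registered in the settings file.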

settings file

    BOT_NAME = 'tencent'
    
    SPIDER_MODULES = ['tencent.spiders']
    NEWSPIDER_MODULE = 'tencent.spiders'
    LOG_FILE = 'tenlog.log'
    LOG_LEVEL = 'DEBUG'
    LOG_ENCODING = 'utf-8'
    
    ROBOTSTXT_OBEY = True
    
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    }
    
    
    ITEM_PIPELINES = {
       'tencent.pipelines.TencentPipeline': 300,
    }

spider file

    # -*- coding: utf-8 -*-
    import scrapy
    # Import the link extractor class, used to pull out the links that match a rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from tencent.items import TencentItem
    
    class TencentSpider(CrawlSpider):
        name = 'tencent1'
        # Optional; when set, the crawl is limited to these domains
        allowed_domains = ['hr.tencent.com']
        start_urls = ['http://hr.tencent.com/position.php?&start=0#a']
        # Matching rule for links found in each response: keeps the pagination links
        pagelink = LinkExtractor(allow=r'start=\d+')
    
        # More than one Rule can be listed here
        rules = [
            # A Rule with a callback has follow=False by default, so
            # follow=True is set explicitly to keep following links.
            # Every link that matches the rule is requested, and the
            # callback is called to handle each response.
            # In effect, a Rule handles requests in bulk.
            Rule(pagelink, callback='parse_item', follow=True),
        ]
    
        # Do not define a parse method: CrawlSpider already implements one, and overriding it stops the spider from working
        def parse_item(self, response):
            for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
                # Store the data in a newly created item, which is used like a dict
    
                item = TencentItem()
                # Position name
                # each.xpath('./td[1]/a/text()') returns a selector list; extract() turns it into a list of strings, and [0] takes the first one
                item['name'] = each.xpath('./td[1]/a/text()').extract()[0]
                # Link to the detail page
                item['positionlink'] = each.xpath('./td[1]/a/@href').extract()[0]
                # Position category
                item['positiontype'] = each.xpath("./td[2]/text()").extract()[0]
                # Number of openings
                item['peoplenum'] = each.xpath('./td[3]/text()').extract()[0]
                # Work location
                item['worklocation'] = each.xpath('./td[4]/text()').extract()[0]
                # Publication date
                item['publish'] = each.xpath('./td[5]/text()').extract()[0]
    
                # Hand the item over to the pipeline
                yield item

Results:
http://p1.bpimg.com/4851/0bc14ca5a6c502be.png

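Reader comments

o0__0o: How do you add request headers in a CrawlSpider?
菜先生: Isn't the XPath just the one copied from Chrome?
microtex: Is there any difference between Scrapy crawlers for Python 2 and for Python 3?
carpe_diem_c: Sorry, I only saw this after such a long time. I run my own site now and no longer look after my Jianshu posts.
carpe_diem_c: I don't think there is; Scrapy is just a framework.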

