An Introduction to scrapy and scrapy-redis

Author: 没心没肺最开心 | Published 2021-12-07 10:17

    Outline

    1. Introduction

    1.1 About Scrapy

    Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

    1.2 About scrapy-redis

    Redis-based distributed crawling built on top of Scrapy.
    The first feature in its official introduction: you can start multiple spider instances that share a single redis queue. It is best suited for broad multi-domain crawls.
    

    Versions used in this walkthrough

    Python 3.7.7
    Scrapy 2.5.1
    scrapy-redis 0.7.1
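
    These can be installed into a clean virtualenv with pip (loguru is included here because the demo code below uses it for logging):

    pip install scrapy==2.5.1 scrapy-redis==0.7.1 loguru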

    The HTML to be crawled; it only needs to be served on localhost:

    <h1>企业列表</h1>
    
    <div class="quotes">
        <div class="word">小米科技有限责任公司 @北京</div>
        <div class="author"> --<a href="./detail_1">详情</a></div>
    </div>
    
    <div class="quotes">
        <div class="word">小米有品科技有限公司@江苏</div>
        <div class="author"> --<a href="./detail_2">详情</a></div>
    </div>
    

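    If the web_demo Node project is not at hand, a rough stand-in can be written with Python's standard library (an assumption for convenience, not the original index.js; it serves the markup above on port 3000 for every path):

    # serve_demo.py -- hypothetical stand-in for web_demo/index.js
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = """<h1>企业列表</h1>
    <div class="quotes">
        <div class="word">小米科技有限责任公司 @北京</div>
        <div class="author"> --<a href="./detail_1">详情</a></div>
    </div>
    <div class="quotes">
        <div class="word">小米有品科技有限公司@江苏</div>
        <div class="author"> --<a href="./detail_2">详情</a></div>
    </div>"""


    class DemoHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Always return the demo page, whatever the requested path is.
            body = PAGE.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)


    HTTPServer(("localhost", 3000), DemoHandler).serve_forever()
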
    2. Basic crawl workflow and goal

    Prepare the crawl tasks (input) -> the spider consumes them -> clean and store the results (output)

    Goal: crawl the index page data and extract each company's region

    Input: a URL

    Output: JSON of the form

    {
        "company_name":"xxxx",
        "province":"xxx"
    }
    

    3. Scrapy spider demo: writing output to a file

    • Start the Node.js demo server

    cd ./web_demo
    node index.js

    • Create a crawler project
      (scrapy -h lists all available commands; the generated project layout is shown after the commands below)

    scrapy startproject company

    cd company
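
    startproject generates roughly the following layout; the files edited in later sections (items.py, pipelines.py, settings.py and the spiders/ package) all live under the inner company/ directory:

    company/
    ├── scrapy.cfg
    └── company/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py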

    • Create a spider

    scrapy genspider companyindex localhost

    • Edit the spider file
    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log
    
    
    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']
    
        def parse(self, response):
            log.info(response.body)
    
    
    • Run the spider

    scrapy crawl companyindex --nolog

    We can see that the body of the target page has been fetched.

    • Parse the page to extract the base data
      The scrapy shell tool is handy for trying out selectors; the expected result is shown after the commands below

    scrapy shell 'http://localhost:3000'

    response.xpath('//div[@class="word"]/text()').getall()
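
    Against the demo page above, that expression returns the text of both word divs:

    ['小米科技有限责任公司 @北京', '小米有品科技有限公司@江苏']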

    Continue editing the companyindex spider:

    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log
    
    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']
        def parse(self, response):
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                company_name,province = context_item.split('@')
                re_data = {
                    "company_name":company_name,
                    "province":province
                }
    
                log.info(re_data)
                yield re_data
    

    Run from the command line:

    scrapy crawl companyindex --nolog -o companys.jl # write the spider output to a file
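
    companys.jl then contains one JSON object per line, roughly as below (depending on FEED_EXPORT_ENCODING, non-ASCII characters may be written as \uXXXX escapes). Note the trailing space that split('@') leaves in the first company name; the item processors in the next section strip it:

    {"company_name": "小米科技有限责任公司 ", "province": "北京"}
    {"company_name": "小米有品科技有限公司", "province": "江苏"}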

    4. Scrapy spider demo: defining and decorating the output

    Edit items.py:

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    from itemloaders.processors import Join, MapCompose, TakeFirst
    
    def add_trim(value):
        # Strip surrounding whitespace from each extracted value.
        return value.strip()

    class CompanyItem(scrapy.Item):
        # define the fields for your item here like:
        company_name = scrapy.Field(
            input_processor=MapCompose(add_trim),
            output_processor=TakeFirst()
    
        )
        tag = scrapy.Field(
            output_processor=Join(',')
        )
        province = scrapy.Field(
            output_processor=TakeFirst()
        )
    
    
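    To see what these processors do, they can be called directly in a Python shell (a quick illustration, not part of the project code):

    from itemloaders.processors import Join, MapCompose, TakeFirst

    MapCompose(str.strip)(['  小米科技有限责任公司  '])  # -> ['小米科技有限责任公司']  (input processor: applied to every value)
    TakeFirst()(['北京', '江苏'])                        # -> '北京'  (output processor: first non-empty value)
    Join(',')(['test', '20211125'])                      # -> 'test,20211125'  (output processor: values joined into one string)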

    Edit the spider file:

    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log
    from company.items import CompanyItem
    from scrapy.loader import ItemLoader
    
    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']
    
        def parse(self, response):
            
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                l = ItemLoader(item=CompanyItem(), response=response)
                company_name,province = context_item.split('@')
                l.add_value("company_name",company_name)
                l.add_value("tag",'test') # add a crawl tag: environment
                l.add_value("tag",'20211125') # add a crawl tag: date (YYYYMMDD)
                l.add_value("province",province)
    
                yield l.load_item()
    
    

    Run the command again:

    scrapy crawl companyindex --nolog -o companys.jl
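
    Each line in companys.jl now carries the trimmed company name and the joined tags, roughly:

    {"company_name": "小米科技有限责任公司", "tag": "test,20211125", "province": "北京"}
    {"company_name": "小米有品科技有限公司", "tag": "test,20211125", "province": "江苏"}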

    5. Scrapy spider demo: storing items through an Item Pipeline

    So far everything has been written to a file via the -o option. How can the output be sent to another medium instead, such as MySQL? Scrapy's Item Pipeline is the mechanism for that.

    • Enable the ITEM_PIPELINES setting in settings.py
    ITEM_PIPELINES = {
       'company.pipelines.CompanyPipeline': 300,
    }
    

    The number after the pipeline class is its priority: lower values run earlier.

    Suppose the scraped items should be written to a designated file.

    Edit pipelines.py as follows:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from itemadapter import ItemAdapter
    import json
    class CompanyPipeline(object):
    
        def open_spider(self, spider):
            self.file = open('company_items.jl', 'w')
    
        def close_spider(self, spider):
            self.file.close()
    
    
        def process_item(self, item, spider):
            line = json.dumps(ItemAdapter(item).asdict()) + "\n"
            self.file.write(line)
            return item
    
    

    Run:

    scrapy crawl companyindex

    6. scrapy-redis demo

    • The common approach found online is to override the make_request_from_data method

    • File: companyindex.py

    # -*- coding: utf-8 -*-
    from loguru import logger as log
    from company.items import CompanyItem
    from scrapy.loader import ItemLoader
    from scrapy_redis.spiders import RedisSpider
    from scrapy.http import Request
    import json
    
    class CompanyindexSpider(RedisSpider):
        name = 'companyindex'
        allowed_domains = ['localhost']
    
        def make_request_from_data(self, data):
            try:
                task_params = json.loads(data)
                log.info(task_params)
                return self.make_requests_from_url(task_params)
    
            except Exception:
                log.info('parse json error')
                return None
                
        def make_requests_from_url(self, task_params):
            """Build a Request from the task parameters pushed to redis."""
            url = task_params.get("url")
            log.info(f"task url: {url}")
            return Request(url, dont_filter=True, callback=self.company_parse)
    
    
        def company_parse(self, response):
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                l = ItemLoader(item=CompanyItem(), response=response)
                company_name,province = context_item.split('@')
                l.add_value("company_name",company_name)
                l.add_value("tag",'test')
                l.add_value("tag",'20211125')
                l.add_value("province",province)
    
                yield l.load_item()
    
    
    

    settings.py:

    
    BOT_NAME = 'company'
    
    SPIDER_MODULES = ['company.spiders']
    NEWSPIDER_MODULE = 'company.spiders'
    ROBOTSTXT_OBEY = True
    REDIS_URL = 'redis://127.0.0.1:2888/7'
    
    ITEM_PIPELINES = {
       'company.pipelines.CompanyPipeline': 300,
       'scrapy_redis.pipelines.RedisPipeline': 301
    }
    REDIS_START_URLS_KEY = "scrapy_companyindex_spider"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    STATS_CLASS = "scrapy_redis.stats.RedisStatsCollector"
    
    
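    With these settings the spider no longer defines start_urls: it blocks and waits for tasks on the REDIS_START_URLS_KEY list. A minimal sketch of seeding one task with the redis-py client, assuming redis is reachable at the REDIS_URL configured above:

    # push_task.py -- sketch: seed one crawl task into the shared queue
    import json

    import redis  # redis-py client, assumed to be installed

    # Same connection URL and list key as in settings.py above.
    r = redis.Redis.from_url("redis://127.0.0.1:2888/7")
    r.lpush("scrapy_companyindex_spider", json.dumps({"url": "http://localhost:3000"}))

    Any number of scrapy crawl companyindex processes can then be started; they all pop tasks from this single list, which is exactly the "multiple spider instances sharing a single redis queue" behaviour described in section 1.2.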

    pipelines.py:

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from itemadapter import ItemAdapter
    import json
    from loguru import logger as log
    
    class CompanyPipeline(object):
    
        def open_spider(self, spider):
            log.info(" CompanyPipeline open_spider--------")
    
            self.file = open('company_items.jl', 'w')
    
        # def close_spider(self, spider):
            # log.info(" CompanyPipeline close_spider--------")
    
        def process_item(self, item, spider):
            log.info(" CompanyPipeline process_item--------")
            line = json.dumps(ItemAdapter(item).asdict()) + "\n"
            self.file.write(line)
            # A RedisSpider keeps running and waiting for new tasks, so flush here instead of closing the file.
            self.file.flush()
    
            return item
    

    Other topics

    • Middleware: spider middleware and downloader middleware

    • Speed control (throttling)

    • How to crawl aggregate pages, e.g. when a detail page contains several resource links?

    Use Request.meta to carry data from the listing page to the detail requests, as sketched below.
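
    A rough sketch of this pattern against the demo site (the detail-page extraction here is hypothetical, since the detail pages are not shown in this article):

    # -*- coding: utf-8 -*-
    import scrapy


    class CompanyDetailSpider(scrapy.Spider):
        # Hypothetical spider, separate from the companyindex spider above.
        name = 'companydetail'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']

        def parse(self, response):
            for quote in response.xpath('//div[@class="quotes"]'):
                company_name, province = quote.xpath('./div[@class="word"]/text()').get().split('@')
                detail_url = quote.xpath('./div[@class="author"]/a/@href').get()
                # Carry the fields already extracted from the index page along with the detail request.
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={"company_name": company_name.strip(), "province": province.strip()},
                )

        def parse_detail(self, response):
            # Read the carried fields back from response.meta and combine them with detail-page data.
            yield {
                "company_name": response.meta["company_name"],
                "province": response.meta["province"],
                # Hypothetical: collect whatever resource links the detail page exposes.
                "resource_links": response.xpath('//a/@href').getall(),
            }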
