2019-07-05 Scrapy Crawler Framework Setup

Author: hcc_9bf4 | Published 2019-07-05 14:18

    Install the scrapy package:
    pip install scrapy
    The install may error out; on Python 3 you have to download the Twisted dependency manually and install it first.

    Download it from: https://pypi.org/simple/twisted/
    or: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

    After downloading, save the wheel somewhere handy, e.g. the desktop: Twisted-19.2.1-cp37-cp37m-win_amd64.whl (pick the wheel matching your Python version and architecture; cp37 means CPython 3.7, win_amd64 means 64-bit Windows).

    pip install C:\Users\Administrator\Desktop\Twisted-19.2.1-cp37-cp37m-win_amd64.whl

    Run pip install scrapy again; output like the following means all dependencies are installed:

    D:\>pip install scrapy
    Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
    Requirement already satisfied: scrapy in d:\python\lib\site-packages (1.6.0)
    Requirement already satisfied: Twisted>=13.1.0 in d:\python\lib\site-packages (from scrapy) (19.2.1)
    Requirement already satisfied: parsel>=1.5 in d:\python\lib\site-packages (from scrapy) (1.5.1)
    Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python\lib\site-packages (from scrapy) (2.0.5)
    Requirement already satisfied: w3lib>=1.17.0 in d:\python\lib\site-packages (from scrapy) (1.20.0)
    Requirement already satisfied: queuelib in d:\python\lib\site-packages (from scrapy) (1.5.0)
    Requirement already satisfied: cssselect>=0.9 in d:\python\lib\site-packages (from scrapy) (1.0.3)
    Requirement already satisfied: pyOpenSSL in d:\python\lib\site-packages (from scrapy) (19.0.0)
    Requirement already satisfied: lxml in d:\python\lib\site-packages (from scrapy) (4.3.4)
    Requirement already satisfied: service-identity in d:\python\lib\site-packages (from scrapy) (18.1.0)
    Requirement already satisfied: six>=1.5.2 in d:\python\lib\site-packages (from scrapy) (1.12.0)
    Requirement already satisfied: hyperlink>=17.1.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.0.0)
    Requirement already satisfied: zope.interface>=4.4.2 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (4.6.0)
    Requirement already satisfied: attrs>=17.4.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.1.0)
    Requirement already satisfied: PyHamcrest>=1.9.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (1.9.0)
    Requirement already satisfied: constantly>=15.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (15.1.0)
    Requirement already satisfied: incremental>=16.10.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (17.5.0)
    Requirement already satisfied: Automat>=0.3.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (0.7.0)
    Requirement already satisfied: cryptography>=2.3 in d:\python\lib\site-packages (from pyOpenSSL->scrapy) (2.7)
    Requirement already satisfied: pyasn1-modules in d:\python\lib\site-packages (from service-identity->scrapy) (0.2.5)
    Requirement already satisfied: pyasn1 in d:\python\lib\site-packages (from service-identity->scrapy) (0.4.5)
    Requirement already satisfied: idna>=2.5 in d:\python\lib\site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy) (2.8)
    Requirement already satisfied: setuptools in d:\python\lib\site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (40.8.0)
    Requirement already satisfied: asn1crypto>=0.21.0 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (0.24.0)
    Requirement already satisfied: cffi!=1.11.3,>=1.8 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (1.12.3)
    Requirement already satisfied: pycparser in d:\python\lib\site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.3->pyOpenSSL->scrapy) (2.19)
    
    If you see output like the screenshot in the original post (omitted here), scrapy is installed successfully.
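
    Independent of the screenshot, scrapy's own version command is a quick sanity check (version number per the crawl log further down):

    D:\>scrapy version
    Scrapy 1.6.0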

    The scrapy project can be created in any directory on any drive.
    Change to, say, D:\ and run scrapy startproject Tencent

    This tells scrapy to create a project named Tencent; a Tencent folder appears under D:\ (screenshots omitted).
    The project's configuration files live in D:\Tencent\Tencent, and that folder contains the framework's main files.
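
    For reference, scrapy startproject generates a layout like this (Scrapy 1.6):

    Tencent/
        scrapy.cfg            # deploy configuration
        Tencent/              # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider/downloader middleware
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # your spiders go here
                __init__.py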
    1. items.py: defines the items to crawl, i.e. makes the target fields explicit, such as position name and work location (you write this yourself).


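    The original screenshot of items.py is not reproduced here, but judging from the fields the spider below fills in, it would look roughly like this:

    # -*- coding: utf-8 -*-
    import scrapy


    class TencentItem(scrapy.Item):
        # one Field per value the spider extracts
        positionName = scrapy.Field()  # job title
        positionLink = scrapy.Field()  # link to the posting
        positionType = scrapy.Field()  # responsibilities text
        worklocation = scrapy.Field()  # work location
        publishTime = scrapy.Field()   # last update time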
    2. middlewares.py: the spider and downloader middleware; rarely used. The generated defaults are already complete and need no changes (nothing to set up yourself):
    # -*- coding: utf-8 -*-
    
    # Define here the models for your spider middleware
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    from scrapy import signals
    
    
    class TencentSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
    
            # Should return None or raise an exception.
            return None
    
        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
    
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i
    
        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
    
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass
    
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
    
            # Must return only requests (not items).
            for r in start_requests:
                yield r
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class TencentDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            return None
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
        # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    3. pipelines.py: the pipeline file; edit it to control how items are stored (you configure this yourself):
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import json


    class TencentPipeline(object):
        def __init__(self):
            # one output file for the whole crawl
            self.f = open("tencent.csv", "w", encoding='utf8')

        def process_item(self, item, spider):
            # serialize each item as one JSON object per line
            content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.f.write(content)
            return item

        def close_spider(self, spider):
            self.f.close()
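
    With this process_item, each line of tencent.csv is one JSON object followed by a comma. Using values from the crawl log further down, a line looks roughly like:

    {"positionName": "22989-Serverless前端架构师", "positionLink": "http://careers.tencent.com/jobdesc.html?postId=01147013579229106176", "positionType": "负责腾讯 Serverless 平台战略目标规划...", "worklocation": "深圳", "publishTime": "2019年07月05日"},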
    
    4. settings.py: turn individual settings on or off as needed; most are off by default. For example, the pipeline registration ships commented out (the screenshot showed it in its default, disabled state), and you must enable it for the pipeline above to run.
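
    Uncommented, that block looks like this (the class path matches the pipeline the crawl log below shows as enabled; 300 is the default priority):

    ITEM_PIPELINES = {
        'Tencent.pipelines.TencentPipeline': 300,
    }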

    Create the spider file: scrapy genspider tencent "tencent.com"

    The spider we need to edit is spiders/tencent.py.
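
    genspider starts you off with a skeleton roughly like the following (Scrapy 1.6 default template); the code below replaces it entirely:

    # -*- coding: utf-8 -*-
    import scrapy


    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        allowed_domains = ['tencent.com']
        start_urls = ['http://tencent.com/']

        def parse(self, response):
            pass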

    tencent.py:

    # -*- coding: utf-8 -*-
    import json

    import scrapy

    from Tencent.items import TencentItem


    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        # allowed_domains = ['tencent.com']

        # paginated JSON API behind the careers site; pageIndex is filled in per request
        baseurl = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
        offset = 1
        start_urls = [baseurl.format(offset)]

        def parse(self, response):
            # the API returns JSON; the job postings sit under Data -> Posts
            job_items = json.loads(response.body.decode())['Data']['Posts']

            for job_item in job_items:
                item = TencentItem()
                item['positionName'] = job_item["RecruitPostName"]
                item['positionLink'] = job_item["PostURL"] + job_item["PostId"]
                item['positionType'] = job_item["Responsibility"]
                item['worklocation'] = job_item["LocationName"]
                item['publishTime'] = job_item["LastUpdateTime"]
                yield item

            # request the next page until page 430
            if self.offset < 430:
                self.offset += 1
                yield scrapy.Request(
                    url=self.baseurl.format(self.offset),
                    callback=self.parse,
                )
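
    For orientation, the JSON the careers API returns has roughly this shape (only the keys the spider reads are shown; values are illustrative, taken from the crawl log below):

    {
        "Data": {
            "Posts": [
                {
                    "RecruitPostName": "22989-Serverless前端架构师",
                    "PostURL": "http://careers.tencent.com/jobdesc.html?postId=",
                    "PostId": "01147013579229106176",
                    "Responsibility": "负责腾讯 Serverless 平台战略目标规划...",
                    "LocationName": "深圳",
                    "LastUpdateTime": "2019年07月05日"
                }
            ]
        }
    }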
    

    Run the spider (from inside the project directory): scrapy crawl tencent
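
    As an aside, for quick dumps Scrapy's built-in feed export can write items to a file without any pipeline at all:

    scrapy crawl tencent -o tencent.json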

    D:\Tencent\Tencent\spiders>scrapy crawl tencent
    2019-07-05 13:56:28 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Tencent)
    2019-07-05 13:56:28 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
    2019-07-05 13:56:28 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tencent', 'NEWSPIDER_MODULE': 'Tencent.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tencent.spiders']}
    2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet Password: 1f4cb6e4d1fc4caa
    2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled item pipelines:
    ['Tencent.pipelines.TencentPipeline']
    2019-07-05 13:56:28 [scrapy.core.engine] INFO: Spider opened
    2019-07-05 13:56:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2019-07-05 13:56:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://careers.tencent.com/404.html> from <GET https://careers.tencent.com/robots.txt>
    2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/404.html> (referer: None)
    2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn> (referer: None)
    2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
    {'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013579229106176',
     'positionName': '22989-Serverless前端架构师',
     'positionType': '负责腾讯 Serverless 平台战略目标规划、整体平台产品能力设计;\n'
                     '负责探索前端技术与 Serverless 的结合落地,包括不限于腾讯大前端架构建设,公共组件的设计, '
                     'Serverless 的前端应用场景落地;\n'
                     '负责分析 Serverless 客户复杂应用场景的具体实现(小程序,Node.js);\n'
                     '负责 Serverless 场景中 Node.js 以及微信小程序相关生态建设。',
     'publishTime': '2019年07月05日',
     'worklocation': '深圳'}
    2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
    {'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013576054018048',
     'positionName': '22989-语音通信研发工程师(深圳)',
     'positionType': '负责腾讯云通信号码保护、企业总机、呼叫中心、融合通信产品开发;\n'
                     '负责融合通信PaaS平台的构建和优化;\n'
                     '负责通话质量分析和调优;',
     'publishTime': '2019年07月05日',
     'worklocation': '深圳'}
    2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
    {'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231766955960606721123176695596060672',
     'positionName': '18435-合规反洗钱岗',
     'positionType': '1、根据反洗钱法律法规及监管规定的要求,完善落实反洗钱工作,指导各业务部门、分支机构开展反洗钱工作,支 持反洗钱监管沟通及监管报告反馈工作;\n'
                     '2、制定与完善内部反洗钱配套制度与流程,推动公司反洗钱标准化及流程化建设;\n'
                     '3、熟悉监管部门各项反洗钱政策制度要求,能就日常产品业务及合同及时进行反洗钱合规评审;\n'
                     '4、开展对各业务部门、分支机构的反洗钱合规自查工作,跟进缺陷问题;\n'
                     '5、根据反洗钱法律法规及监管规定的更新情况,及时对各业务部门进行法规解读,并追踪落实;\n'
                     '6、重点项目的跟进及推动工作;\n'
                     '7、领导交办的其他工作。',
     'publishTime': '2019年07月05日',
     'worklocation': '深圳总部'}
    2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
    {'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231779032200683521123177903220068352',
     'positionName': '25927-游戏测试项目经理',
     'positionType': '负责项目计划和迭代计划的制定、跟进和总结回顾,推动产品需求、运营需求和技术需求的落地执行,排除障碍,确保交付时间和质量;\n'
                     '负责跟合作有关部门和团队对接,确保内部外部团队高效协同工作;\n'
                     '不断优化项目流程规范;,及时发现并跟踪解决项目问题,有效管理项目风险。',
     'publishTime': '2019年07月05日',
     'worklocation': '深圳总部'}
    
    Space is limited, so the remaining output was shown only as screenshots in the original post (omitted here).

    Summary:
    Steps for writing a scrapy crawler:
    1. scrapy startproject XXXX
    2. scrapy genspider xxxx "xxx.com"
    3. Edit items.py to declare the data you want to extract
    4. Edit xxxx.py under spiders/ to handle requests and responses and extract the data (yield item)
    5. Edit pipelines.py to process the items the spider returns, e.g. persist them locally
    6. Edit settings.py to enable the pipeline component, ITEM_PIPELINES = {...}, and any other settings
    7. Run the crawler: scrapy crawl xxxx
