美文网首页大数据 爬虫Python AI SqlPython小哥哥
Python爬虫:Scrapy从脚本运行爬虫的5种方式

Python爬虫:Scrapy从脚本运行爬虫的5种方式

作者: 14e61d025165 | 来源:发表于2019-04-27 15:33 被阅读0次

    测试环境

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># 环境一
    Python 3.6.5
    Scrapy==1.5.0

    环境二

    Python 2.7.5
    Scrapy==1.1.2
    </pre>

    一、命令行运行爬虫

    1、编写爬虫文件 baidu.py

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import Spider
    class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']
    def parse(self, response):
    self.log("run baidu")
    </pre>

    2、运行爬虫(2种方式)

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># 运行爬虫
    $ scrapy crawl baidu

    在没有创建项目的情况下运行爬虫

    $ scrapy runspider baidu.py
    </pre>

    二、文件中运行爬虫

    1、cmdline方式运行爬虫

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import cmdline, Spider
    class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']
    def parse(self, response):
    self.log("run baidu")
    if name == 'main':
    cmdline.execute("scrapy crawl baidu".split())
    </pre>

    2、CrawlerProcess方式运行爬虫

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import Spider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']
    def parse(self, response):
    self.log("run baidu")
    if name == 'main':
    # 通过方法 get_project_settings() 获取配置信息
    process = CrawlerProcess(get_project_settings())
    process.crawl(BaiduSpider)
    process.start()
    </pre>

    3、通过CrawlerRunner 运行爬虫

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import Spider
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor
    class BaiduSpider(Spider):
    name = 'baidu'
    start_urls = ['http://baidu.com/']
    def parse(self, response):
    self.log("run baidu")
    if name == 'main':
    # 直接运行控制台没有日志
    configure_logging(
    {
    'LOG_FORMAT': '%(message)s'
    }
    )
    runner = CrawlerRunner()
    d = runner.crawl(BaiduSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
    </pre>

    三、文件中运行多个爬虫

    项目中新建一个爬虫 SinaSpider

    python学习扣qun【 1004391443】,内有安装包和学习视频资料免费分享,好友都会在里面交流,分享一些学习的方法和需要注意的小细节,每天也会准时的讲一些项目实战案例,欢迎加入

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import Spider
    class SinaSpider(Spider):
    name = 'sina'
    start_urls = ['https://www.sina.com.cn/']
    def parse(self, response):
    self.log("run sina")
    </pre>

    1、cmdline方式不可以运行多个爬虫

    如果将两个语句放在一起,第一个语句执行完后程序就退出了,执行到不到第二句

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy import cmdline
    cmdline.execute("scrapy crawl baidu".split())
    cmdline.execute("scrapy crawl sina".split())
    </pre>

    记得之前我还写过一个使用 cmdline运行多个爬虫的脚本

    <bi style="box-sizing: border-box; display: block;">文章:Python爬虫:scrapy定时运行的脚本</bi>

    不过有了以下两个方法来替代,就更优雅了

    2、CrawlerProcess方式运行多个爬虫

    备注:爬虫项目文件为:

    scrapy_demo/spiders/baidu.py

    scrapy_demo/spiders/sina.py

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy.crawler import CrawlerProcess
    from scrapy_demo.spiders.baidu import BaiduSpider
    from scrapy_demo.spiders.sina import SinaSpider
    process = CrawlerProcess()
    process.crawl(BaiduSpider)
    process.crawl(SinaSpider)
    process.start()
    </pre>

    此方式运行,发现日志中中间件只启动了一次,而且发送请求基本是同时的,说明这两个爬虫运行不是独立的,可能会相互干扰

    3、通过CrawlerRunner 运行多个爬虫

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># -- coding: utf-8 --
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor
    from scrapy_demo.spiders.baidu import BaiduSpider
    from scrapy_demo.spiders.sina import SinaSpider
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(BaiduSpider)
    runner.crawl(SinaSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()
    </pre>

    此方式也只加载一次中间件,不过是逐个运行的,会减少干扰,官方文档也推荐使用此方法来运行多个爬虫

    总结

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556350265126 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    cmdline.execute 运行单个爬虫文件的配置最简单,一次配置,多次运行

    相关文章

      网友评论

        本文标题:Python爬虫:Scrapy从脚本运行爬虫的5种方式

        本文链接:https://www.haomeiwen.com/subject/qhsknqtx.html