The Scrapy command line

Author: 小董不太懂 | Published 2019-07-17 19:41 | Read 2 times


Creating a Scrapy project

(Done mainly on the command line.)
scrapy startproject <project name>
Example:

C:\Users\董贺贺>scrapy startproject hongyanhuoshui
New Scrapy project 'hongyanhuoshui', using template directory 'D:\anaconda\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\董贺贺\hongyanhuoshui

You can start your first spider with:
    cd hongyanhuoshui
    scrapy genspider example example.com

C:\Users\董贺贺>

At this point the project's directory structure has been generated:

C:\Users\董贺贺>cd hongyanhuoshui

C:\Users\董贺贺\hongyanhuoshui>tree /f
卷 Windows-SSD 的文件夹 PATH 列表
卷序列号为 0026-9AA2
C:.
│  scrapy.cfg
│
└─hongyanhuoshui
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

C:\Users\董贺贺\hongyanhuoshui>

As the hint in the first output suggests, we can now generate a spider file. We create a spider named hongyan (file hongyan.py) and then look at the project structure again:

C:\Users\董贺贺\hongyanhuoshui> scrapy genspider hongyan https://www.baidu.com
Created spider 'hongyan' using template 'basic' in module:
  hongyanhuoshui.spiders.hongyan

C:\Users\董贺贺\hongyanhuoshui>tree /f
卷 Windows-SSD 的文件夹 PATH 列表
卷序列号为 0026-9AA2
C:.
│  scrapy.cfg
│
└─hongyanhuoshui
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  hongyan.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-37.pyc
    │
    └─__pycache__
            settings.cpython-37.pyc
            __init__.cpython-37.pyc


C:\Users\董贺贺\hongyanhuoshui>

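Compared with the previous listing, the spiders folder now contains hongyan.py. This file defines the spider, so it is required before we can run a crawl.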

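Using the command line in detail

Command scope

Commands are either global or project-only. Global commands can be run from anywhere, while project-only commands work only inside a project directory.

Available tool commands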

C:\Users\董贺贺\hongyanhuoshui>scrapy -h
Scrapy 1.6.0 - project: hongyanhuoshui

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

You can also run scrapy <command> -h to see the details of a single command.

Global commands:

startproject
genspider
settings
runspider
shell
fetch
view
version

Project-only commands:

crawl
check
list
edit
parse
bench
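
The split between the two groups can be sketched in a few lines of Python. This is only an illustration, not Scrapy's own code: the helper names find_project_root and can_run are made up here, and the sketch assumes that a project is recognised by a scrapy.cfg file in the current directory or one of its parents (the file that startproject creates at the project root).

```python
from pathlib import Path
from typing import Optional

# Command groups exactly as listed above.
GLOBAL_COMMANDS = {"startproject", "genspider", "settings", "runspider",
                   "shell", "fetch", "view", "version"}
PROJECT_COMMANDS = {"crawl", "check", "list", "edit", "parse", "bench"}


def find_project_root(start: Path) -> Optional[Path]:
    """Return the closest directory at or above `start` that holds scrapy.cfg."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


def can_run(command: str, cwd: Path) -> bool:
    """Global commands run anywhere; project-only ones need a project."""
    if command in GLOBAL_COMMANDS:
        return True
    if command in PROJECT_COMMANDS:
        return find_project_root(cwd) is not None
    raise ValueError(f"unknown command: {command}")


if __name__ == "__main__":
    here = Path.cwd()
    for name in ("version", "crawl"):
        print(name, "->", can_run(name, here))
```

Run from inside hongyanhuoshui, both commands should be reported as runnable; run from an unrelated folder, only version should be.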

startproject

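Not much to say here: it creates a project. scrapy startproject <project name>

genspider

Generates a spider. Scrapy ships several templates for this and uses basic by default; the available templates can be listed with a command. Normally you enter the project directory first:

cd <project name>
scrapy genspider <spider name> <domain>

Here <domain> is a bare domain such as example.com, as in the hint printed by startproject, not a full URL.

What if we want a template other than the default? First, let's see which templates exist: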

C:\Users\董贺贺\hongyanhuoshui>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

There are four templates. To use the crawl template, for example (I created a completely new project for this):

C:\Users\董贺贺>scrapy startproject myproject
New Scrapy project 'myproject', using template directory 'D:\anaconda\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\董贺贺\myproject

You can start your first spider with:
    cd myproject
    scrapy genspider example example.com

C:\Users\董贺贺>cd myproject

C:\Users\董贺贺\myproject>scrapy genspider -t crawl baidu https://www.baidu.com
Created spider 'baidu' using template 'crawl' in module:
  myproject.spiders.baidu

The general form is: scrapy genspider -t <template> <spider name> <domain>

crawl

Starts a spider. The format is:
scrapy crawl <spider name>

C:\Users\董贺贺\myproject>scrapy crawl baidu
2019-07-17 15:02:17 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: myproject)
2019-07-17 15:02:17 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.17763-SP0
2019-07-17 15:02:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'myproject', 'NEWSPIDER_MODULE': 'myproject.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myproject.spiders']}
2019-07-17 15:02:17 [scrapy.extensions.telnet] INFO: Telnet Password: b57cecfc4210fb6c
2019-07-17 15:02:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-07-17 15:02:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-17 15:02:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-17 15:02:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-07-17 15:02:18 [scrapy.core.engine] INFO: Spider opened
2019-07-17 15:02:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-17 15:02:18 [py.warnings] WARNING: D:\anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.baidu.com in allowed_domains.
  warnings.warn(message, URLWarning)

2019-07-17 15:02:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-17 15:02:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:25 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:39 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://https/robots.txt>: DNS lookup failed: no results for hostname lookup: https.
Traceback (most recent call last):
  File "D:\anaconda\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "D:\anaconda\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "D:\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "D:\anaconda\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\anaconda\lib\site-packages\twisted\internet\endpoints.py", line 975, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.baidu.com/> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.baidu.com/> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:45 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https//www.baidu.com/> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:45 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//www.baidu.com/>
Traceback (most recent call last):
  File "D:\anaconda\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "D:\anaconda\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "D:\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "D:\anaconda\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\anaconda\lib\site-packages\twisted\internet\endpoints.py", line 975, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2019-07-17 15:02:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-17 15:02:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 6,
 'downloader/request_bytes': 1299,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 7, 17, 7, 2, 46, 72694),
 'log_count/DEBUG': 6,
 'log_count/ERROR': 2,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.internet.error.DNSLookupError': 4,
 "robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
 'robotstxt/request_count': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2019, 7, 17, 7, 2, 18, 354765)}
2019-07-17 15:02:46 [scrapy.core.engine] INFO: Spider closed (finished)

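Note that the spider name used here must match the name given to scrapy genspider when the spider was generated. The run above fails with DNSLookupError because genspider was given the full URL https://www.baidu.com instead of the bare domain www.baidu.com, which is also why the log warns that allowed_domains accepts only domains, not URLs.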
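
One way to avoid the malformed http://https... requests seen in the log is to reduce a URL to its host name before handing it to genspider. The following is a minimal sketch using only the standard library. The function names to_domain and template_start_url are hypothetical, and template_start_url merely assumes that the generated start URL is "http://" plus the genspider argument plus "/", which is what the log suggests.

```python
from urllib.parse import urlsplit


def to_domain(value: str) -> str:
    """Reduce a URL such as 'https://www.baidu.com/' to 'www.baidu.com'."""
    value = value.strip()
    # urlsplit only fills in the host when a '//' separator is present,
    # so add one for inputs that are already bare domains.
    parts = urlsplit(value if "://" in value else "//" + value)
    return parts.hostname or ""


def template_start_url(domain_argument: str) -> str:
    """Assumed construction of the start URL: 'http://' + argument + '/'."""
    return "http://" + domain_argument + "/"


if __name__ == "__main__":
    # What happened in the log: a full URL was used as the domain argument.
    print(template_start_url("https://www.baidu.com"))
    # What was intended: a bare domain.
    print(template_start_url(to_domain("https://www.baidu.com")))
```

With the bare domain, the command becomes scrapy genspider -t crawl baidu www.baidu.com.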

check

Runs the contract checks defined on the project's spiders: scrapy check

C:\Users\董贺贺\myproject>scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK

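Zero contracts ran because the generated spider defines none, so the check reports OK.

list

Lists all spiders available in the project.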

C:\Users\董贺贺\myproject>scrapy list
baidu

edit

Edits a spider from the command line. Not recommended.
scrapy edit myspider

fetch

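scrapy fetch <url>
This command downloads the page source with the Scrapy downloader and prints it.
It takes a few options:
--nolog do not print the log
--headers print the HTTP headers instead of the body
--no-redirect do not follow redirects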

# fetch: prints the log and the page source
scrapy fetch http://www.baidu.com

# fetch --nolog: prints only the page source
scrapy fetch --nolog http://www.baidu.com

# fetch --nolog --headers: prints the headers
scrapy fetch --nolog --headers http://www.baidu.com

# fetch --nolog --no-redirect: does not follow redirects
scrapy fetch --nolog --no-redirect http://www.baidu.com
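
If the fetch command needs to be driven from a script, the options above can be assembled programmatically. This is a rough sketch, not part of Scrapy: fetch_command and run_fetch are hypothetical helper names, and run_fetch assumes the scrapy executable is on the PATH.

```python
import subprocess
from typing import List


def fetch_command(url: str, nolog: bool = False, headers: bool = False,
                  follow_redirects: bool = True) -> List[str]:
    """Build the argument list for `scrapy fetch` from the options above."""
    command = ["scrapy", "fetch"]
    if nolog:
        command.append("--nolog")
    if headers:
        command.append("--headers")
    if not follow_redirects:
        command.append("--no-redirect")
    command.append(url)
    return command


def run_fetch(url: str, **options: bool) -> str:
    """Run `scrapy fetch` and return what it wrote to stdout."""
    result = subprocess.run(fetch_command(url, **options),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, run_fetch("http://www.baidu.com", nolog=True) should return the same page source that the second command above prints.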

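view

scrapy view <url>
This command downloads the page and opens it in a browser, showing it as Scrapy sees it.
Example: scrapy view http://news.163.com

shell

This is an interactive console.
Run scrapy shell <url> to enter it.
Inside it we can use CSS selectors and XPath selectors to extract the content we want.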

C:\Users\董贺贺\myproject>scrapy shell www.baidu.com// --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000013F362D2CC0>
[s]   item       {}
[s]   request    <GET http://www.baidu.com//>
[s]   settings   <scrapy.settings.Settings object at 0x0000013F362D2BE0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

Type exit to leave the interactive console.
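
Inside the shell, selector queries are written against the response object, for example response.xpath('//title/text()').extract_first() or response.css('title::text').extract_first(). To get a feel for XPath-style selection without starting Scrapy, the sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a small XPath subset and requires well-formed markup. The sample page is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (an assumption; real pages are rarely valid XML).
PAGE = """<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/news">News</a>
    <a href="/maps">Maps</a>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Rough counterpart of response.xpath('//title/text()') in the Scrapy shell.
title = root.findtext(".//title")

# Rough counterpart of response.xpath('//a/@href'): collect every link target.
links = [a.get("href") for a in root.iterfind(".//a")]

print(title)
print(links)
```

Real pages should still be explored in scrapy shell itself, since its selectors tolerate malformed HTML and support full XPath and CSS.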

settings

Gets the current settings.
scrapy settings -h shows the full help for this command.

C:\Users\董贺贺\myproject>scrapy settings -h
Usage
=====
  scrapy settings [options]

Get settings values

Options
=======
--help, -h              show this help message and exit
--get=SETTING           print raw setting value
--getbool=SETTING       print setting value, interpreted as a boolean
--getint=SETTING        print setting value, interpreted as an integer
--getfloat=SETTING      print setting value, interpreted as a float
--getlist=SETTING       print setting value, interpreted as a list

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

runspider

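Unlike crawl, which takes a spider name, runspider takes a spider file: scrapy runspider <spider file>
In a project, the spider files all live in the spiders folder inside the project directory.

version

Prints the Scrapy version; with -v it also prints the versions of the dependencies.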

C:\Users\董贺贺\myproject>scrapy version
Scrapy 1.6.0

C:\Users\董贺贺\myproject>scrapy version -v
Scrapy       : 1.6.0
lxml         : 4.3.0.0
libxml2      : 2.9.8
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.20.0
Twisted      : 19.2.0
Python       : 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Windows-10-10.0.17763-SP0

Of all these commands, shell is the one to focus on.
