Getting started with web scraping is easy! It all comes down to how you learn. Give Scrapy a try!

Author: Python树苗 | Published 2018-05-30 11:21 | Read 30 times


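Scrapy is a powerful crawling framework and well worth studying properly. This article summarizes what I learned from studying and using Scrapy. The content is fairly basic, more a set of beginner's notes, and mainly covers Scrapy's core concepts and how to use it.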

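A few components deserve explanation. The Item Pipeline handles data cleaning, validation, persistent storage and similar work. Downloader Middlewares are hooks between the downloader and the engine, used to observe or modify outgoing requests and downloaded pages, for example to change request headers. Spider Middlewares are hooks between the spiders and the engine, used to process spider input and output, that is, the page response going in and the Items and requests the spider produces after parsing a page.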
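To make the hook idea concrete, here is a minimal sketch of an item pipeline and a downloader middleware. The class names, the item fields and the User-Agent string are made up for illustration; only the method names process_item and process_request and their arguments follow Scrapy's hook interface. The item is assumed to be a plain dict.

```python
class PriceCleaningPipeline:
    """Hypothetical item pipeline: cleans the fields of each scraped item."""

    def process_item(self, item, spider):
        # Scrapy calls this once for every item a spider yields.
        item["name"] = item["name"].strip()
        # Turn a price string such as "¥12.50" into a float.
        item["price"] = float(str(item["price"]).lstrip("¥$ "))
        return item  # hand the item on to the next pipeline


class CustomUserAgentMiddleware:
    """Hypothetical downloader middleware: edits request headers."""

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers["User-Agent"] = "demo-bot/0.1 (illustrative value)"
        return None  # None tells Scrapy to keep processing the request
```

Neither class does anything until it is enabled in the project's settings, so they can be written and tried out in isolation first.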

Items

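As for what Items are, I think of an Item as one unit of data produced when a spider parses a page, holding a group of related values. Suppose we are scraping product information from a site: each page may yield several products, and each product has a name, price, production date, style and so on. We can then define an Item for that group of fields.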

Install

with pip

pip install scrapy

or conda

conda install -c conda-forge scrapy

The basic commands are as follows:

D:\WorkSpace>scrapy --help
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

If you want to work in a virtual environment, install virtualenv:

pip install virtualenv

scrapy startproject

scrapy startproject <project_name> [project-dir]

This command generates a new Scrapy project. Taking demo as an example:

$ scrapy startproject demo
...
You can start your first spider with:
    cd demo
    scrapy genspider example example.com
$ cd demo
$ tree
.
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

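As you can see, startproject generates a number of folders and files automatically:

scrapy.cfg : the project configuration file; it usually does not need to be changed

items.py : the file where items are defined, for example a GoodsItem for the product data described above

middlewares.py : middleware code; by default it contains a downloader middleware and a spider middleware

pipelines.py : the item pipeline, used to process the items returned by spiders, including cleaning, validation and persistence

settings.py : the global settings file, containing the project-wide settings variables

spiders : the folder that holds all spider files; note that one project can contain several spiders

__init__.py : marks the containing folder as a Python package

__pycache__ : stores the .pyc files generated by the interpreter (a cross-platform bytecode); in Python 2 these files are kept in the same folder as the .py files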
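As an illustration of what settings.py holds, the excerpt below shows a few commonly edited settings. The values are illustrative assumptions, not this project's actual configuration, and DemoPipeline is assumed to be the name of the pipeline class in pipelines.py for a project called demo.

```python
# settings.py (excerpt, illustrative values)
BOT_NAME = "demo"

# Where Scrapy looks for spiders and where genspider puts new ones.
SPIDER_MODULES = ["demo.spiders"]
NEWSPIDER_MODULE = "demo.spiders"

ROBOTSTXT_OBEY = True  # respect the target site's robots.txt
DOWNLOAD_DELAY = 1     # seconds to wait between requests to the same site

# Enable a pipeline; lower numbers run earlier (conventionally 0 to 1000).
ITEM_PIPELINES = {
    "demo.pipelines.DemoPipeline": 300,
}
```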

scrapy genspider

Once the project exists, the scrapy genspider command can generate a spider file automatically. For example, to crawl the home page of Huaban (花瓣网), run:

$ cd demo
$ scrapy genspider huaban www.huaban.com

The spider file huaban.py generated by default looks like this:

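# -*- coding: utf-8 -*-
import scrapy

class HuabanSpider(scrapy.Spider):
    name = 'huaban'
    allowed_domains = ['www.huaban.com']
    start_urls = ['http://www.huaban.com/']

    def parse(self, response):
        pass

The generated class subclasses scrapy.Spider and sets three attributes: name (the spider's name, used later by scrapy crawl), allowed_domains (the domains the spider is allowed to crawl) and start_urls (the URLs the crawl starts from).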

To customize the starting URLs you can also override the start_requests method of scrapy.Spider; that is not covered in detail here.

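parse is the default callback: after the downloader has fetched a page, Scrapy calls it to parse the result, and response is the response data for that request. For parsing page content, Scrapy has several kinds of selector (Selector) built in, including XPath selectors, CSS selectors and regular expression matching. Below are some usage examples to give a more concrete feel for how selectors are used.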

# xpath selector
response.xpath('//a')
response.xpath('./img').extract()
response.xpath('//*[@id="huaban"]').extract_first()
response.xpath('//*[@id="Profile"]/div[1]/a[2]/text()').extract_first()

# css selector
response.css('a').extract()
response.css('#Profile > div.profile-basic').extract_first()
response.css('a[href="test.html"]::text').extract_first()

# re selector
response.xpath('.').re(r'id:\s*(\d+)')
response.xpath('//a/text()').re_first(r'username: \s(.*)')

Note that response cannot call re or re_first directly; they have to be called on the result of xpath or css, as in the last two examples.

scrapy crawl

Once the spider is written, you can start the crawl with the scrapy crawl command.

Inside the directory of a created Scrapy project, scrapy -h shows more help than it does outside a project, including scrapy crawl, the command that starts a crawl:

$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

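The help for scrapy crawl shows that the command takes many optional parameters but only one required one, spider: the name of the spider to run, which corresponds to each spider's name attribute.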

scrapy crawl huaban

That completes the walkthrough of creating and running a Scrapy crawl. Worked examples will follow in later blog posts.

scrapy shell

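Finally, a brief word on the scrapy shell command. It is an interactive shell, similar to the Python command-line interpreter. When you are just starting to learn Scrapy, or starting to crawl an unfamiliar site, you can use it to get familiar with the various functions and selectors, trying things out and correcting mistakes until you are comfortable with how Scrapy is used.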

$ scrapy shell www.huaban.com
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 17:26:49) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-05-29 23:58:49 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-05-29 23:58:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-29 23:58:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-29 23:58:50 [scrapy.core.engine] INFO: Spider opened
2018-05-29 23:58:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2018-05-29 23:58:50 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://huaban.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: view(response)
Out[1]: True
In [2]: response.xpath('//a')
Out[2]: [, ">, 添加采">, 添加画板, 安装采集工具, ]
In [3]: response.xpath('//a').extract()
Out[3]: ['', '', '添加采集', '添加画板', '安装采集工具', '']
In [4]: response.xpath('//img')
Out[4]: []
In [5]: response.xpath('//a/text()')
Out[5]: [, , , , ]
In [6]: response.xpath('//a/text()').extract()
Out[6]: ['添加采集', '添加画板', '安装采集工具', ' ', ' ']
In [7]: response.xpath('//a/text()').extract_first()
Out[7]: '添加采集'

