1. Installing Scrapy
On Windows, in an Anaconda environment, type conda install scrapy in a command window, confirm with y, and wait for the installation to finish. Afterwards, run scrapy version; if a version number is printed, Scrapy is ready to use.
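In an Anaconda Prompt the two steps look roughly like this (output abbreviated; the version reported depends on your installation):

C:\Users\m1812>conda install scrapy
...
C:\Users\m1812>scrapy version
Scrapy 1.3.3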
2. Scrapy Commands
Running scrapy -h lists the available commands. The command line is covered in more detail later.
Scrapy 1.3.3 - project: quotetutorial
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
3. Creating a New Project
The site to crawl is http://quotes.toscrape.com/, which exists specifically for testing Scrapy.
Goal: extract each quote's text, author, and tags.
1. In a command window, use cd to move into the folder where you want to keep the project.
2. Run scrapy startproject <your project name>. Here the project is named quotetutorial, so the command is scrapy startproject quotetutorial.
The output ends with two hints, cd quotetutorial and scrapy genspider example example.com (that is, cd <your project folder>, then scrapy genspider <spider name> <target domain>). Follow those hints for the next steps.
C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\m1812\quotetutorial
You can start your first spider with:
cd quotetutorial
scrapy genspider example example.com
3. Run cd quotetutorial to move into the newly created project folder.
C:\Users\m1812>cd quotetutorial
4. Run scrapy genspider quotes quotes.toscrape.com, which generates a spider file named quotes.py with quotes.toscrape.com as its target domain.
C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
quotetutorial.spiders.quotes
Open the project in PyCharm; the generated structure looks like the layout below.
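For reference, a project created by startproject plus the genspider call above typically has this layout (reconstructed from Scrapy's standard project template, since the original screenshot is not reproduced here; the exact files vary slightly between Scrapy versions):

quotetutorial/
    scrapy.cfg            # deploy/configuration file
    quotetutorial/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            quotes.py     # the spider created by genspider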
4. A First Look at Scrapy
1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, printing response.text directly raises an encoding error in the Windows console, so print the raw bytes in response.body instead. parse is the spider's default callback: it runs once the response for each start URL has been downloaded, and response holds the result of that request.
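A minimal sketch of the modified spider, assuming the class generated by genspider above (only the print statement is added):

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # print the raw bytes to avoid console encoding issues with response.text
        print(response.body)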
2. In a command window, run
scrapy crawl quotes
to start the spider. Besides the page's HTML, Scrapy prints a lot of other log output.
C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n <link rel="stylesheet" href="/static/bo
otstrap.min.css">\n <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n <div class="container">\n <div class="row header-
box">\n <div class="col-md-8">\n <h1>\n <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n
</h1>\n </div>\n <div class="col-md-4">\n <p>\n \n <a href="/l
ogin">Login</a>\n \n </p>\n </div>\n </div>\n \n\n<div class="row">\n <div class="col-md-8">\n\
n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0The world as we have
created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1</span>\n <span>by <small class="author"
itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n
Tags:\n <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > \n \n
<a class="tag" href="/tag/change/page/1/">change</a>\n \n <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n
\n <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n \n <a class="tag" href="/tag/world/page/1/"
>world</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span cl
ass="text" itemprop="text">\xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1</span>\n <span>
by <small class="author" itemprop="author">J.K. Rowling</small>\n <a href="/author/J-K-Rowling">(about)</a>\n </span>\n <div cla
ss="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="abilities,choices" / > \n \n
<a class="tag" href="/tag/abilities/page/1/">abilities</a>\n \n <a class="tag" href="/tag/choices/page/1/">choices</a>\n
\n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop
="text">\xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1
\xb1</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\
n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="inspirational,life,l
ive,miracle,miracles" / > \n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n \n
<a class="tag" href="/tag/life/page/1/">life</a>\n \n <a class="tag" href="/tag/live/page/1/">live</a>\n \n
<a class="tag" href="/tag/miracle/page/1/">miracle</a>\n \n <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n
\n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemp
rop="text">\xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1</span>\n <sp
an>by <small class="author" itemprop="author">Jane Austen</small>\n <a href="/author/Jane-Austen">(about)</a>\n </span>\n <div c
lass="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > \n
\n <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n \n <a class="tag" href="/tag/books/page/1/">books</a
>\n \n <a class="tag" href="/tag/classic/page/1/">classic</a>\n \n <a class="tag" href="/tag/humor/page/1
/">humor</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span
class="text" itemprop="text">\xa1\xb0Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring
.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n <a href="/author/Marilyn-Monroe">(about)</
a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="be-yourself,inspi
rational" / > \n \n <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n \n <a class="tag"
href="/tag/inspirational/page/1/">inspirational</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://s
chema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb
1</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n
</span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="adulthood,success,value
" / > \n \n <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n \n <a class="tag" href="/tag/
success/page/1/">success</a>\n \n <a class="tag" href="/tag/value/page/1/">value</a>\n \n </div>\n </div>\
n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0It is better to be
hated for what you are than to be loved for what you are not.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Andr\xa8\xa6
Gide</small>\n <a href="/author/Andre-Gide">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta cla
ss="keywords" itemprop="keywords" content="life,love" / > \n \n <a class="tag" href="/tag/life/page/1/">life</a>\n
\n <a class="tag" href="/tag/love/page/1/">love</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemty
pe="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0I have not failed. I've just found 10,000 ways that won&
#39;t work.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n <a href="/author/Thomas-A-Edis
on">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="edis
on,failure,inspirational,paraphrased" / > \n \n <a class="tag" href="/tag/edison/page/1/">edison</a>\n \n
<a class="tag" href="/tag/failure/page/1/">failure</a>\n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational<
/a>\n \n <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n \n </div>\n </div>\n\n <div c
lass="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0A woman is like a tea bag; you
never know how strong it is until it's in hot water.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Eleanor Roosevelt</
small>\n <a href="/author/Eleanor-Roosevelt">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta cl
ass="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / > \n \n <a class="tag" href="/tag/misattribut
ed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemt
ype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0A day without sunshine is like, you know, night.\xa1\xb1</s
pan>\n <span>by <small class="author" itemprop="author">Steve Martin</small>\n <a href="/author/Steve-Martin">(about)</a>\n </sp
an>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > \n
\n <a class="tag" href="/tag/humor/page/1/">humor</a>\n \n <a class="tag" href="/tag/obvious/page/1/">obvi
ous</a>\n \n <a class="tag" href="/tag/simile/page/1/">simile</a>\n \n </div>\n </div>\n\n <nav>\n
<ul class="pager">\n \n \n <li class="next">\n <a href="/page/2/">Next <span aria-hidden="true">&r
arr;</span></a>\n </li>\n \n </ul>\n </nav>\n </div>\n <div class="col-md-4 tags-box">\n \n <
h2>Top Ten tags</h2>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 28px" href="/tag/love/">love</a
>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/inspirationa
l/">inspirational</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" hre
f="/tag/life/">life</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 24px" h
ref="/tag/humor/">humor</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 22p
x" href="/tag/books/">books</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size:
14px" href="/tag/reading/">reading</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="fo
nt-size: 10px" href="/tag/friendship/">friendship</a>\n </span>\n \n <span class="tag-item">\n <a class="
tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n </span>\n \n <span class="tag-item">\n <a
class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n </span>\n \n <span class="tag-item">\n
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n </span>\n \n \n </div>\n</div>\n\n </div>\n
<footer class="footer">\n <div class="container">\n <p class="text-muted">\n Quotes by: <a href="https://www.goo
dreads.com/quotes">GoodReads.com</a>\n </p>\n <p class="copyright">\n Made with <span class=\'sh-red\'>\x817\xc5
8</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n </p>\n </div>\n </footer>\n</body>\n</html>'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)
5. Crawling the Site
First, look at the HTML structure around the target information: each quote sits in a div with class "quote", which contains a span with class "text", a small tag with class "author", and a set of a tags with class "tag".
1. Edit items.py and declare the three fields to extract, following the template's format.
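Based on the three fields used by the spider below, items.py would look roughly like this (a sketch following the generated template):

# -*- coding: utf-8 -*-
import scrapy


class QuotetutorialItem(scrapy.Item):
    # the three pieces of information extracted from each quote
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()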
2. Edit quotes.py, add the extraction rules, and map them onto the fields configured in items.py in step 1.
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
This uses Scrapy's built-in CSS selectors.
In a command window, the shell command opens an interactive session for testing selectors: scrapy shell "quotes.toscrape.com" (note the double quotes). The session below shows what the CSS selectors above actually match, and how extract_first() differs from extract().
C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000002B3C7410748>
[s] item {}
[s] request <GET http://quotes.toscrape.com>
[s] response <200 http://quotes.toscrape.com>
[s] settings <scrapy.settings.Settings object at 0x000002B3C74D97B8>
[s] spider <DefaultSpider 'default' at 0x2b3c90e5ba8>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
In [1]: response
Out[1]: <200 http://quotes.toscrape.com>
In [2]: quotes = response.css('.quote')
In [3]: quotes
Out[3]:
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>]
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" i
temscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" i
temprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world a
s we have created it is a pr'>]
In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing ou
r thinking.”</span>']
In [9]: quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
In [11]: exit()
Now run the spider again with scrapy crawl quotes; the extracted items appear among the terminal output.
3. Single-page scraping works; next comes pagination. As you page through the site the URL changes, and the address of the next page can be read from the href attribute of the Next button. Modify the code in quotes.py as follows:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()  # behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next_page = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next_page)  # build the absolute URL
        yield scrapy.Request(url=url, callback=self.parse)  # schedule the next page, re-using parse as the callback
4. Saving the Data
On the command line, scrapy crawl quotes -o quotes.json saves the items as JSON. The other feed formats work the same way, with the format inferred from the file extension:
JSON lines: scrapy crawl quotes -o quotes.jl
CSV: scrapy crawl quotes -o quotes.csv
XML: scrapy crawl quotes -o quotes.xml
Pickle: scrapy crawl quotes -o quotes.pickle
Marshal: scrapy crawl quotes -o quotes.marshal
5. Filtering Out Unwanted Items and Saving to a Database
This is done by editing pipelines.py.
The code below caps the quote text at 50 characters, appending ... to anything longer, and also defines a pipeline that writes items to MongoDB.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


# Cap the text at 50 characters; anything longer is truncated and an ellipsis is appended
class QuotetutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    # return the (possibly truncated) item, or raise DropItem to discard it
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # use the item class name as the collection name
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()
The relevant settings in settings.py also need to be updated so that both pipelines are enabled and the MongoDB connection parameters are defined.
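A sketch of the additions to settings.py, assuming the two pipeline classes above (the priority numbers and the database name are just illustrative):

# enable both pipelines; lower numbers run first
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

# connection parameters read by MongoPipeline.from_crawler
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'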
Re-run scrapy crawl quotes from the command line, and the scraped data shows up in MongoDB as well.
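One quick way to check, assuming the illustrative MONGO_URI and MONGO_DB values above (MongoPipeline stores items in a collection named after the item class):

import pymongo

# connect with the same settings used by the pipeline
client = pymongo.MongoClient('localhost')
db = client['quotetutorial']

# print a few stored documents from the QuotetutorialItem collection
for doc in db['QuotetutorialItem'].find().limit(3):
    print(doc)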
Adapted from Cui Qingcai's Scrapy tutorial.