The end result looks like this:
![](https://img.haomeiwen.com/i5186857/50cad1e2dae9015c.png)
From left to right, the columns are the book's UPC code, name, category, stock, price, rating, number of reviews, and description.
The target site is http://books.toscrape.com/.
Before writing the spider, use scrapy shell to run a quick extraction experiment and work out the page structure:
scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
This drops you into the interactive shell:
![](https://img.haomeiwen.com/i5186857/904c0678da39c191.png)
In [2]: sel = response.css('div.col-sm-6.product_main')
In [4]: sel.xpath('./h1/text()').extract_first()
Out[4]: u'A Light in the Attic'  # name
In [5]: sel.css('p.price_color::text').extract_first()
Out[5]: u'\xa351.77'  # price (\xa3 is the pound sign)
In [30]: response.xpath('//*[@id="content_inner"]/article/p/text()').extract_first()
Out[30]: u"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more"
# description
In [31]: response.xpath('//*[@id="default"]/div/div/ul/li[3]/a/text()').extract_first()
Out[31]: u'Poetry'
# category
In [46]: table2 = response.css("table.table.table-striped")
In [48]: table2.xpath("(.//tr)[1]/td/text()").extract_first()
Out[48]: u'a897fe39b1053632'  # UPC
In [49]: table2.xpath("(.//tr)[last()-1]/td/text()").extract_first()
Out[49]: u'In stock (22 available)'  # stock
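The session above covers most of the columns in the screenshot; rating and number of reviews were not shown, but they live on the same page. Below is a minimal sketch that collects everything into one dict, runnable in the same shell. The `star-rating` class trick and the last-table-row selector for the review count are my assumptions about the markup; the rest simply repeats the calls above.

```python
sel = response.css('div.col-sm-6.product_main')
table = response.css('table.table.table-striped')

book = {
    'upc': table.xpath('(.//tr)[1]/td/text()').extract_first(),
    'name': sel.xpath('./h1/text()').extract_first(),
    'category': response.xpath(
        '//*[@id="default"]/div/div/ul/li[3]/a/text()').extract_first(),
    'stock': table.xpath('(.//tr)[last()-1]/td/text()').extract_first(),
    'price': sel.css('p.price_color::text').extract_first(),
    # Assumed: the rating is encoded in the class, e.g. "star-rating Three".
    'rating': sel.css('p.star-rating::attr(class)').re_first(r'star-rating (\w+)'),
    # Assumed: the last row of the product table is "Number of reviews".
    'review_num': table.xpath('(.//tr)[last()]/td/text()').extract_first(),
    'description': response.xpath(
        '//*[@id="content_inner"]/article/p/text()').extract_first(),
}
```

Printing `book` should reproduce, field by field, one row of the screenshot at the top.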
In [56]: from scrapy.linkextractors import LinkExtractor
In [57]: le = LinkExtractor(restrict_css='article.product_pod')
In [59]: le.extract_links(response)
Out[59]:
[Link(url='http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', text=u'', fragment='', nofollow=False),
Link(url='http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', text=u'', fragment='', nofollow=False),
Link(url='http://books.toscrape.com/catalogue/soumission_998/index.html', text=u'', fragment='', nofollow=False), ... (remaining links omitted)
Every book has its own detail page; we need to collect the links to all of the book pages and then parse each one, as sketched below.
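Collecting all of the book links also means paging through the listing pages. The same LinkExtractor technique should handle the "next" button; here is a sketch, assuming the pager is the usual `ul.pager` / `li.next` element (not verified in the session above):

```python
from scrapy.linkextractors import LinkExtractor

# Run in scrapy shell on a listing page such as http://books.toscrape.com/
next_le = LinkExtractor(restrict_css='ul.pager li.next')
next_le.extract_links(response)  # expected: a single Link to the next listing page
```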
First, create the project:
scrapy startproject toscrappe_book
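This produces the usual Scrapy project layout (reproduced from memory, so treat it as approximate); the files referenced in the design steps below, items.py, settings.py and pipelines.py, all live here:

```
toscrappe_book/
    scrapy.cfg
    toscrappe_book/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```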
Then generate the spider, restricting the allowed crawl domain to the target site:
scrapy genspider books books.toscrape.com
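genspider drops a skeleton into toscrappe_book/spiders/books.py that looks roughly like this (taken from the standard template, so minor differences are possible):

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
```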
Design outline:
1. Define the items according to the page analysis above (see the items sketch after this list).
2. Design the spider based on that analysis:
(1) the spider must extract the required fields from each book page (as in the shell sketch above);
(2) after finishing one page, it must follow the link to the next target page.
3. Configure the relevant options in settings.py.
4. Handle any special data in pipelines.py (see the pipeline sketch after this list).
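As an illustration of steps 1 and 4, here is a sketch under my own naming (not necessarily the code in the repo): the item simply declares one field per column in the screenshot, and a small pipeline turns the price string u'\xa351.77' into a plain number.

```python
# toscrappe_book/items.py -- one field per column in the screenshot
import scrapy


class BookItem(scrapy.Item):
    upc = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    stock = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    review_num = scrapy.Field()
    description = scrapy.Field()


# toscrappe_book/pipelines.py -- an example of "handling special data":
# strip the pound sign and convert the price to a float.
class PriceToFloatPipeline(object):
    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = float(item['price'].lstrip(u'\xa3'))
        return item
```

For the pipeline to run it also has to be registered in settings.py, e.g. ITEM_PIPELINES = {'toscrappe_book.pipelines.PriceToFloatPipeline': 300}, which is the kind of configuration step 3 refers to.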
The full code has been uploaded to git.