The end result looks like this:
![](https://img.haomeiwen.com/i5186857/50cad1e2dae9015c.png)
From left to right, the columns are the book's UPC code, name, category, stock, price, rating, number of reviews, and description.
The target site is http://books.toscrape.com/.
Before writing the spider, use scrapy shell to run a quick extraction experiment and work out the page structure:
scrapy shell http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
This drops you into the interactive shell:
![](https://img.haomeiwen.com/i5186857/904c0678da39c191.png)
In [2]: sel = response.css('div.col-sm-6.product_main')
In [4]: sel.xpath('./h1/text()').extract_first()
Out[4]: u'A Light in the Attic'  # name
In [5]: sel.css('p.price_color::text').extract_first()
Out[5]: u'\xa351.77'  # price (\xa3 is the pound sign)
In [30]: response.xpath('//*[@id="content_inner"]/article/p/text()').extract_first()
Out[30]: u"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more"
# description
In [31]: response.xpath('//*[@id="default"]/div/div/ul/li[3]/a/text()').extract_first()
Out[31]: u'Poetry'
# category
In [46]: table2 = response.css("table.table.table-striped")
In [48]: table2.xpath("(.//tr)[1]/td/text()").extract_first()
Out[48]: u'a897fe39b1053632'  # UPC
In [49]: table2.xpath("(.//tr)[last()-1]/td/text()").extract_first()
Out[49]: u'In stock (22 available)'  # stock
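The session above covers most of the columns in the screenshot; rating and number of reviews were not shown, but they live on the same page. Below is a minimal sketch that collects everything into one dict, runnable in the same shell. The `star-rating` class trick and the last-table-row selector for the review count are my assumptions about the markup; the rest simply repeats the calls above.

```python
sel = response.css('div.col-sm-6.product_main')
table = response.css('table.table.table-striped')

book = {
    'upc': table.xpath('(.//tr)[1]/td/text()').extract_first(),
    'name': sel.xpath('./h1/text()').extract_first(),
    'category': response.xpath(
        '//*[@id="default"]/div/div/ul/li[3]/a/text()').extract_first(),
    'stock': table.xpath('(.//tr)[last()-1]/td/text()').extract_first(),
    'price': sel.css('p.price_color::text').extract_first(),
    # Assumed: the rating is encoded in the class, e.g. "star-rating Three".
    'rating': sel.css('p.star-rating::attr(class)').re_first(r'star-rating (\w+)'),
    # Assumed: the last row of the product table is "Number of reviews".
    'review_num': table.xpath('(.//tr)[last()]/td/text()').extract_first(),
    'description': response.xpath(
        '//*[@id="content_inner"]/article/p/text()').extract_first(),
}
```

Printing `book` should reproduce, field by field, one row of the screenshot at the top.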
In [56]: from scrapy.linkextractors import LinkExtractor
In [57]: le = LinkExtractor(restrict_css='article.product_pod')
In [59]: le.extract_links(response)
Out[59]:
[Link(url='http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', text=u'', fragment='', nofollow=False),
Link(url='http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', text=u'', fragment='', nofollow=False),
Link(url='http://books.toscrape.com/catalogue/soumission_998/index.html', text=u'', fragment='', nofollow=False), ... (remaining links omitted)
Every book has its own detail page; we need to collect the links to all of the book pages and then parse each one, as sketched below.
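Collecting all of the book links also means paging through the listing pages. The same LinkExtractor technique should handle the "next" button; here is a sketch, assuming the pager is the usual `ul.pager` / `li.next` element (not verified in the session above):

```python
from scrapy.linkextractors import LinkExtractor

# Run in scrapy shell on a listing page such as http://books.toscrape.com/
next_le = LinkExtractor(restrict_css='ul.pager li.next')
next_le.extract_links(response)  # expected: a single Link to the next listing page
```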
First, create the project:
scrapy startproject toscrappe_book
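This produces the usual Scrapy project layout (reproduced from memory, so treat it as approximate); the files referenced in the design steps below, items.py, settings.py and pipelines.py, all live here:

```
toscrappe_book/
    scrapy.cfg
    toscrappe_book/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```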
Then generate the spider, restricting the allowed crawl domain to the target site:
scrapy genspider books books.toscrape.com
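genspider drops a skeleton into toscrappe_book/spiders/books.py that looks roughly like this (taken from the standard template, so minor differences are possible):

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
```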
Design outline:
1. Define the items according to the page analysis above (see the items sketch after this list).
2. Design the spider based on that analysis:
(1) the spider must extract the required fields from each book page (as in the shell sketch above);
(2) after finishing one page, it must follow the link to the next target page.
3. Configure the relevant options in settings.py.
4. Handle any special data in pipelines.py (see the pipeline sketch after this list).
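As an illustration of steps 1 and 4, here is a sketch under my own naming (not necessarily the code in the repo): the item simply declares one field per column in the screenshot, and a small pipeline turns the price string u'\xa351.77' into a plain number.

```python
# toscrappe_book/items.py -- one field per column in the screenshot
import scrapy


class BookItem(scrapy.Item):
    upc = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    stock = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    review_num = scrapy.Field()
    description = scrapy.Field()


# toscrappe_book/pipelines.py -- an example of "handling special data":
# strip the pound sign and convert the price to a float.
class PriceToFloatPipeline(object):
    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = float(item['price'].lstrip(u'\xa3'))
        return item
```

For the pipeline to run it also has to be registered in settings.py, e.g. ITEM_PIPELINES = {'toscrappe_book.pipelines.PriceToFloatPipeline': 300}, which is the kind of configuration step 3 refers to.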
The full code has been uploaded to git.