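Goal: scrape book information from the allitbooks site, in particular each book's title and direct download link, and store it in Cassandra or Scylla.
GitHub repository: https://github.com/baiwfg2/scrapy-examples/tree/master/allitbooks
Get the names of all top-level categories: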
response.css('div ul#menu-categories li a::text').extract()
![](https://img.haomeiwen.com/i1071648/4ae4d9e79125f06c.png)
Get the URLs of all top-level categories:
response.css('div ul#menu-categories li a::attr(href)').extract()
![](https://img.haomeiwen.com/i1071648/2d4ac3d6edf3fc0c.png)
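The two selectors above return parallel lists, one of names and one of URLs. A minimal sketch of pairing them up, assuming both lists come back in the same order and have the same length; `pair_categories` is a hypothetical helper name, not something from the repository:

```python
def pair_categories(names, urls):
    """Map each category name to its URL.

    names and urls are assumed to be the lists returned by the two
    menu-categories selectors, in matching order.
    """
    if len(names) != len(urls):
        # A mismatch suggests the page layout differs from what is assumed here.
        raise ValueError("expected one URL per category name")
    # Strip stray whitespace around the link text before using it as a key.
    return {name.strip(): url for name, url in zip(names, urls)}
```

In a spider callback this could be called as `pair_categories(response.css('div ul#menu-categories li a::text').extract(), response.css('div ul#menu-categories li a::attr(href)').extract())`.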
Get the total number of pages under the database category:
response.css('div.pagination a::text').extract()[-1]
![](https://img.haomeiwen.com/i1071648/948e8c5bbb2ceaec.png)
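Once the last page number is known, every listing page of a category can be enumerated. The sketch below assumes that the last pagination link text is the highest page number and that pages follow the `<category>/page/<n>/` pattern seen in `database/page/3`; `category_page_urls` is a hypothetical helper, and the fallback to a single page is my own choice:

```python
def category_page_urls(category_url, pagination_texts):
    """Build the URL of every listing page in one category.

    pagination_texts is assumed to be the list returned by
    response.css('div.pagination a::text').extract().
    """
    # Take the last link text; treat anything non-numeric as "one page only".
    last = pagination_texts[-1].strip() if pagination_texts else ""
    total = int(last) if last.isdigit() else 1
    base = category_url.rstrip("/")
    # Page 1 is the category URL itself; later pages live under /page/<n>/.
    return [base + "/"] + [f"{base}/page/{n}/" for n in range(2, total + 1)]
```

Each returned URL could then be passed to `scrapy.Request` with the callback that extracts the book links.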
Get all book links on database/page/3:
response.css('h2.entry-title a::attr(href)').extract()
![](https://img.haomeiwen.com/i1071648/de9701758f00480f.png)
Get the author(s) of a single book; there may be more than one:
response.css('div.book-detail dl').xpath('.//dt[text()="Author:"]/following-sibling::dd')[0].css('a::text').extract()
![](https://img.haomeiwen.com/i1071648/4384aff7561a6b0e.png)
Result:
![](https://img.haomeiwen.com/i1071648/8c3f814ea621bca1.png)
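Unfortunately, only 143 records were scraped. I will diagnose the cause later.
The primary key used for lookups, name, is too long to type in full, so some form of fuzzy search is needed!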
![](https://img.haomeiwen.com/i1071648/33abe0cb8564596b.png)
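One possible workaround for the long `name` key, sketched under assumptions: with only 143 rows it is cheap to fetch everything and filter on the client side instead of querying by the exact key. This is plain Python substring matching, not a database feature, and the `(name, link)` row shape and the `find_books` helper are hypothetical:

```python
def find_books(rows, keyword):
    """Return the rows whose name contains keyword, ignoring case.

    rows is assumed to be an iterable of (name, link) pairs already
    fetched from the table.
    """
    needle = keyword.casefold()
    # Keep every row whose title contains the keyword anywhere.
    return [(name, link) for name, link in rows if needle in name.casefold()]
```

This does not scale to a large table, where a server-side index or a separate search field would be a better fit.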