Continuing with a Practice Case
This session works through the case-study article below.
After copying and running its code, the result did not match expectations at all: the terminal scrolled through log output line by line, yet the generated output file was empty.
[scrapy.core.scraper] ERROR: Spider error processing <GET https://blog.csdn.net/u012150179/article/details/38230295> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/insight2026/CSDNBlog/CSDNBlog/spiders/CSDNBlog_spider.py", line 47, in parse
    item['article_name'] = article_name.encode("utf-8")
AttributeError: 'list' object has no attribute 'encode'
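The traceback already points at the root cause: in Scrapy, `Selector.xpath(...).extract()` returns a list of strings, and a list has no `encode` method. A minimal stand-alone reproduction of the bug and the fix (the list literal stands in for what `extract()` would return):

```python
# What sel.xpath(...).extract() yields: a LIST of matched strings.
article_name = ["Scrapy爬取CSDN博客"]  # stand-in for extract()'s result

# The original line fails, because lists have no .encode():
try:
    article_name.encode("utf-8")
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'encode'

# Fix: take the first match (guarding against an empty result),
# or keep the list and encode each element separately.
title = article_name[0] if article_name else ""
encoded = title.encode("utf-8")  # now a bytes object, as intended
```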
Checking the comments under that article turned up the following:
素笺鸣 2018-06-04 09:30:02 #9:
The blog's pagination has since changed: there is no longer a "next page" link, and the page numbers below the post list can't be scraped the usual way either. Is there another workaround?
qq_36256013 2017-12-05 15:13:41 #8:
It's all wrong.
weixin_35840855 2017-09-15 11:01:34 #7:
2017-09-15 10:49:56 [scrapy.core.scraper] ERROR: Spider error processing — I get the error above following your method and can't find the cause. What's going on? Thanks!
kxltsuperr 2017-01-19 23:25:03 #6:
githubgithubgithub
fabien_xia 2016-02-12 16:43:31 #5:
Writing the Pipeline the way you did, I keep getting AttributeError: 'CsdnblogPipeline' object has no attribute "file". Help!
All signs indicated the code as published could not run, so the only way forward was to debug it. Following the official documentation, I first used the Scrapy shell to check whether the extraction was correct, and it came back empty. Inspecting the page source of the target URL showed that the tags had changed, so the XPath expressions had to be updated; after the change, the expected data was extracted correctly.

The extracted data, however, still failed to be saved into the JSON file, with exactly the error reported in comment #5 above: "AttributeError: 'CsdnblogPipeline' object has no attribute 'file'". Some searching suggested this comes down to differences between Python 2 and Python 3: the current environment is Python 3, while the original post, judging from its date, was written for Python 2. Rewriting the pipeline following the official example code finally let the data be stored to file. The changes needed for a successful run are listed below:
Changes in the spider file: the XPaths for the article title and the next-article link, plus item assignment done the Python 3 way.
#article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
article_name = sel.xpath('//title').extract()
# urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
urls = sel.xpath('//div[@class="related-article related-article-next text-truncate"]/a/@href').extract()
#item['article_name'] = [n.encode('utf-8') for n in article_name]
item['article_name'] = article_name
# item['article_name'] = article_name.encode("utf-8")
item['article_url'] = article_url
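One caveat with the new XPath: `//title` without a trailing `/text()` extracts the serialized element, tags included, so the stored title will look like `<title>…</title>`. The stdlib's ElementTree illustrates the same distinction (parsel/lxml behave alike for this case); `//title/text()` would store just the string:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in document; the real page's <head> is more complex.
html = '<html><head><title>My Post - CSDN博客</title></head></html>'
root = ET.fromstring(html)
title_el = root.find('.//title')

serialized = ET.tostring(title_el, encoding='unicode')  # whole element, tags included
text_only = title_el.text                               # just the title string

print(serialized)  # <title>My Post - CSDN博客</title>
print(text_only)   # My Post - CSDN博客
```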
Changes in the pipelines file: how the output file is opened and how the incoming data is written.
# self.file = codecs.open('CSDNBlog_data.json', mode='wb', encoding='utf-8')
self.file = open('CSDNBlog_data.json', 'wb')
# self.file.write(line.decode("utf-8"))
self.file.write(line.encode("utf-8"))
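Put together, a minimal Python 3 pipeline along the lines of the official JsonWriterPipeline example might look like the sketch below (the class name and file name are the ones used above; the item is treated as a plain mapping, so the field layout is an assumption):

```python
import json

class CsdnblogPipeline:
    """Write each scraped item as one JSON line (Python 3 style)."""

    def open_spider(self, spider):
        # Open in binary mode and encode explicitly, matching the change above.
        self.file = open('CSDNBlog_data.json', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file.close()
```

Opening the file in `open_spider` (or in `__init__`), which Scrapy is guaranteed to call before `process_item`, is what avoids the "'CsdnblogPipeline' object has no attribute 'file'" error quoted in the comments: if the file is only opened in a method Scrapy never invokes, `process_item` touches `self.file` before it exists.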
Outstanding issue: in the JSON file produced by the pipeline, article titles are still stored as escape sequences rather than readable Chinese characters. No solution found yet; to be continued.
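One likely culprit worth trying (not yet verified against this exact pipeline, but it is the standard fix): `json.dumps` escapes non-ASCII characters by default, producing `\uXXXX` sequences in the file; passing `ensure_ascii=False` keeps the Chinese characters readable.

```python
import json

item = {'article_name': '继续实践案例'}

default = json.dumps(item)                       # non-ASCII escaped as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(default)   # {"article_name": "\u7ee7\u7eed\u5b9e\u8df5\u6848\u4f8b"}
print(readable)  # {"article_name": "继续实践案例"}
```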