Python Learning - Scrapy 7

Author: ericblue | Published 2018-08-14 11:30

    Continuing to work through the case-study article:

    Scrapy Research and Exploration (6) - Automatic Page Crawling, Part II (CrawlSpider)

    After updating the working code from the previous post to follow the approach in that article, I kept getting AttributeError: 'str' object has no attribute 'iter', as shown below:

    [scrapy.core.scraper] ERROR: Spider error processing <GET https://blog.csdn.net/u012150179/article/details/11749017> (referer: None)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
        yield next(it)
      File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
        for x in result:
      File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/usr/local/lib/python3.6/site-packages/scrapy/spiders/crawl.py", line 82, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/usr/local/lib/python3.6/site-packages/scrapy/spiders/crawl.py", line 61, in _requests_to_follow
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
      File "/usr/local/lib/python3.6/site-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
        links = self._extract_links(doc, response.url, response.encoding, base_url)
      File "/usr/local/lib/python3.6/site-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
        return self.link_extractor._extract_links(*args, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/scrapy/linkextractors/lxmlhtml.py", line 58, in _extract_links
        for el, attr, attr_val in self._iter_links(selector.root):
      File "/usr/local/lib/python3.6/site-packages/scrapy/linkextractors/lxmlhtml.py", line 46, in _iter_links
        for el in document.iter(etree.Element):
    AttributeError: 'str' object has no attribute 'iter'
    [scrapy.core.engine] INFO: Closing spider (finished)
    [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 494,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 13455,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 8, 13, 14, 39, 12, 567429),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'memusage/max': 46272512,
     'memusage/startup': 46272512,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'spider_exceptions/AttributeError': 1,
     'start_time': datetime.datetime(2018, 8, 13, 14, 39, 9, 244154)}
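
    Reading the traceback bottom-up, the crash happens in LxmlLinkExtractor's _iter_links(selector.root), which means selector.root is a plain str rather than an lxml element. A minimal sketch of how an XPath can produce a string-rooted selector (my own illustration, not from the referenced article):

    # Assumed example: an XPath that selects an attribute node yields a
    # Selector whose .root is a plain str, while selecting the element
    # itself yields an lxml element that does have .iter().
    from scrapy.selector import Selector

    html = '<a href="https://example.com/post">next</a>'

    element = Selector(text=html).xpath('//a')[0]
    attribute = Selector(text=html).xpath('//a/@href')[0]

    print(type(element.root))    # an lxml element -- has .iter()
    print(type(attribute.root))  # <class 'str'> -- no .iter(), hence the crash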
    

    After a lot of searching and troubleshooting I was still unable to resolve the problem, so I am recording it here and pasting the modified source below for future reference.

    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider, Rule

    # Items module path assumed; adjust to the actual project layout.
    from csdnblogcrawlspider.items import CsdnblogcrawlspiderItem


    class CSDNBlogCrawlSpider(CrawlSpider):
        """A spider inheriting from CrawlSpider to crawl pages automatically."""

        name = 'CSDNBlogCrawlSpider'

        download_delay = 2
        allowed_domains = ['blog.csdn.net']
        start_urls = [
            'http://blog.csdn.net/u012150179/article/details/11749017',
        ]

        # NOTE: this restrict_xpaths expression ends in /@href, so it selects
        # the href attribute (a plain string) rather than the <a> element;
        # this is the line the AttributeError above traces back to.
        rules = [
            Rule(LxmlLinkExtractor(allow=('/u012150179/article/details',),
                                   restrict_xpaths=('//div[@class="related-article related-article-next text-truncate"]/a/@href',)),
                 callback='parse_item',
                 follow=True),
        ]

        def parse_item(self, response):
            """Extract the blog title and URL from each crawled page."""
            item = CsdnblogcrawlspiderItem()
            sel = Selector(response)
            item['blog_name'] = sel.xpath('//title').extract()
            item['blog_url'] = str(response.url)
            yield item
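
    For when I revisit this: per the Scrapy documentation, restrict_xpaths is supposed to select regions (elements) of the response from which links are then extracted, not attribute values. Dropping the trailing /@href so the XPath selects the <a> element itself is therefore the likely fix; a sketch, not yet verified against this project:

    # Likely fix (untested sketch): point restrict_xpaths at the <a> element
    # and let the link extractor read the href attributes itself.
    rules = [
        Rule(LxmlLinkExtractor(allow=('/u012150179/article/details',),
                               restrict_xpaths=('//div[@class="related-article related-article-next text-truncate"]/a',)),
             callback='parse_item',
             follow=True),
    ]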
    
    
