36. Scrapy in Practice: Jianshu Hot Collections to CSV

Author: 橄榄的世界 | Published 2018-03-01 09:08

    Target URL: https://www.jianshu.com/recommendations/collections?order_by=hot
    Data to collect: collection name, collection description, number of articles included, number of followers
    Crawling tool: the Scrapy framework
    Storage format: CSV file


    Open the browser developer tools (F12) and watch the dynamically loaded URLs; the collection list is paginated across 37 pages:

    https://www.jianshu.com/recommendations/collections?page=1&order_by=hot
    https://www.jianshu.com/recommendations/collections?page=2&order_by=hot
    https://www.jianshu.com/recommendations/collections?page=3&order_by=hot
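
    Since these URLs differ only in the page parameter, the full list can be generated with a list comprehension; a quick sketch:

    base = 'https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'
    urls = [base.format(i) for i in range(1, 38)]   # pages 1..37
    print(len(urls))                                # 37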
    

    1. items.py

    import scrapy
    
    class ZhuantiItem(scrapy.Item):
        # define the fields for your item here:
        name = scrapy.Field()        # collection name
        content = scrapy.Field()     # collection description
        article = scrapy.Field()     # number of articles included
        fans = scrapy.Field()        # number of followers
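
    A ZhuantiItem behaves like a dict that only accepts the declared fields, which catches field-name typos early; for example:

    item = ZhuantiItem()
    item['name'] = 'Python'       # fine: 'name' is a declared field
    # item['title'] = '...'       # would raise KeyError: 'title' is not declared
    print(dict(item))             # {'name': 'Python'}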
    

    2. zhuantispider.py

    from scrapy.spiders import CrawlSpider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from zhuanti.items import ZhuantiItem
    
    class ZhuantiSpider(CrawlSpider):
    
        name = "zhuanti"
        start_urls = ["https://www.jianshu.com/recommendations/collections?page=1&order_by=hot"]
    
        def parse(self, response):
            selector = Selector(response)
            infos = selector.xpath('//div[@class="collection-wrap"]')
            for info in infos:
                try:
                    name = info.xpath('a/h4/text()').extract()[0]
                    content = info.xpath('a/p/text()').extract()[0].replace('\n', '')
                    article = info.xpath('div[@class="count"]/a/text()').extract()[0]
                    fans = info.xpath('div[@class="count"]/text()').extract()[0].strip('· ')
    
                    item = ZhuantiItem()    # create a fresh item for each collection
                    item['name'] = name
                    item['content'] = content
                    item['article'] = article
                    item['fans'] = fans
                    yield item
    
                except IndexError:
                    pass
    
            # Build the 'hot collections' URLs for pages 2 through 37 and request each one,
            # reusing parse() as the callback; range(2, 38) covers pages 2..37, and Scrapy's
            # duplicate filter drops the repeats yielded by later pages.
            urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i) for i in range(2, 38)]
            for url in urls:
                yield Request(url, callback=self.parse)
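
    Since every page shares the same layout, an equivalent design is to list all 37 URLs in start_urls up front and drop the Request loop at the end of parse(); a minimal sketch:

    class ZhuantiSpider(CrawlSpider):
        name = "zhuanti"
        start_urls = [
            'https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i)
            for i in range(1, 38)
        ]
        # parse() is unchanged, minus the trailing Request loop.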
    

    3. settings.py: this example uses Scrapy's built-in Feed exports for storage, so no pipelines.py is needed to process and save the data.

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'     # request header
    DOWNLOAD_DELAY = 0.5                 # wait 0.5 seconds between requests
    FEED_URI = 'file:F:/zhuanti.csv'     # output file location
    FEED_FORMAT = 'csv'                  # export as a CSV file
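
    FEED_URI and FEED_FORMAT work in Scrapy releases of this era; newer versions (2.1+) deprecate them in favour of a single FEEDS dictionary. A sketch of the equivalent configuration, assuming the same output path:

    FEEDS = {
        'file:///F:/zhuanti.csv': {'format': 'csv'},
    }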
    

    4. main.py

    from scrapy import cmdline
    cmdline.execute("scrapy crawl zhuanti".split())
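
    This simply invokes the scrapy command programmatically, so running python main.py from the project root is equivalent to typing scrapy crawl zhuanti in a terminal there. The output file could also be set on the command line instead of in settings.py, via the -o option:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl zhuanti -o F:/zhuanti.csv".split())   # -o infers the CSV format from the extension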
    

    Run main.py to start the crawl; the results can then be inspected by opening zhuanti.csv in Notepad.



    If you want to open the file in Excel, first open it in Notepad++, choose "Encode in UTF-8" from the Encoding menu, and save the file again.
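
    The manual re-encoding step can be avoided by telling Scrapy to write a UTF-8 byte-order mark itself, which Excel recognises; add one line to settings.py:

    FEED_EXPORT_ENCODING = 'utf-8-sig'      # UTF-8 with BOM, so Excel detects the encoding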


