Scraping v2ex.com with Scrapy
Author: dpkBat | Published 2017-05-19 16:52
    • Scrapy
    • Unicode to UTF-8 encoding conversion

    1. Installing Scrapy

    conda install scrapy
    

    Verify that the installation succeeded:

    scrapy version
    
    If a version number is printed, the installation succeeded.

    2. Using the scrapy shell

    • Usage (settings such as ROBOTSTXT_OBEY and USER_AGENT can be overridden with -s)
    scrapy shell -s ROBOTSTXT_OBEY=False "http://mp.weixin.qq.com/s?__biz=MjM5MTI0NjQ0MA==&mid=402001834&idx=1&sn=fbe58fd99b6a1b64e6764a436964ba4a&scene=21#wechat_redirect"
    scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36' "http://www.jianshu.com/trending/weekly?utm_medium=index-banner-s&utm_source=desktop&page=5"
    
    • Test whether CSS/XPath expressions are correct
    response.xpath('//*[(@id ="TopicsNode")]//td[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]')
    topic.css('a::attr("href")').extract_first() 
    
    • Check the fetched page content (opens the response in a local browser)
    view(response)
    
    • Get the response status code
    response.status
    

    3. Crawling v2ex.com

    • URL structure of the node's listing pages
    url = 'https://www.v2ex.com/go/python?p={}'.format(page_number)
    

    4. v2ex spider code

    v2exSpider
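
    The spider itself was only linked in the original post. As a rough sketch of what such a spider could look like, using the URL pattern from section 3 — the class name, CSS selectors, and field names below are assumptions for illustration, not the author's original code:

    ```python
    import scrapy


    class V2exSpider(scrapy.Spider):
        """Minimal sketch of a spider for the v2ex.com Python node."""
        name = "v2ex"
        # Page URLs follow the pattern shown in section 3
        start_urls = ["https://www.v2ex.com/go/python?p={}".format(p)
                      for p in range(1, 4)]

        def parse(self, response):
            # The selector below is an assumption about v2ex's markup;
            # verify it in the scrapy shell first (section 2)
            for topic in response.css("span.item_title > a"):
                yield {
                    "title": topic.css("::text").extract_first(),
                    "url": response.urljoin(topic.attrib.get("href", "")),
                }
    ```

    Run it with `scrapy crawl v2ex` inside the project directory.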

    5. Unicode to UTF-8 encoding conversion

    By default, Scrapy escapes non-ASCII output as Unicode escape sequences. Modify pipelines.py to write the items as UTF-8 instead:

    import codecs
    import json
    
    class JsonWriterPipeline(object):
    
        def open_spider(self, spider):
            self.file = codecs.open('jianshu_data_utf-8.json', 'w', encoding='utf-8')
    
        def close_spider(self, spider):
            self.file.close()
    
        def process_item(self, item, spider):
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item
    
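
    The `ensure_ascii=False` argument in `process_item` is what prevents Chinese characters from being written as `\uXXXX` escape sequences. A quick standalone comparison:

    ```python
    import json

    item = {"title": "Scrapy抓取v2ex.com"}

    # Default behaviour: non-ASCII characters are escaped
    print(json.dumps(item))
    # {"title": "Scrapy\u6293\u53d6v2ex.com"}

    # ensure_ascii=False keeps the original UTF-8 characters
    print(json.dumps(item, ensure_ascii=False))
    # {"title": "Scrapy抓取v2ex.com"}
    ```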

    After making this change, enable the Item Pipeline component by adding its class path to ITEM_PIPELINES in settings.py.

    ITEM_PIPELINES = {
        'myproject.pipelines.JsonWriterPipeline': 800,
    }
    

Original link: https://www.haomeiwen.com/subject/wsiyxxtx.html