Recommender Systems 1: Creating a Simple Spider with Scrapy

Author: 崔业康 | Published 2018-06-19 17:29

    Create the project

    Change into the directory where the project should live.
    Create the project by running: scrapy startproject zhihuscrapy
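
    This generates the standard Scrapy project skeleton (the exact file list varies slightly across Scrapy versions):

    zhihuscrapy/
        scrapy.cfg
        zhihuscrapy/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py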

    Create the spider

    Create the file zhihu_spider.py under the spiders directory,
    with the following contents:

    import scrapy

    class ZhihuSpider(scrapy.Spider):
        name = "zhihu"
        allowed_domains = ["zhihu.com"]
        start_urls = [
            "https://zhuanlan.zhihu.com/p/38198729",
            "https://zhuanlan.zhihu.com/p/38235624"
        ]

        def parse(self, response):
            # Each page has a single <head>; all three selectors below read
            # the <title> text for now (placeholders for real per-field selectors)
            for sel in response.xpath('//head'):
                title = sel.xpath('title/text()').extract()
                link = sel.xpath('title/text()').extract()
                desc = sel.xpath('title/text()').extract()
                print(title, link, desc)  # Python 3 print() call
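
    The selectors can be tested interactively before running the spider, using Scrapy's shell (URL taken from start_urls above):

    scrapy shell "https://zhuanlan.zhihu.com/p/38198729"
    >>> response.xpath('//head/title/text()').extract()
    # returns a list containing the page's <title> text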
    
    

    Set the request headers

    Add the following to settings.py:

    # Request header
    USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    # Do not obey robots.txt
    ROBOTSTXT_OBEY = False
    # Disable cookie tracking
    COOKIES_ENABLED = False
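
    These settings apply project-wide. If preferred, the same overrides can be scoped to this one spider through Scrapy's custom_settings class attribute; a minimal sketch:

    class ZhihuSpider(scrapy.Spider):
        name = "zhihu"
        # Per-spider overrides of the settings shown above
        custom_settings = {
            "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
            "ROBOTSTXT_OBEY": False,
            "COOKIES_ENABLED": False,
        }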
    

    Start the crawl

    Back in the project directory, run:

    scrapy crawl zhihu

    Improve the code

    The improved spider follows the user links on each start page and yields structured items instead of printing.
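
    It imports ZhihuscrapyItem from zhihuscrapy/items.py. The post does not show that file; a minimal sketch consistent with the three fields used below:

    import scrapy

    class ZhihuscrapyItem(scrapy.Item):
        # One field per value filled in by parse_dir_contents
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()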

    import scrapy

    from zhihuscrapy.items import ZhihuscrapyItem

    class ZhihuSpider(scrapy.Spider):
        name = "zhihu"
        allowed_domains = ["zhihu.com"]
        start_urls = [
            "https://zhuanlan.zhihu.com/p/38198729",
            "https://zhuanlan.zhihu.com/p/38235624"
        ]

        def parse(self, response):
            # UserLink-link is a CSS class, so it needs a leading dot
            for href in response.css(".UserLink-link > a::attr(href)"):
                # urljoin resolves relative hrefs against response.url
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//head'):
                item = ZhihuscrapyItem()
                # As above, all three fields read the <title> text for now
                item['title'] = sel.xpath('title/text()').extract()
                item['link'] = sel.xpath('title/text()').extract()
                item['desc'] = sel.xpath('title/text()').extract()
                yield item
    

    Run and export the results

    scrapy crawl zhihu -o items.json
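
    With a .json extension, -o serializes the yielded items as a JSON array. A quick way to inspect the export afterwards:

    import json

    with open("items.json", encoding="utf-8") as f:
        items = json.load(f)

    for item in items:
        # Each field holds a list, because the spider stored extract() results
        print(item["title"])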

