12. Scrapy in Practice: Crawling the Entire Jianshu Site

By MononokeHime | Published 2018-06-14 12:56

    In this installment we use Scrapy to crawl the whole Jianshu site. On an article detail page, much of the content is loaded asynchronously via Ajax, so the response returned by an ordinary request does not contain the data we want, such as the comment count or like count. To fetch this dynamic data we use selenium + chromedriver.

    1. Go to the Taobao mirror https://npm.taobao.org/mirrors/chromedriver and download the chromedriver that matches your Chrome version. Unzip it and put chromedriver.exe into Chrome's installation directory.
    2. Install selenium: pip install selenium
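
    To confirm that chromedriver and selenium can talk to each other, here is a quick smoke test (a sketch, not part of the original; the executable_path matches the location used later in the middleware, so adjust it to wherever you put chromedriver.exe):

    from selenium import webdriver

    # assumed path: the same one the downloader middleware uses below
    driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
    driver.get("https://www.jianshu.com/")
    print(driver.title)  # should print the Jianshu homepage title
    driver.quit()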

    Database design

    Before designing the crawler we need to know what to collect; the database fields are listed below. The id column is the primary key and is set to auto-increment.

    (Screenshot of the table fields omitted.)
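
    Since the screenshot is not reproduced here, the following is a sketch of what the js table might look like, reconstructed from the fields the pipeline inserts further down; the column types and lengths are assumptions, not taken from the original.

    import pymysql

    # Assumed schema: the column names come from items.py and the pipeline's INSERT,
    # but the types and lengths are guesses.
    create_sql = """
    CREATE TABLE IF NOT EXISTS js (
        id INT PRIMARY KEY AUTO_INCREMENT,
        title VARCHAR(255),
        content LONGTEXT,
        author VARCHAR(255),
        avatar VARCHAR(255),
        pub_time VARCHAR(64),
        origin_url VARCHAR(255),
        article_id VARCHAR(32),
        read_count INT,
        like_count INT,
        word_count INT,
        subjects VARCHAR(255),
        comment_count INT
    ) DEFAULT CHARSET=utf8;
    """

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           password='123456', database='jianshu', charset='utf8')
    with conn.cursor() as cursor:
        cursor.execute(create_sql)
    conn.commit()
    conn.close()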

    The crawler's execution flow

    • First, the request URLs to crawl are taken from start_urls
    • The request passes through the downloader middleware, where selenium loads the dynamic data
    • The processed response is handed back to the spider, which extracts the data
    • The extracted items are passed to the pipeline for storage

    Creating the crawler project

    1. Activate the virtual environment and create the project:
    scrapy startproject jianshu
    2. cd into the project directory (jianshu) and generate the spider. Since this is a whole-site crawl, we use the crawl template so we can take advantage of its Rules:
    scrapy genspider -t crawl js jianshu.com
    A js.py file now appears under the spiders folder; the resulting layout is shown below.
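
    After these two commands the project should look roughly like this (the standard structure Scrapy generates; __init__.py files omitted):

    jianshu/
    ├── scrapy.cfg
    └── jianshu/
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── js.py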

    Defining the fields in items.py

    import scrapy
    
    class JianshuItem(scrapy.Item):
        title = scrapy.Field()
        content = scrapy.Field()
        article_id = scrapy.Field()
        origin_url = scrapy.Field()
        author = scrapy.Field()
        avatar = scrapy.Field()
        pub_time = scrapy.Field()
        read_count = scrapy.Field()
        like_count = scrapy.Field()
        word_count = scrapy.Field()
        subjects = scrapy.Field()
        comment_count = scrapy.Field()
    

    Defining the downloader middleware

    Every request and response passes through the downloader middleware, so this is where we plug selenium into Scrapy. After defining the middleware, be sure to enable it in settings.py.

    # middlewares.py
    from scrapy import signals
    from selenium import webdriver
    import time
    from scrapy.http.response.html import HtmlResponse
    
    class SeleniumDownloadMiddleware(object):
        def __init__(self):
            self.driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
    
        def process_request(self, request, spider):
            self.driver.get(request.url)
            time.sleep(2)
            try:
                # Keep clicking the "show more" button until it disappears;
                # find_element_by_class_name then raises NoSuchElementException,
                # which is what breaks us out of the loop.
                while True:
                    showMore = self.driver.find_element_by_class_name('show-more')
                    showMore.click()
                    time.sleep(0.5)
            except:
                pass
            source = self.driver.page_source
            # Returning an HtmlResponse from process_request short-circuits the
            # download: Scrapy hands this response straight to the spider.
            response = HtmlResponse(url=self.driver.current_url, body=source,
                                    request=request, encoding='utf-8')
            return response
    
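    One thing the middleware above does not do is quit Chrome when the crawl finishes. A minimal sketch of how that could be added (an addition, not part of the original) using Scrapy's spider_closed signal:

    from scrapy import signals

    class SeleniumDownloadMiddleware(object):
        # ... __init__ and process_request as above ...

        @classmethod
        def from_crawler(cls, crawler):
            middleware = cls()
            crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
            return middleware

        def spider_closed(self, spider):
            # quit the browser once the spider is done
            self.driver.quit()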
    

    Designing the spider

    Article pages on Jianshu follow a fixed URL pattern:
    https://www.jianshu.com/p/d65909d2173a
    that is, the domain followed by a 12-character article id, which lets us express the crawl as a Rule using the crawl template.
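
    A quick way to sanity-check the allow pattern used in the Rule below (the sample URLs are only illustrative):

    import re

    pattern = re.compile(r'.*/p/[0-9a-z]{12}.*')
    print(bool(pattern.match('https://www.jianshu.com/p/d65909d2173a')))  # True: article page
    print(bool(pattern.match('https://www.jianshu.com/u/abcdef123456')))  # False: no /p/ segment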

    # js.py
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from jianshu.items import JianshuItem
    
    
    class JsSpider(CrawlSpider):
        name = 'js'
        allowed_domains = ['jianshu.com']
        start_urls = ['http://jianshu.com/']
    
        rules = (
            Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
        )
    
        def parse_detail(self, response):
            title = response.xpath("//h1[@class='title']/text()").get()
            avatar = response.xpath("//a[@class='avatar']/img/@src").get()
            author = response.xpath("//span[@class='name']/a/text()").get()
            pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*","")
            url = response.url
            url1 = url.split("?")[0]
            article_id = url1.split('/')[-1]
            content = response.xpath("//div[@class='show-content']").get()
            word_count_list = response.xpath("//span[@class='wordage']/text()").get().split(' ')  # e.g. "字数 10000"
            word_count = int(word_count_list[-1])
            comment_count_list = response.xpath("//span[@class='comments-count']/text()").get().split(' ')  # e.g. "评论 427"
            comment_count = int(comment_count_list[-1])
            read_count_list = response.xpath("//span[@class='views-count']/text()").get().split(' ')  # e.g. "阅读 427"
            read_count = int(read_count_list[-1])
            like_count_list = response.xpath("//span[@class='likes-count']/text()").get().split(' ')  # e.g. "喜欢 3"
            like_count = int(like_count_list[-1])
            subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())
    
            item = JianshuItem(
                title=title,
                avatar=avatar,
                author=author,
                pub_time=pub_time,
                origin_url=response.url,
                article_id=article_id,
                content=content,
                subjects=subjects,
                word_count=word_count,
                comment_count=comment_count,
                read_count=read_count,
                like_count=like_count,
            )
            print('y' * 100)  # debug marker printed for every scraped item
            yield item
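
    The count fields above assume every span exists; if a page lacks one of them, .get() returns None and the split raises an exception. A small helper one could add (a sketch, not in the original) that falls back to 0 instead:

    def extract_count(response, class_name):
        """Return the trailing number from spans like '字数 10000'; 0 if the span is missing."""
        text = response.xpath("//span[@class='%s']/text()" % class_name).get()
        if not text:
            return 0
        last = text.split(' ')[-1]
        return int(last) if last.isdigit() else 0

    # usage inside parse_detail: word_count = extract_count(response, 'wordage')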
    

    Designing the pipeline

    Here the items yielded by the spider (js.py) are saved to the database. Note that the database operations in the code below are synchronous.

    import pymysql
    
    class JianshuPipeline(object):
        def __init__(self):
            dbparams = {
                'host':'127.0.0.1',
                'port':3306,
                'user':'root',
                'password':'123456',
                'database':'jianshu',
                'charset':'utf8'
            }
            self.conn = pymysql.connect(**dbparams)
            self.cursor = self.conn.cursor()
            self._sql = None
    
        def process_item(self,item,spider):
            self.cursor.execute(self.sql,(item['title'],item['content'],
                                          item['author'],item['avatar'],
                                          item['pub_time'],item['origin_url'],
                                          item['article_id'],item['read_count'],
                                          item['like_count'],item['word_count'],
                                          item['subjects'],item['comment_count']))
            self.conn.commit()
            return item
    
        @property
        def sql(self):
            if not self._sql:
                self._sql = """
                insert into js(id,title,content,author,avatar,pub_time,
                origin_url,article_id,read_count,like_count,word_count,subjects,comment_count)
                values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                """
            return self._sql
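
    The pipeline above blocks on every INSERT. Since the text notes the operations are synchronous, here is a sketch of an asynchronous variant built on Twisted's adbapi connection pool (Scrapy itself runs on Twisted); it reuses the same connection settings and table, and is an illustration rather than part of the original:

    import pymysql
    from twisted.enterprise import adbapi

    class JianshuTwistedPipeline(object):
        def __init__(self):
            dbparams = {
                'host': '127.0.0.1',
                'port': 3306,
                'user': 'root',
                'password': '123456',
                'database': 'jianshu',
                'charset': 'utf8',
                'cursorclass': pymysql.cursors.DictCursor
            }
            # adbapi runs the blocking pymysql calls in a thread pool
            self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)

        def process_item(self, item, spider):
            defer = self.dbpool.runInteraction(self.insert_item, item)
            defer.addErrback(self.handle_error, item, spider)
            return item

        def insert_item(self, cursor, item):
            sql = """
            insert into js(id,title,content,author,avatar,pub_time,origin_url,
            article_id,read_count,like_count,word_count,subjects,comment_count)
            values (null,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            """
            cursor.execute(sql, (item['title'], item['content'], item['author'],
                                 item['avatar'], item['pub_time'], item['origin_url'],
                                 item['article_id'], item['read_count'], item['like_count'],
                                 item['word_count'], item['subjects'], item['comment_count']))

        def handle_error(self, error, item, spider):
            # log failed inserts instead of crashing the crawl
            print(error)

    If you swap this in, point ITEM_PIPELINES at jianshu.pipelines.JianshuTwistedPipeline instead of the synchronous class.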
    

    settings.py

    For the middleware defined above to take effect, it must be enabled in settings.py. In short we need to:

    • Set a User-Agent
    • Disable the robots.txt protocol (ROBOTSTXT_OBEY)
    • Set a reasonable download delay, otherwise the server will block us
    • Enable the downloader middleware
    • Enable the item pipeline
    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 3
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
    }
    DOWNLOADER_MIDDLEWARES = {
       'jianshu.middlewares.SeleniumDownloadMiddleware': 543,
    }
    ITEM_PIPELINES = {
       'jianshu.pipelines.JianshuPipeline': 300,
    }
    # ... other settings left at their defaults
    

    Running the crawler

    We can run the spider from the terminal, or create a start.py that issues the command-line call from a Python file, so the crawler can be launched by running that script.

    from scrapy import cmdline
    cmdline.execute(['scrapy','crawl','js'])
    
