Scrapy + MySQL: Crawling the Douban Movie TOP250


Author: hu1991die | Published 2017-11-21 23:51

    Honestly, I don't know why, but whenever you ask people who have written crawlers, whether as a hobby or as complete beginners, they all seem to have picked Douban as their practice target, to the point that nowadays you can hardly claim to have done any web scraping if you haven't crawled Douban. Alright, let's get down to business...

    1. System Environment

    Python: 2.7.12 (64-bit)
    Scrapy: 1.4.0
    MySQL: 5.6.35 (64-bit)
    OS: Windows 10 (64-bit)
    MySQLdb: MySQL-python-1.2.3.win-amd64-py2.7 (64-bit)
    IDE: PyCharm 2016.3.3 (64-bit)

    2. Installing the MySQL Database

    Official download page: http://www.mysql.com/downloads/
    It is also worth installing a GUI client. I use Navicat for MySQL 11.0.9, download page: http://www.formysql.com/xiazai_mysql.html

    2.1 Installing MySQLdb

    OK, at this point MySQL itself is installed; next you need to install MySQLdb.

    2.2 What is MySQLdb?

    MySQLdb is the interface for connecting to a MySQL database from Python. It implements the Python Database API Specification v2.0 and is built on top of the MySQL C API; in short, it plays the same role as JDBC does in Java.

    2.3 How to install MySQLdb

    You currently have two options:

    • 1. Install a pre-built binary package (strongly recommended)
    • 2. Download the source from the official site and compile it yourself (how well this goes depends somewhat on your luck; if you enjoy tinkering, give it a try. I won't cover it here, a quick web search will turn up instructions.)

    OK, we'll take the first option. Download page: http://www.codegood.com/downloads. Pick the build that matches your system, double-click the installer once the download finishes, optionally change the installation path, and click Next through the rest.


    2.4 Verifying that MySQLdb installed correctly

    Open cmd, start python, then type import MySQLdb and check whether an error is raised. If there is no error, MySQLdb has been installed successfully!
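
    For example, from the Python 2.7 interpreter (a minimal check; the exact version tuple depends on your installation):

    import MySQLdb                 # raises ImportError if the driver is missing
    print MySQLdb.version_info     # e.g. (1, 2, 3, 'final', 0)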


    2.5 How to use MySQLdb

    See this tutorial for the basics: http://www.runoob.com/python/python-mysql.html
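
    As a quick taste of the API, here is a minimal sketch of connecting and running a query (host, user, password and database name are placeholders; the pipeline later in this post uses the same pattern):

    import MySQLdb

    # open a connection (use your own credentials)
    conn = MySQLdb.Connect(host='127.0.0.1', port=3306, user='root',
                           passwd='123456', db='testdb', charset='utf8')
    cursor = conn.cursor()
    cursor.execute("SELECT VERSION()")
    print cursor.fetchone()        # e.g. ('5.6.35',)
    cursor.close()
    conn.close()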

    2.6 Getting familiar with XPath

    When scraping web pages, the most common task is extracting data from the HTML source. Several libraries can do this:

    • BeautifulSoup: a very popular web-parsing library among programmers. It builds a Python object from the structure of the HTML and handles bad markup gracefully, but it has one drawback: it is slow.
    • lxml: a Pythonic XML parsing library (which also parses HTML) based on the ElementTree API (lxml itself is not part of the Python standard library).
    • XPath: the XML Path Language, a language for addressing parts of an XML document. XPath is based on the tree structure of XML, with different kinds of nodes (element, attribute, and text nodes), and provides the ability to locate nodes within that tree.

    Scrapy has its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document specified by XPath or CSS expressions.

    For details on how to use XPath, see the official specification: https://www.w3.org/TR/xpath/
    or the Chinese-language tutorial: http://www.w3school.com.cn/xpath/index.asp
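
    To get a feel for how Scrapy selectors and XPath work together, here is a tiny sketch (the HTML snippet is made up, but it mirrors the structure of the Douban list):

    from scrapy.selector import Selector

    html = u'<ol class="grid_view"><li><div class="hd"><a><span class="title">The Shawshank Redemption</span></a></div></li></ol>'
    sel = Selector(text=html)
    # pick the text of the first <span class="title"> inside the list
    print sel.xpath('//ol[@class="grid_view"]/li//span[@class="title"]/text()').extract_first()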

    OK, with this groundwork in place, we can start writing the crawler itself. The target is the Douban Movie TOP250: https://movie.douban.com/top250

    3. Writing the Crawler

    First, open the URL in Chrome or Firefox and take a look at the HTML structure of the page; press F12 to open the developer tools and inspect the source. From the page we can see that everything we eventually want to extract is wrapped inside the <ol> element whose class attribute is grid_view, so that <ol> defines our parsing scope: treat it as the outer frame and then locate the individual fields inside it.
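
    Before writing the spider, the XPath can be tried out in scrapy shell (a quick sketch; if Douban rejects the request, first set the USER_AGENT described near the end of this post):

    # start the shell from the command line:
    #   scrapy shell "https://movie.douban.com/top250"
    # then, at the shell prompt:
    movieList = response.xpath('//ol[@class="grid_view"]/li')
    print len(movieList)                                               # 25 movies per page
    print movieList[0].xpath('.//span[@class="title"]/text()').extract_first()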


    I won't belabor the details here; let's go straight to the code.
    The complete code has been uploaded to GitHub at git@github.com:hu1991die/douan_movie_spider.git; forks and clones are welcome!
    1. DoubanMovieTop250Spider.py

    # encoding: utf-8
    '''
    @author: feizi
    @file: DoubanMovieTop250Spider.py
    @Software: PyCharm
    @desc:
    '''
    import re
    
    from scrapy import Request
    from scrapy.spiders import Spider
    from douan_movie_spider.items import DouanMovieItem
    
    class DoubanMovieTop250Spider(Spider):
        name = 'douban_movie_top250'
    
        def start_requests(self):
            url = 'https://movie.douban.com/top250'
            yield Request(url)
    
        def parse(self, response):
            movieList = response.xpath('//ol[@class="grid_view"]/li')
            for movie in movieList:
                # create a fresh item for every movie instead of reusing one instance
                item = DouanMovieItem()
                # rank
                rank = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
                # cover (poster URL)
                cover = movie.xpath('.//div[@class="pic"]/a/img/@src').extract_first()
                # title
                title = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
                # score
                score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
                # number of ratings
                comment_num = movie.xpath('.//div[@class="star"]/span[4]/text()').re(ur'(\d+)')[0]
                # one-line quote
                quote = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
                # release year(s), release region, genres
                briefList = movie.xpath('.//div[@class="bd"]/p/text()').extract()
                years, region, types = '', '', ''
                if briefList:
                    # split on '/'
                    briefs = re.split(r'/', briefList[1])
                    # genres
                    types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[len(briefs) - 1])[0]
                    # release region
                    region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[len(briefs) - 2])[0]
                    if len(briefs) <= 3:
                        # a single release year
                        years = re.compile(ur'(\d+)').findall(briefs[len(briefs) - 3])[0]
                    else:
                        # several release years, joined with ','
                        years = ''
                        for brief in briefs:
                            if hasNumber(brief):
                                years = years + re.compile(ur'(\d+)').findall(brief)[0] + ","
                                print years

                    if types:
                        # separate the genres with ',' instead of spaces
                        types = types.replace(" ", ",")

                print(rank, cover, title, score, comment_num, quote, years, region, types)
                item['rank'] = rank
                item['cover'] = cover
                item['title'] = title
                item['score'] = score
                item['comment_num'] = comment_num
                item['quote'] = quote
                item['years'] = years
                item['region'] = region
                item['types'] = types
                yield item

            # follow the link to the next page, if any
            next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
            if next_url:
                next_url = 'https://movie.douban.com/top250' + next_url
                yield Request(next_url)
    
    def hasNumber(str):
        return bool(re.search('\d+', str))
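
    To make the year/region/genre parsing in parse() concrete, here is how a typical info line splits (the sample string is my own illustration of the page format):

    # encoding: utf-8
    import re

    # second text node of <div class="bd"><p>, e.g. for The Shawshank Redemption
    brief = u' 1994 / 美国 / 犯罪 剧情'
    briefs = re.split(r'/', brief)                                     # [u' 1994 ', u' 美国 ', u' 犯罪 剧情']
    types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[-1])[0]  # u'犯罪 剧情'
    region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[-2])[0]  # u'美国'
    years = re.compile(r'(\d+)').findall(briefs[-3])[0]                # u'1994'
    print years, region, types.replace(u' ', u',')                     # 1994 美国 犯罪,剧情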
    

    2. items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    # item class for one movie
    class DouanMovieItem(scrapy.Item):
        # rank
        rank = scrapy.Field()
        # cover (poster URL)
        cover = scrapy.Field()
        # title
        title = scrapy.Field()
        # score
        score = scrapy.Field()
        # number of ratings
        comment_num = scrapy.Field()
        # one-line quote
        quote = scrapy.Field()
        # release year(s)
        years = scrapy.Field()
        # release region
        region = scrapy.Field()
        # genres
        types = scrapy.Field()
    

    3. pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import MySQLdb
    from scrapy.exceptions import DropItem
    
    from douan_movie_spider.items import DouanMovieItem
    
    # obtain a database connection
    def getDbConn():
        conn = MySQLdb.Connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            passwd='123456',
            db='testdb',
            charset='utf8'
        )
        return conn
    
    # release database resources
    def closeConn(cursor, conn):
        # close the cursor
        if cursor:
            cursor.close()
        # close the database connection
        if conn:
            conn.close()
    
    
    class DouanMovieSpiderPipeline(object):
        def __init__(self):
            self.ids_seen = set()
    
        def process_item(self, item, spider):
            # drop duplicates (keyed on the movie title)
            if item['title'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['title'])
            if item.__class__ == DouanMovieItem:
                self.insert(item)
            # always return the item so later pipelines / exporters can see it
            return item
    
        def insert(self, item):
            try:
                # obtain a database connection
                conn = getDbConn()
                # obtain a cursor
                cursor = conn.cursor()
                # insert the record
                sql = "INSERT INTO db_movie(rank, cover, title, score, comment_num, quote, years, region, types) VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)"
                params = (item['rank'], item['cover'], item['title'], item['score'], item['comment_num'], item['quote'], item['years'], item['region'], item['types'])
                cursor.execute(sql, params)

                # commit the transaction
                conn.commit()
            except Exception, e:
                # roll back the transaction on error
                conn.rollback()
                print 'except:', e.message
            finally:
                # release the cursor and the connection
                closeConn(cursor, conn)
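
    The pipeline assumes that the testdb database already contains a db_movie table. Here is a minimal sketch for creating it with MySQLdb (the column types are my own assumption, not taken from the original project; adjust them as needed):

    # encoding: utf-8
    import MySQLdb

    conn = MySQLdb.Connect(host='127.0.0.1', port=3306, user='root',
                           passwd='123456', db='testdb', charset='utf8')
    cursor = conn.cursor()
    # backticks guard against reserved-word clashes on newer MySQL versions
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS db_movie (
            `id`          INT AUTO_INCREMENT PRIMARY KEY,
            `rank`        VARCHAR(8),
            `cover`       VARCHAR(255),
            `title`       VARCHAR(128),
            `score`       VARCHAR(8),
            `comment_num` VARCHAR(16),
            `quote`       VARCHAR(255),
            `years`       VARCHAR(64),
            `region`      VARCHAR(64),
            `types`       VARCHAR(128)
        ) DEFAULT CHARSET=utf8
    """)
    cursor.close()
    conn.close()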
    

    4. main.py

    # encoding: utf-8
    '''
    @author: feizi
    @file: main.py
    @Software: PyCharm
    @desc:
    '''
    
    from scrapy import cmdline
    
    name = "douban_movie_top250"
    # cmd = "scrapy crawl {0} -o douban.csv".format(name)
    cmd = "scrapy crawl {0}".format(name)
    cmdline.execute(cmd.split())
    

    5. settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for douan_movie_spider project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'douan_movie_spider'
    
    SPIDER_MODULES = ['douan_movie_spider.spiders']
    NEWSPIDER_MODULE = 'douan_movie_spider.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'douan_movie_spider.middlewares.DouanMovieSpiderSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'douan_movie_spider.middlewares.MyCustomDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'douan_movie_spider.pipelines.DouanMovieSpiderPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    

    One thing to note: to keep the crawler from getting banned, we can set the USER_AGENT.
    Press F12 again, look at the Request Headers, find the User-Agent value, and put it into the settings file. Of course, this is only the simplest measure; more elaborate strategies such as IP pools and User-Agent pools are easy to find with a search, so I won't go into them here.



    4. Running the Crawler

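    The crawler can be started either by running main.py (for example from PyCharm) or with the Scrapy command line from the project root; the commented line in main.py shows how to export a CSV at the same time:

    # either of the following starts the spider from the project root:
    #   python main.py
    #   scrapy crawl douban_movie_top250
    # optionally also export the items to a CSV file:
    #   scrapy crawl douban_movie_top250 -o douban.csv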

    5. Saving the Results

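    Once the run finishes, the saved rows can be inspected in Navicat, or checked quickly from Python (a minimal sketch reusing the connection settings from the pipeline):

    import MySQLdb

    conn = MySQLdb.Connect(host='127.0.0.1', port=3306, user='root',
                           passwd='123456', db='testdb', charset='utf8')
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM db_movie")
    print cursor.fetchone()[0]     # should be 250 once the whole TOP250 has been crawled
    cursor.close()
    conn.close()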

    6. Simple Data Visualization

    Finally, here are a few simple visualizations built from the crawled data.

    6.1 Top 10 by score

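    A chart like this can be produced straight from the db_movie table. Below is a rough sketch with matplotlib (my own illustration, not the code behind the original chart; it assumes matplotlib is installed, and Chinese titles additionally need a CJK-capable font configured):

    # encoding: utf-8
    import MySQLdb
    import matplotlib.pyplot as plt

    conn = MySQLdb.Connect(host='127.0.0.1', port=3306, user='root',
                           passwd='123456', db='testdb', charset='utf8')
    cursor = conn.cursor()
    # cast in case score is stored as a text column
    cursor.execute("SELECT title, score FROM db_movie "
                   "ORDER BY CAST(score AS DECIMAL(3,1)) DESC LIMIT 10")
    rows = cursor.fetchall()
    cursor.close()
    conn.close()

    titles = [row[0] for row in rows]
    scores = [float(row[1]) for row in rows]
    plt.barh(range(len(titles)), scores)
    plt.yticks(range(len(titles)), titles)
    plt.xlabel('score')
    plt.title('Douban TOP250: top 10 by score')
    plt.show()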

    6.2 Title word cloud

    6.3 Quote word cloud

    6.4 Top 10 by number of ratings

    6.5 Number of films released per year

    6.6 Release region statistics

    6.7 Genre summary

    The complete project code has been uploaded to GitHub: https://github.com/hu1991die/douan_movie_spider. Forks are welcome~~~
