Crawling Maoyan Movies with Scrapy and Storing Them in MongoDB


Author: Treehl | Published 2017-12-21 22:45

    I got started with Scrapy a while back by crawling the Douban Top 250 movies with it. Lately I plan to study distributed crawling with scrapy-redis, so before that I'm brushing up on Scrapy itself. This write-up condenses a lot; for the fuller introduction, see my earlier Douban movie post.

    Hands-on walkthrough

    Open CMD and run:

    scrapy startproject maoyan

    C:.
    │  scrapy.cfg
    │
    └─maoyan
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        └─spiders
                __init__.py
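
    The spider module itself isn't shown in the tree yet; one way to create it (not covered in the original steps, run from inside the project directory) is Scrapy's built-in crawl template, matching the spider name used below:

    scrapy genspider -t crawl my maoyan.com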
    

    Edit items.py:

    import scrapy
    
    
    class MaoyanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        movie_name = scrapy.Field()
        movie_ename = scrapy.Field()
        movie_type = scrapy.Field()
        movie_publish = scrapy.Field()
        movie_time = scrapy.Field()
        movie_star = scrapy.Field()
        movie_total_price = scrapy.Field()
    
    • First, import scrapy.
    • Next, define a class that subclasses scrapy.Item. It is the container that holds the scraped data, written much like an ORM model.
    • The fields we record: the movie's name, its English name, its type, its release date, its running time, and its rating.
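
    As a quick sketch of how this container behaves (field names from items.py above; the values are invented), an Item can be used much like a dict:

    from maoyan.items import MaoyanItem

    item = MaoyanItem()
    item['movie_name'] = '霸王别姬'
    item['movie_star'] = '9.5'
    print(dict(item))  # {'movie_name': '霸王别姬', 'movie_star': '9.5'}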

    Fetching the page data

    XPath makes it very convenient to pick elements out of a page; see the W3Schools XPath tutorial if you need a refresher. Next, let's define the URL crawl rules.


    The list pages follow the pattern http://maoyan.com/films?offset=30.
    A regular expression for the next-page links (note that ? and . must be escaped):
    r'http://maoyan\.com/films\?offset=\d+'

    What we actually want to scrape are the movie detail pages, which look like http://maoyan.com/films/1170264

    The regex for the detail-page links:
    r'http://maoyan\.com/films/\d+'
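
    To sanity-check both patterns before handing them to a LinkExtractor, here's a quick test with Python's re module (the sample URLs are the ones shown above):

    import re

    next_page = re.compile(r'http://maoyan\.com/films\?offset=\d+')
    detail = re.compile(r'http://maoyan\.com/films/\d+')

    print(bool(next_page.match('http://maoyan.com/films?offset=30')))  # True
    print(bool(detail.match('http://maoyan.com/films/1170264')))       # True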

    With that settled, edit the spider:

    
    from scrapy.spiders import Rule, CrawlSpider
    from scrapy.selector import Selector
    from scrapy.linkextractors import LinkExtractor
    from maoyan.items import MaoyanItem
    
    
    class MaoyanmovieSpider(CrawlSpider):
        name = 'my'
        # allowed_domains = ['maoyan.com']
        start_urls = ['http://maoyan.com/films']
        rules = (
            # Follow pagination links without a callback so the crawl keeps going
            Rule(LinkExtractor(allow=(r'http://maoyan\.com/films\?offset=\d+',))),
            # Hand each movie detail page to parse_item
            Rule(LinkExtractor(allow=(r'http://maoyan\.com/films/\d+',)), callback='parse_item')
        )
    
        def parse_item(self, response):
            sel = Selector(response)
            # Absolute XPaths copied from the browser; they will break if Maoyan changes its layout
            movie_name = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()
            movie_ename = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/div/text()').extract()
            movie_type = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[1]/text()').extract()
            movie_publish = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()').extract()
            movie_time = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()').extract()
            movie_star = sel.xpath('/html/body/div[3]/div/div[2]/div[3]/div[1]/div/span/span/text()').extract()
            # movie_total_price = sel.xpath('/html/body/div[3]/div/div[2]/div[3]/div[2]/div/span[1]/text()').extract()
    
            item = MaoyanItem()
            item['movie_name'] = movie_name
            item['movie_ename'] = movie_ename
            item['movie_type'] = movie_type
            item['movie_publish'] = movie_publish
            item['movie_time'] = movie_time
            item['movie_star'] = movie_star
            # item['movie_total_price'] = movie_total_price
    
            yield item
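
    The absolute XPaths above were copied out of the browser's developer tools, so it's worth verifying them interactively with scrapy shell before running the full crawl, for example:

    scrapy shell http://maoyan.com/films/1170264
    >>> response.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()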
    

    Once the spider is written, we store the data in MongoDB. Edit pipelines.py:

    import pymongo
    from scrapy.exceptions import DropItem


    class MongoDBPipeline(object):
        def __init__(self, server, port, db, collection):
            self.server = server
            self.port = port
            self.db = db
            self.collection_name = collection

        @classmethod
        def from_crawler(cls, crawler):
            # Pull the connection details out of settings.py
            return cls(
                server=crawler.settings.get('MONGODB_SERVER'),
                port=crawler.settings.get('MONGODB_PORT'),
                db=crawler.settings.get('MONGODB_DB'),
                collection=crawler.settings.get('MONGODB_COLLECTION')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.server, self.port)
            self.collection = self.client[self.db][self.collection_name]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # item: the scraped Item; spider: the spider that scraped it.
            # Drop any item with an empty field rather than storing partial records
            for field, value in item.items():
                if not value:
                    raise DropItem('Missing %s in %s' % (field, item))
            # Insert one document per movie
            self.collection.insert_one(dict(item))
            spider.logger.debug('Item written to MongoDB %s/%s',
                                self.db, self.collection_name)
            return item
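
    Note that this pipeline only drops items with empty fields; it does not actually de-duplicate. One hedged option (my addition, not in the original post) is a unique index on a field assumed to be unique, created once when the spider opens; duplicate inserts then raise pymongo.errors.DuplicateKeyError:

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.server, self.port)
            self.collection = self.client[self.db][self.collection_name]
            # Assumption: movie_name is unique enough to act as a key
            self.collection.create_index('movie_name', unique=True)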
    

    Configuration
    Open settings.py:

    BOT_NAME = 'maoyan'
    
    SPIDER_MODULES = ['maoyan.spiders']
    NEWSPIDER_MODULE = 'maoyan.spiders'
    ROBOTSTXT_OBEY = False
    COOKIES_ENABLED = True
    DOWNLOAD_DELAY = 3
    LOG_LEVEL = 'DEBUG'
    RANDOMIZE_DOWNLOAD_DELAY = True
    # Disable redirects
    REDIRECT_ENABLED = False
    # Treat 302 responses as normal ones so cookies can still be written
    HTTPERROR_ALLOWED_CODES = [302,]
    
    ITEM_PIPELINES = {
        'maoyan.pipelines.MongoDBPipeline': 300,
    }
    
    MONGODB_SERVER = 'localhost'
    MONGODB_PORT = 27017
    MONGODB_DB = 'maoyan'
    MONGODB_COLLECTION = 'movies'
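
    One addition worth considering here (an assumption of mine, not part of the original settings): a browser-like User-Agent makes the requests look less like a bot.

    # Hypothetical addition: present a browser-like User-Agent
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36')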
    

    Good, now start the crawler:

    scrapy crawl my
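
    Once a few items have been scraped, a quick pymongo check (using the database and collection names from settings.py above) confirms the data landed:

    import pymongo

    client = pymongo.MongoClient('localhost', 27017)
    # Print one stored movie to confirm the pipeline worked
    print(client['maoyan']['movies'].find_one())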


    Running this crawler you will most likely hit 302 redirects or get flagged as a bot. Lengthening the delay helps, but crawling then becomes painfully slow. There are 23,110 list pages with 30 movies each, 693,300 records in total; even without getting banned, the crawl would take ages.
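
    A rough back-of-envelope, assuming one request per list page plus one per detail page at the DOWNLOAD_DELAY = 3 configured above:

    pages, per_page, delay = 23110, 30, 3
    requests = pages + pages * per_page    # list pages + detail pages
    print(requests * delay / 86400)        # ≈ 24.9 days of pure delay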
    Enough said; time to go learn distributed crawling!


    You're welcome to visit my blog: Treehl's blog
    Full code on GitHub
    Also posted on Jianshu
    Finally, here is a collection of spiders I wrote recently while learning Python. If you like it, give it a Star!
    SpiderList
