美文网首页
Python爬虫——电影top榜

Python爬虫——电影top榜

作者: _羊羽_ | 来源:发表于2018-10-02 22:38 被阅读801次

    猫眼电影TOP100榜

    爬取内容名分析

    image.png
    image.png

    图片的实际是data-src,而不是src需要实际看一下请求数据返回的response值

    from Toscrape.items import MaoyanItem
    import scrapy
    
    class MaoyanMovieSpider(scrapy.Spider):
        name = 'maoyanMovie'
        allowed_domains = ['maoyan.com']
        start_urls = ['http://maoyan.com/board/4']
    
        def parse(self, response):
              itemList = response.xpath("//dl[@class='board-wrapper']/dd")
              for item in itemList:
                  movie = MaoyanItem()
                  movie['index'] = item.xpath(".//i[starts-with(@class,'board-index')]/text()").extract_first()
                  movie['title'] = item.xpath(".//div[@class='movie-item-info']/p[@class='name']/a[1]/text()").extract_first()
                  movie['pict'] = item.xpath(".//a[@class='image-link']/img[@class='board-img']/@data-src").extract_first()
                  movie['star'] = item.xpath("normalize-space(.//div[@class='movie-item-info']/p[@class='star']/text())").extract_first()
                  movie['time'] = item.xpath(".//div[@class='movie-item-info']/p[@class='releasetime']/text()").extract_first()
                  scoreList = item.xpath(".//div[@class='movie-item-number score-num']/p[@class='score']")
                  for scroe in scoreList:
                      integer = scroe.xpath("./i[@class='integer']/text()").extract_first()
                      fraction = scroe.xpath("./i[@class='fraction']/text()").extract_first()
                      movie['score'] = integer+fraction
    
                  yield movie
    
              next_url= response.xpath("//ul[@class='list-pager']//a[contains(text(),'下一页')]/@href").extract_first()
              if next_url:
                  yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
    

    创建item需要获取的内容选项、

    class MaoyanItem(scrapy.Item):
        # define the fields for your item here like:
         index = scrapy.Field()
         title = scrapy.Field()
         pict = scrapy.Field()
         time = scrapy.Field()
         star = scrapy.Field()
         score = scrapy.Field()
    

    创建main.py执行爬虫任务

    
    from scrapy.cmdline import execute
    
    # 执行命令启动爬虫
    execute(["scrapy", "crawl", 'maoyanMovie'])
    
    image.png

    豆瓣电影 Top 250

    爬取内容名分析

    image.png
    <li>
                <div class="item">
                    <div class="pic">
                        <em class="">1</em>
                        <a href="https://movie.douban.com/subject/1292052/">
                            <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                        </a>
                    </div>
                    <div class="info">
                        <div class="hd">
                            <a href="https://movie.douban.com/subject/1292052/" class="">
                                <span class="title">肖申克的救赎</span>
                                        <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                    <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                            </a>
                                <span class="playable">[可播放]</span>
                        </div>
                        <div class="bd">
                            <p class="">
                                导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                                1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                            </p> 
                            <div class="star">
                                    <span class="rating5-t"></span>
                                    <span class="rating_num" property="v:average">9.6</span>
                                    <span property="v:best" content="10.0"></span>
                                    <span>1153984人评价</span>
                            </div>
                                <p class="quote">
                                    <span class="inq">希望让人自由。</span>
                                </p>
                        </div>
                    </div>
                </div>
            </li>
    

    爬取内容字段

    内容 描述
    index 电影排名
    name 电影名称
    director 电影导演
    starring 电影主演
    rating 电影评分
    evaluate 电影评分
    pict 电影剧照
    year 电影上映时间
    nation 电影所属国家
    tags 电影类型

    items.py 增加需要爬取的内容

    class MovieItem(scrapy.Item):
        # define the fields for your item here like:
         index = scrapy.Field()
         name = scrapy.Field()
         director =scrapy.Field()
         starring = scrapy.Field()
         rating= scrapy.Field()
         evaluate = scrapy.Field()
         pict = scrapy.Field()
         year = scrapy.Field()
         nation = scrapy.Field()
         tags = scrapy.Field()
    
    import scrapy
    import re
    from Toscrape.items import MovieItem
    
    class MovietopSpider(scrapy.Spider):
        name = 'movieTop'
        allowed_domains = ['movie.douban.com']
        start_urls = ['http://movie.douban.com/top250/']
    
        def parse(self, response):
            movieList = response.xpath("//ol[@class='grid_view']//li")
            for item in movieList:
                movie = MovieItem()
                movie['index'] = item.xpath(".//div[@class='pic']//em/text()").extract_first()
                movie['name'] = item.xpath(".//div[@class='hd']/a/span[@class='title'][1]/text()").extract_first()
                movie['rating'] = item.xpath(".//div[@class='star']/span[@class='rating_num']/text()").extract_first()
                movie['evaluate'] = item.xpath(".//p[@class='quote']/span[@class='inq']/text()").extract_first()
                movie['pict'] = item.xpath(".//div[@class='pic']/a[1]/img[1]/@src").extract_first()
                list = item.xpath(".//div[@class='bd']/p[1]/text()").extract()
                for content in list:
                    location = re.search(r'导演:(.*?)\s{3}(.*)...', content)
                    separator = re.search(r'(\d{4}.*\)?\s)/(.*?)/(.*)', content)
                    if location:
                        movie['director'] = "".join(location.group(1).split())
                        movie['starring'] = "".join(location.group(2).split())[3:]
                    elif separator:
                        movie['year'] = separator.group(1).strip()
                        movie['nation'] = separator.group(2).strip()
                        movie['tags'] = separator.group(3).strip()
                    else:
                        movie['director'] = "".join(content.split())
                yield movie
    
            next_url = response.xpath('//span[@class="next"]/link/@href').extract_first()
            if next_url:
                yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
    

    main.py文件执行爬虫

    from scrapy.cmdline import execute
    # 执行命令启动爬虫
    execute(["scrapy", "crawl", 'movieTop'])
    
    image.png

    相关文章

      网友评论

          本文标题:Python爬虫——电影top榜

          本文链接:https://www.haomeiwen.com/subject/hmoooftx.html