Scraping the Douban Movie Top 250 with Scrapy

Author: 我的袜子都是洞 | Published: 2018-10-26 11:11 | Read 12 times

    Target site: https://movie.douban.com/top250

    [Screenshot: Jietu20181026-110711@2x.jpg]

    Define the structured item to extract (movie ranking, movie title, rating, number of ratings):
    items.py

    import scrapy
    
    class DoubanMovieItem(scrapy.Item):
        # One field per value scraped from each Top 250 entry.
        ranking = scrapy.Field()     # position in the Top 250 list
        movie_name = scrapy.Field()  # movie title
        score = scrapy.Field()       # rating
        score_num = scrapy.Field()   # number of ratings
    
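Values extracted from a page with XPath arrive as strings, so a later step may want to convert them to numbers. The following is a minimal sketch of an item pipeline that does this, not part of the original project: the class name `CleanMoviePipeline` is hypothetical, and the assumption that `score_num` contains a run of digits followed by other text is mine.

```python
import re


class CleanMoviePipeline:
    """Hypothetical pipeline that converts scraped strings into numbers."""

    def process_item(self, item, spider):
        # Ranking text such as "1" becomes an int.
        if item.get('ranking'):
            item['ranking'] = int(item['ranking'])
        # Rating text such as "9.0" becomes a float.
        if item.get('score'):
            item['score'] = float(item['score'])
        # Assumption: score_num holds digits plus trailing text; keep the first digit run.
        match = re.search(r'\d+', item.get('score_num') or '')
        item['score_num'] = int(match.group()) if match else None
        return item
```

A pipeline like this would only run once it is registered in the project's `ITEM_PIPELINES` setting. Missing values are left as `None` rather than raising, so one malformed entry does not stop the crawl.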

    Spider source code:
    spider.py

    import scrapy
    from ..items import DoubanMovieItem
    
    class SinaSpider(scrapy.Spider):
        name = 'douban'
        start_urls = [
            "https://movie.douban.com/top250",
        ]
    
        def parse(self, response):
            # Each movie entry on the page is wrapped in <div class="item">.
            movies = response.xpath("//div[@class='item']")
            for movie in movies:
                # Create a fresh item per movie so yielded items do not share state.
                item = DoubanMovieItem()
                item['ranking'] = movie.xpath("./div/em/text()").extract_first()
                item['movie_name'] = movie.xpath("./div/div/a/span[1]/text()").extract_first()
                item['score'] = movie.xpath("./div/div/div[@class='star']/span[@class='rating_num']/text()").extract_first()
                item['score_num'] = movie.xpath("./div/div/div[@class='star']/span[4]/text()").extract_first()
                yield item
    
            # Follow the "next" link; it is absent on the last page.
            next_page = response.xpath("//div[@class='paginator']/span[@class='next']/a/@href").extract_first()
            if next_page is not None:
                # The href is relative, so resolve it against the current page URL.
                next_url = response.urljoin(next_page)
                yield scrapy.Request(next_url, callback=self.parse)
    
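The pagination step depends on turning the relative href of the "next" link into a full URL. The sketch below shows that resolution in isolation with the standard library's `urljoin`; the helper name `next_page_url` and the sample href are assumptions for illustration, and the real href comes from the paginator link on each page.

```python
from urllib.parse import urljoin

BASE_URL = "https://movie.douban.com/top250"


def next_page_url(base_url, href):
    """Resolve a relative "next page" href against the listing URL."""
    if href is None:
        return None  # no "next" link, so the last page has been reached
    return urljoin(base_url, href)


# Assumed example href consisting only of a query string.
print(next_page_url(BASE_URL, "?start=25&filter="))
# prints https://movie.douban.com/top250?start=25&filter=
```

Inside a spider, Scrapy's `response.urljoin(href)` performs the same resolution relative to the URL of the current response, which is more robust than concatenating strings by hand.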

    Result:

    [Screenshot: Jietu20181026-111047@2x.jpg]

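The spider is started from the project directory with `scrapy crawl douban`. If a run returns HTTP 403 responses instead of items, one possible cause is that the site rejects Scrapy's default user agent. The fragment below is a sketch of settings that may help, not something the original post shows: the user agent string and the delay value are illustrative assumptions.

```python
# settings.py (sketch; all values are illustrative assumptions)

# Send a browser-like user agent; the exact string is a placeholder example.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Pause between requests (in seconds) to keep the load on the site low.
DOWNLOAD_DELAY = 1

# Let Scrapy adjust the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```

With only ten listing pages to fetch, a one-second delay adds little to the total run time.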

          Article link: https://www.haomeiwen.com/subject/qdjktqtx.html