Target site:
Jietu20181026-110711@2x.jpg
Extract structured items (movie ranking, movie title, movie rating, number of ratings):
items.py
import scrapy


class DoubanMovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ranking = scrapy.Field()
    movie_name = scrapy.Field()
    score = scrapy.Field()
    score_num = scrapy.Field()
Spider source code:
spider.py
import scrapy

from ..items import DoubanMovieItem


class SinaSpider(scrapy.Spider):
    name = 'douban'
    start_urls = [
        "https://movie.douban.com/top250",
    ]

    def parse(self, response):
        # Each movie on the page sits in its own <div class="item">
        movies = response.xpath("//div[@class='item']")
        for movie in movies:
            # Create a fresh item per movie so yielded items do not share state
            item = DoubanMovieItem()
            item['ranking'] = movie.xpath("./div/em/text()").extract_first()
            item['movie_name'] = movie.xpath("./div/div/a/span[1]/text()").extract_first()
            item['score'] = movie.xpath("./div/div/div[@class='star']/span[@class='rating_num']/text()").extract_first()
            item['score_num'] = movie.xpath("./div/div/div[@class='star']/span[4]/text()").extract_first()
            yield item

        # Follow the "next page" link; on the last page it is absent
        next_page = response.xpath("//div[@class='paginator']/span[@class='next']/a/@href").extract_first()
        if next_page is not None:
            # Resolve the relative href against the current page URL
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, callback=self.parse)
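
The href taken from the paginator is a relative link, so it has to be combined with the page URL before it can be requested. Below is a minimal sketch of that resolution using only the standard library. The helper name resolve_next_page and the href value "?start=25&filter=" are assumptions for illustration, not values taken from the screenshots.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, href):
    """Resolve a paginator href against the URL of the page it was found on."""
    return urljoin(current_url, href)


# Assumed example href; the real value comes from the paginator's <a> tag.
base = "https://movie.douban.com/top250"
next_url = resolve_next_page(base, "?start=25&filter=")
print(next_url)
```

Resolving against the current page URL, rather than concatenating strings, should also keep working if the site switches to absolute or path-based links.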
Result:
Jietu20181026-111047@2x.jpg
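
All four fields come out of the spider as strings. If numeric values are wanted, one option is an item pipeline that converts them. The following is only a sketch: the class name CleanMoviePipeline is hypothetical, the sample values are made up, and it assumes the score_num text is a number followed by a suffix such as "人评价".

```python
import re


class CleanMoviePipeline:
    """Hypothetical item pipeline: converts scraped strings to numbers."""

    def process_item(self, item, spider):
        if item.get("ranking"):
            item["ranking"] = int(item["ranking"])
        if item.get("score"):
            item["score"] = float(item["score"])
        if item.get("score_num"):
            # Keep only the leading run of digits, e.g. "1000人评价" -> 1000
            match = re.search(r"\d+", item["score_num"])
            item["score_num"] = int(match.group()) if match else None
        return item


# Made-up sample values; a plain dict stands in for a DoubanMovieItem here.
raw = {
    "ranking": "1",
    "movie_name": "Example Movie",
    "score": "9.7",
    "score_num": "1000人评价",
}
cleaned = CleanMoviePipeline().process_item(raw, spider=None)
print(cleaned)
```

To take effect in a real project, a pipeline like this would need to be registered in the ITEM_PIPELINES setting in settings.py.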