2. Python Web Scraping: Crawling Douban Movies with Scrapy


Author: abeb6ca9bb86 | Published 2017-03-01 22:44 | 205 reads

This article assumes you are familiar with the Scrapy framework and have completed an introductory tutorial.

Creating the Project

scrapy startproject tutorial
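This generates a project skeleton roughly like the following (exact files vary slightly by Scrapy version):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions go here
        pipelines.py
        settings.py
        spiders/          # spiders go here
            __init__.py
```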

In items.py, define the Item as follows:

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    movieInfo = scrapy.Field()
    star = scrapy.Field()
    quote = scrapy.Field()

Writing the Spider

The code:

import scrapy

class Douban(scrapy.Spider):
    name = "douban"
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        print(response.body)

Run it and see:

scrapy crawl douban

INFO: Closing spider (finished) indicates the spider ran successfully and shut itself down.

Creating a Main Entry Point

Create a file named main.py with the following content:

from scrapy import cmdline

cmdline.execute("scrapy crawl douban".split())

Troubleshooting: HTTP status code is not handled or not allowed

DEBUG: Ignoring response <403 http://movie.douban.com/top250>: HTTP status code is not handled or not allowed

Answer: the request was blocked. Add a USER_AGENT in settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
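The 403 happens because Douban rejects requests that don't look like they come from a browser. Scrapy attaches the USER_AGENT setting to every request it sends; the same idea can be sketched with the standard library (the request is only built here, not actually sent):

```python
import urllib.request

# The same browser-like User-Agent string used in settings.py above.
UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 '
      '(KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5')

# Build a request carrying the header; without it, Douban answers 403.
req = urllib.request.Request('http://movie.douban.com/top250',
                             headers={'User-Agent': UA})
print(req.get_header('User-agent'))
```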

Extracting the Data

OK, now that we can fetch the page, the next step is to extract the data.

Import the required packages and update the spider:

import scrapy
from tutorial.items import TutorialItem
from scrapy.selector import Selector
from scrapy.http import Request

class Douban(scrapy.Spider):
    name = "douban250"
    start_urls = ['http://movie.douban.com/top250']
    url = 'http://movie.douban.com/top250'

    def parse(self, response):
        selector = Selector(response)
        Movies = selector.xpath('//div[@class="info"]')
        for eachMovie in Movies:
            # create a fresh item per movie, so earlier results are not overwritten
            item = TutorialItem()
            # the title is split across several <span> tags; join them into one string
            title = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
            fullTitle = ''.join(title)
            movieInfo = eachMovie.xpath('div[@class="bd"]/p/text()').extract()
            star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # not every movie has a quote, so guard against an empty result
            quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            quote = quote[0] if quote else ''
            item['title'] = fullTitle
            item['movieInfo'] = ';'.join(movieInfo)
            item['star'] = star
            item['quote'] = quote
            yield item
        # follow the "next page" link until there is none
        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextLink:
            yield Request(self.url + nextLink[0], callback=self.parse)
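The per-movie extraction logic above can be sketched offline with the standard library, which supports a small subset of XPath. The HTML fragment below is a hand-made, simplified stand-in for one Douban "info" block, not the site's real markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one <div class="info"> block.
snippet = '''<div class="info">
  <div class="hd"><a><span>肖申克的救赎</span><span> / The Shawshank Redemption</span></a></div>
  <div class="bd">
    <div class="star"><span class="rating_num">9.7</span></div>
    <p class="quote"><span>希望让人自由。</span></p>
  </div>
</div>'''

info = ET.fromstring(snippet)
# Join all <span> texts under hd/a, just like fullTitle in the spider.
fullTitle = ''.join(s.text for s in info.findall('./div[@class="hd"]/a/span'))
star = info.find('./div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]').text
quote = info.find('./div[@class="bd"]/p[@class="quote"]/span').text
print(fullTitle, star, quote)
```

The same `@class` predicates work in both ElementTree and Scrapy's selectors, which is why the spider's XPath expressions carry over almost verbatim.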

Storing the Data, Option 1

-o sets the output file name; -t sets the export format.

scrapy crawl douban -o items.csv -t csv

Open the file in Numbers (or any spreadsheet application) to see the 250 ranked movies.

Storing the Data, Option 2

You can also set the output location and format directly in settings.py, like this:

FEED_URI = 'file:///E:/douban/douban.csv'

FEED_FORMAT = 'csv'
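Note that in Scrapy 2.1 and later, FEED_URI and FEED_FORMAT are deprecated in favor of the FEEDS setting; an equivalent configuration (using the same example path) would be:

```python
# settings.py — newer Scrapy (2.1+) feed-export style
FEEDS = {
    'file:///E:/douban/douban.csv': {
        'format': 'csv',
        'encoding': 'utf8',
    },
}
```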


Original link: https://www.haomeiwen.com/subject/wrgcgttx.html