Honestly, I have no idea why, but ask anyone who has written a crawler, whether out of hobby interest or as a complete beginner, and Douban always seems to be the practice target of choice. By now it almost feels like you can't claim to have done any scraping at all unless you've scraped Douban. Alright, down to business...
1. System Environment
Python: 2.7.12 (64-bit)
Scrapy: 1.4.0
MySQL: 5.6.35 (64-bit)
OS: Windows 10 (64-bit)
MySQLdb: MySQL-python-1.2.3.win-amd64-py2.7 (64-bit)
IDE: PyCharm 2016.3.3 (64-bit)
2. Installing MySQL
Official download page: http://www.mysql.com/downloads/
You may also want to install a GUI client along the way. I use Navicat for MySQL 11.0.9, download link: http://www.formysql.com/xiazai_mysql.html
2.1 Installing MySQLdb
OK, at this point MySQL itself is installed successfully. Next you need to install MySQLdb.
2.2 What is MySQLdb?
MySQLdb is the interface used to connect to a MySQL database from Python. It implements the Python Database API Specification v2.0 and is built on top of the MySQL C API; in short, it plays roughly the same role as JDBC does in Java.
2.3 How to install MySQLdb
You currently have two options:
- 1. Install a pre-built binary package (strongly recommended)
- 2. Download the source from the official site and compile it yourself (how well this goes really depends on your luck; if you enjoy tinkering, feel free to give it a try. I won't cover it here, just search for instructions.)
OK, we will go with the first option. Download page: http://www.codegood.com/downloads. Pick the build that matches your system, double-click the installer once the download finishes, change the install path if you like, and click Next all the way through.
2.4 Verifying the MySQLdb installation
Open cmd, run python, then type import MySQLdb. If no error is reported, MySQLdb was installed successfully.
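For reference, a successful check in the interactive interpreter looks roughly like this (the version string is just what the package above would report):

C:\> python
>>> import MySQLdb          # no ImportError means the install worked
>>> MySQLdb.__version__     # optional: print the driver version
'1.2.3'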
2.5 Using MySQLdb
For a full tutorial, see: http://www.runoob.com/python/python-mysql.html
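As a quick taste, a minimal connect-query-close round trip looks like this (host, port, user, password and database name are placeholders; adjust them to your own setup):

# encoding: utf-8
import MySQLdb

# open a connection (placeholder credentials, adjust to your environment)
conn = MySQLdb.connect(host='127.0.0.1', port=3306, user='root', passwd='123456', db='testdb', charset='utf8')
cursor = conn.cursor()

# run a trivial query and fetch the single-row result
cursor.execute("SELECT VERSION()")
print cursor.fetchone()

# always release the cursor and the connection
cursor.close()
conn.close()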
2.6 Getting familiar with XPath
When scraping web pages, the most common task is extracting data from the HTML source. Several libraries are available for this:
- BeautifulSoup: a web page parsing library that is very popular among programmers. It builds a Python object from the structure of the HTML and handles poorly formed markup quite gracefully, but it has one drawback: it is slow.
- lxml: a Pythonic XML parsing library (it also parses HTML) based on ElementTree (which is not part of the Python standard library).
- XPath: the XML Path Language, a language for locating parts of an XML document (XML being a subset of SGML). XPath works on the tree structure of an XML document, distinguishes different node types (element nodes, attribute nodes and text nodes), and provides the means to navigate to nodes within that tree.
Scrapy has its own mechanism for extracting data: so-called selectors, which "select" parts of an HTML document via XPath or CSS expressions (a short example follows the links below).
For XPath usage, see the official specification: https://www.w3.org/TR/xpath/
or a Chinese-language tutorial: http://www.w3school.com.cn/xpath/index.asp
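Here is a tiny, self-contained selector sketch; the HTML fragment is made up purely to illustrate the XPath expressions we will use later on:

# encoding: utf-8
from scrapy.selector import Selector

# a made-up fragment, just to demonstrate the XPath syntax used later in the spider
html = '<ol class="grid_view"><li><div class="hd"><a><span>肖申克的救赎</span></a></div></li></ol>'

# select every <li> under <ol class="grid_view">, then drill down to the title text
for li in Selector(text=html).xpath('//ol[@class="grid_view"]/li'):
    print li.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()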
OK, with these preliminaries out of the way, we can start writing the spider itself. We will use the Douban Movie Top 250 list as the example: https://movie.douban.com/top250
3. Writing the Spider
First, open the URL above in Chrome or Firefox and press F12 to open the developer tools so we can inspect the page's HTML structure. Looking at the page, we can see that everything we want to extract is wrapped inside the ol element whose class attribute is grid_view. That pins down the parsing scope: treat this ol element as the outer frame, then locate the individual fields inside it.
I won't belabor every detail here; let's get straight to the code.
The complete code has been pushed to GitHub at git@github.com:hu1991die/douan_movie_spider.git. Forks and clones are welcome!
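For orientation, the files listed below follow the standard layout generated by scrapy startproject; the tree is a sketch of this project's structure, with main.py assumed to sit at the project root as a convenience launcher:

douan_movie_spider/
├── scrapy.cfg
├── main.py
└── douan_movie_spider/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── DoubanMovieTop250Spider.py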
1. DoubanMovieTop250Spider.py
# encoding: utf-8
'''
@author: feizi
@file: DoubanMovieTop250Spider.py
@Software: PyCharm
@desc:
'''
import re

from scrapy import Request
from scrapy.spiders import Spider

from douan_movie_spider.items import DouanMovieItem


def hasNumber(s):
    # return True if the string contains at least one digit
    return bool(re.search(r'\d+', s))


class DoubanMovieTop250Spider(Spider):
    name = 'douban_movie_top250'

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url)

    def parse(self, response):
        movieList = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movieList:
            # create a fresh item for every movie (don't reuse one instance across yields)
            item = DouanMovieItem()
            # rank
            rank = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
            # cover image
            cover = movie.xpath('.//div[@class="pic"]/a/img/@src').extract_first()
            # title
            title = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            # rating
            score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            # number of ratings
            comment_num = movie.xpath('.//div[@class="star"]/span[4]/text()').re(ur'(\d+)')[0]
            # one-line quote
            quote = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            # release year, region and genres all live in the same line of text
            briefList = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            if briefList:
                # split on '/'
                briefs = re.split(r'/', briefList[1])
                # genres (last chunk, starting from the first CJK character)
                types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[len(briefs) - 1])[0]
                # region (second-to-last chunk)
                region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[len(briefs) - 2])[0]
                if len(briefs) <= 3:
                    # release year
                    years = re.compile(ur'(\d+)').findall(briefs[len(briefs) - 3])[0]
                else:
                    # several release dates: collect every chunk that contains digits
                    years = ''
                    for brief in briefs:
                        if hasNumber(brief):
                            years = years + re.compile(ur'(\d+)').findall(brief)[0] + ","
                    print years
                if types:
                    # replace the spaces between genres with ','
                    types = types.replace(" ", ",")

                print(rank, cover, title, score, comment_num, quote, years, region, types)

                item['rank'] = rank
                item['cover'] = cover
                item['title'] = title
                item['score'] = score
                item['comment_num'] = comment_num
                item['quote'] = quote
                item['years'] = years
                item['region'] = region
                item['types'] = types
                yield item

        # follow the link to the next page, if there is one
        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url
            yield Request(next_url)
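To make the year/region/genre parsing above concrete, here is a tiny standalone sketch of how the splits and regexes pick apart briefList[1]; the sample string is illustrative rather than captured from a live response:

# encoding: utf-8
import re

# illustrative second text line of div.bd > p for one entry
brief = u'1994 / 美国 / 犯罪 剧情'

briefs = re.split(r'/', brief)                                       # [u'1994 ', u' 美国 ', u' 犯罪 剧情']
types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[-1])[0]    # u'犯罪 剧情'
region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[-2])[0]    # u'美国'
years = re.compile(ur'(\d+)').findall(briefs[-3])[0]                 # u'1994'
print years, region, types.replace(u' ', u',')                       # 1994 美国 犯罪,剧情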
2. items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


# movie item: one scraped entry from the Top 250 list
class DouanMovieItem(scrapy.Item):
    # rank
    rank = scrapy.Field()
    # cover image
    cover = scrapy.Field()
    # title
    title = scrapy.Field()
    # rating
    score = scrapy.Field()
    # number of ratings
    comment_num = scrapy.Field()
    # one-line quote
    quote = scrapy.Field()
    # release year
    years = scrapy.Field()
    # region
    region = scrapy.Field()
    # genres
    types = scrapy.Field()
3. pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
from scrapy.exceptions import DropItem

from douan_movie_spider.items import DouanMovieItem


# open a database connection
def getDbConn():
    conn = MySQLdb.Connect(
        host='127.0.0.1',
        port=3306,
        user='root',
        passwd='123456',
        db='testdb',
        charset='utf8'
    )
    return conn


# release database resources
def closeConn(cursor, conn):
    # close the cursor
    if cursor:
        cursor.close()
    # close the connection
    if conn:
        conn.close()


class DouanMovieSpiderPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # drop duplicates, keyed on the movie title
        if item['title'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['title'])
            if item.__class__ == DouanMovieItem:
                self.insert(item)
            return item

    def insert(self, item):
        conn = None
        cursor = None
        try:
            # open a connection
            conn = getDbConn()
            # get a cursor
            cursor = conn.cursor()
            # insert one row
            sql = "INSERT INTO db_movie(rank, cover, title, score, comment_num, quote, years, region, types) VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)"
            params = (item['rank'], item['cover'], item['title'], item['score'], item['comment_num'], item['quote'], item['years'], item['region'], item['types'])
            cursor.execute(sql, params)
            # commit the transaction
            conn.commit()
        except Exception, e:
            # roll back on any error
            if conn:
                conn.rollback()
            print 'except:', e.message
        finally:
            # close the cursor and the connection
            closeConn(cursor, conn)
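The pipeline assumes a db_movie table already exists in testdb. The post doesn't show the schema, so here is one plausible way to create it (the column names match the item fields, but the column types and sizes are my own assumption):

# encoding: utf-8
# one-off helper to create the target table (schema is an assumption, not from the project)
import MySQLdb

conn = MySQLdb.Connect(host='127.0.0.1', port=3306, user='root', passwd='123456', db='testdb', charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS db_movie (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        rank        VARCHAR(8),
        cover       VARCHAR(255),
        title       VARCHAR(128),
        score       VARCHAR(8),
        comment_num VARCHAR(16),
        quote       VARCHAR(255),
        years       VARCHAR(32),
        region      VARCHAR(64),
        types       VARCHAR(64)
    ) DEFAULT CHARSET=utf8
""")
cursor.close()
conn.close()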
4. main.py
# encoding: utf-8
'''
@author: feizi
@file: main.py
@Software: PyCharm
@desc:
'''
from scrapy import cmdline
name = "douban_movie_top250"
# cmd = "scrapy crawl {0} -o douban.csv".format(name)
cmd = "scrapy crawl {0}".format(name)
cmdline.execute(cmd.split())
5. settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for douan_movie_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'douan_movie_spider'
SPIDER_MODULES = ['douan_movie_spider.spiders']
NEWSPIDER_MODULE = 'douan_movie_spider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douan_movie_spider.middlewares.DouanMovieSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'douan_movie_spider.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'douan_movie_spider.pipelines.DouanMovieSpiderPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
One thing to note: to reduce the chance of the crawler getting banned, set a USER_AGENT.
Again press F12, look at the Request Headers of a request to the site, copy the User-Agent value and put it in settings.py as shown above. This is of course only the simplest measure; more robust strategies such as IP pools and User-Agent pools are beyond the scope of this post (a small User-Agent pool sketch follows below), so please look them up yourself.
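For instance, a User-Agent pool can be as simple as a downloader middleware that picks a random UA per request. This is only a minimal sketch (the class name, the UA list and its registration are my own choices, not part of the original project); it would go into the project's middlewares.py and be enabled via DOWNLOADER_MIDDLEWARES in settings.py:

# encoding: utf-8
import random


class RandomUserAgentMiddleware(object):
    # a tiny, illustrative pool; extend it with real browser UA strings as needed
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0',
    ]

    def process_request(self, request, spider):
        # overwrite the User-Agent header with a randomly chosen one
        request.headers['User-Agent'] = random.choice(self.user_agents)


# and in settings.py (hypothetical registration):
# DOWNLOADER_MIDDLEWARES = {
#     'douan_movie_spider.middlewares.RandomUserAgentMiddleware': 543,
# }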
4. Running the Spider
Run main.py and the crawl starts. (screenshot: crawler console output)
5. The Saved Results
(screenshot: rows saved in the db_movie table, viewed in Navicat)
6. Simple Data Visualization
Finally, here are a few simple visualizations of the scraped data; a sketch of how the first chart might be produced follows the list of charts.
6.1 Top 10 by rating
6.2 Title word cloud
6.3 Quote word cloud
6.4 Top 10 by comment count
6.5 Number of films released per year
6.6 Releases by region
6.7 Film genres summary
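As an example of how a chart like 6.1 could be produced, here is a minimal sketch that reads the scraped rows back out of MySQL and draws a horizontal bar chart. It assumes pandas and matplotlib are installed; none of this code is from the original project:

# encoding: utf-8
import MySQLdb
import pandas as pd
import matplotlib.pyplot as plt

conn = MySQLdb.connect(host='127.0.0.1', port=3306, user='root', passwd='123456', db='testdb', charset='utf8')
# pull the ten highest-rated films out of the db_movie table
df = pd.read_sql("SELECT title, score FROM db_movie ORDER BY score DESC LIMIT 10", conn)
conn.close()

# horizontal bar chart: title on the y axis, rating on the x axis
# note: rendering Chinese titles may require configuring a CJK-capable font in matplotlib
df = df.sort_values('score')
plt.barh(range(len(df)), df['score'].astype(float))
plt.yticks(range(len(df)), df['title'])
plt.xlabel('score')
plt.title('Douban Movie Top 250: rating top 10')
plt.tight_layout()
plt.show()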
The complete project code has been uploaded to GitHub: https://github.com/hu1991die/douan_movie_spider. Forks welcome!