1 Overview
This article does not cover installing Scrapy; plenty of installation guides can be found online. The focus here is on how to scrape information from a web page.
The target site is Douban Movies. The page https://movie.douban.com/top250/ lists the top 250 movies; the goal is to scrape this information and write it to a database.
2 Page Analysis
Inspecting the page source shows that each movie entry contains the fields we need: rank, title, link, poster image URL, rating score, number of ratings, and a one-line quote.
These fields can be extracted with Scrapy's xpath selectors:
Movie = selector.xpath('//div[@class="item"]')
for info in Movie:
    rank = info.xpath('div[@class="pic"]/em/text()').extract()[0]
    title = info.xpath('div[@class="pic"]/a/img/@alt').extract()[0]
    link = info.xpath('div[@class="pic"]/a/@href').extract()[0]
    pic = info.xpath('div[@class="pic"]/a/img/@src').extract()[0]
    star = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
    rate = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()[1]
    quote = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()').extract()[0]
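Before wiring these expressions into a spider, they can be checked interactively with scrapy shell. A quick sketch (the -s USER_AGENT override is only needed if Douban rejects the default Scrapy user agent):

scrapy shell -s USER_AGENT="Mozilla/5.0" "https://movie.douban.com/top250/"
>>> response.xpath('//div[@class="item"]/div[@class="pic"]/em/text()').extract()[:3]
>>> response.xpath('//div[@class="item"]/div[@class="pic"]/a/img/@alt').extract()[:3]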
3 Main Code
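The code below assumes a Scrapy project already exists; judging from the pipeline path used later (pachong.pipelines.DBPipeline), the project is named pachong. If starting from scratch, something like the following creates the skeleton (items.py, settings.py, pipelines.py and a spiders/ directory):

scrapy startproject pachong
cd pachong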
items.py defines the fields used to hold the scraped data as Scrapy items:
import scrapy

class Movie250Item(scrapy.Item):
    # define the fields for your item here like:
    rank = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    pic = scrapy.Field()
    star = scrapy.Field()
    rate = scrapy.Field()
    quote = scrapy.Field()
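A Scrapy Item behaves like a dictionary, except that only declared fields may be assigned; a small illustration with made-up values:

item = Movie250Item()
item['rank'] = '1'
item['title'] = 'some title'
print(item['rank'])     # -> '1'
item['foo'] = 'bar'     # raises KeyError: 'foo' is not declared in Movie250Item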
spiders/movies.py is the main spider file and does the actual crawling; each scraped field is stored in an item:
import scrapy
from scrapy.selector import Selector
from ..items import Movie250Item

class Movie250Spider(scrapy.Spider):
    name = "movie250"
    start_urls = ["https://movie.douban.com/top250/"]

    def parse(self, response):
        selector = Selector(response)
        Movie = selector.xpath('//div[@class="item"]')
        for info in Movie:
            # create a fresh item for each movie, otherwise every yield
            # would hand the pipeline the same (overwritten) object
            item = Movie250Item()
            rank = info.xpath('div[@class="pic"]/em/text()').extract()[0]
            title = info.xpath('div[@class="pic"]/a/img/@alt').extract()[0]
            link = info.xpath('div[@class="pic"]/a/@href').extract()[0]
            pic = info.xpath('div[@class="pic"]/a/img/@src').extract()[0]
            star = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            rate = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()[1]
            quote = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span[@class="inq"]/text()').extract()[0]
            item['rank'] = rank
            item['title'] = title
            item['link'] = link
            item['pic'] = pic
            item['star'] = star
            item['rate'] = rate
            item['quote'] = quote
            yield item
        # follow the "next page" link and parse it with the same callback
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, callback=self.parse)
The following snippet handles pagination: once the href of the "next" page has been extracted, parse is scheduled again as the callback, so the remaining pages are crawled recursively.
next_page = response.xpath('//span[@class="next"]/a/@href')
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, callback=self.parse)
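As a side note, newer Scrapy versions (1.4 and later) offer response.follow, which joins the relative href against the current URL itself; the tail of the loop could then be written roughly like this (an alternative sketch, not the original code):

next_page = response.xpath('//span[@class="next"]/a/@href').extract_first()
if next_page:
    # response.follow resolves the relative URL, so urljoin is not needed
    yield response.follow(next_page, callback=self.parse)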
pipelines.py handles persistence: it takes the items passed in from the spider and writes them to the database:
import MySQLdb
from ..myconfig import DbConfig

class DBPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user=DbConfig['user'],
                                    passwd=DbConfig['passwd'],
                                    db=DbConfig['db'],
                                    host=DbConfig['host'],
                                    charset='utf8',
                                    use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # `rank` is backquoted because it is a reserved word in newer MySQL versions
            self.cursor.execute(
                """INSERT IGNORE INTO movies (`rank`, title, link, pic, star, rate, quote)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                (
                    item['rank'],
                    item['title'],
                    item['link'],
                    item['pic'],
                    item['star'],
                    item['rate'],
                    item['quote']
                )
            )
            self.conn.commit()
        except MySQLdb.Error as e:
            print('Error %d: %s' % (e.args[0], e.args[1]))
        return item
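The pipeline assumes a movies table already exists in the spider database. The original article does not show its schema, so the script below is only one possible layout matching the columns used above (types, lengths and the primary key are assumptions); making `rank` the primary key is what gives INSERT IGNORE its deduplicating effect:

import MySQLdb
from myconfig import DbConfig   # adjust the import path to wherever myconfig.py lives

conn = MySQLdb.connect(user=DbConfig['user'], passwd=DbConfig['passwd'],
                       db=DbConfig['db'], host=DbConfig['host'], charset='utf8')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS movies (
        `rank` INT NOT NULL,
        title  VARCHAR(255) NOT NULL,
        link   VARCHAR(255),
        pic    VARCHAR(255),
        star   VARCHAR(16),
        rate   VARCHAR(64),
        quote  VARCHAR(512),
        PRIMARY KEY (`rank`)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()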
The DbConfig used above is imported from a separate file, myconfig.py:
DbConfig = {
    # db config
    'user': 'xxx',
    'passwd': '123456',
    'db': 'spider',
    'host': '127.0.0.1',
}
Next, register the pipeline in settings.py so Scrapy knows which pipeline the spider persists through. The trailing number (1 here) is the pipeline's priority: the smaller the number, the higher the priority.
ITEM_PIPELINES = {
    'pachong.pipelines.DBPipeline': 1,
}
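Two more settings are worth adding while editing settings.py: Douban may reject requests that carry Scrapy's default user agent, and throttling keeps the crawl polite. The values below are suggestions, not part of the original article:

# pretend to be a regular browser; Douban may return 403 for the default Scrapy UA
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
# pause between requests to keep the crawl polite
DOWNLOAD_DELAY = 1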
Running the spider
The spider is run with scrapy crawl <spider name>, where the spider name is the name attribute defined in spiders/movies.py:
scrapy crawl movie250
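To sanity-check the scraped fields without going through the database, Scrapy's built-in feed export can dump them to a file instead, for example:

scrapy crawl movie250 -o movies.json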
Partial screenshot of the scraped results (image omitted).