Scrapy in Practice: Scraping the Zimuku Subtitle Site
1. Create the Scrapy project
Create the project:
scrapy startproject zimuku
Create the spider:
cd zimuku
scrapy genspider zimu zimuku.cn
As shown in the screenshots:
(screenshot: snipaste_20181110_074005.png)
(screenshot: snipaste_20181110_074302.png)
We can see that the whole project skeleton and its template files have been generated. Let's go through them one by one:
zimu.py
# -*- coding: utf-8 -*-
import scrapy


class ZimuSpider(scrapy.Spider):
    name = 'zimu'
    allowed_domains = ['zimuku.cn']
    start_urls = ['http://zimuku.cn/']

    def parse(self, response):
        pass
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZimukuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZimukuPipeline(object):
    def process_item(self, item, spider):
        return item
These are the three most important files; I won't list the rest one by one. Next, we need to analyze the page.
2. Page analysis
Our goal is to save this content. Since Scrapy ships with its own selector tools, we'll use XPath to match the content.
The XPath expression for the content in the red box is: /html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()
(Note: browsers insert tbody nodes into the DOM they display even when the raw HTML has none, so a path copied from browser dev tools may not match what Scrapy downloads; if this expression returns nothing, try removing /tbody.)
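Before wiring the XPath into the spider, it helps to sanity-check the path shape. Here is a rough stdlib sketch using xml.etree.ElementTree (which supports only a subset of XPath, so the path is written relative to a fragment and text() is replaced by reading .text); the table fragment below is a hypothetical stand-in for the real page markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal fragment mirroring the subtitle table structure.
fragment = (
    "<div><table><tbody>"
    "<tr><td><a><b>Some.Movie.2018.srt</b></a></td></tr>"
    "<tr><td><a><b>Another.Movie.srt</b></a></td></tr>"
    "</tbody></table></div>"
)

root = ET.fromstring(fragment)
# Same path shape as the Scrapy XPath, relative to the <div> fragment;
# ElementTree has no text() step, so we read .text on the matched node.
node = root.find("./table/tbody/tr[1]/td[1]/a/b")
print(node.text)
```

The positional predicates (tr[1], td[1]) pick the first row and first cell, which is why the spider below extracts only one title per page.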
3. Writing the code
(1) First, edit items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZimukuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # field for the content we want to scrape
    text = scrapy.Field()
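A scrapy.Item behaves like a dict that only accepts its declared Field names. As a minimal stdlib sketch of that behavior (MiniItem is a hypothetical stand-in, not Scrapy's actual class):

```python
class MiniItem(dict):
    """Dict that only accepts keys declared in `fields`, like scrapy.Item."""
    fields = {"text"}  # mirrors the single field declared on ZimukuItem

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        super(MiniItem, self).__setitem__(key, value)


item = MiniItem()
item["text"] = "subtitle title"   # declared field: accepted
try:
    item["oops"] = 1              # undeclared field: rejected
except KeyError as exc:
    print("rejected:", exc)
```

This is why typos in field names fail loudly with Items but pass silently with plain dicts.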
(2) Next, edit zimu.py:
# -*- coding: utf-8 -*-
import scrapy
# import the item class defined in items.py
from zimuku.items import ZimukuItem


class ZimuSpider(scrapy.Spider):
    name = 'zimu'
    allowed_domains = ['zimuku.cn']
    start_urls = ['http://zimuku.cn/']

    def parse(self, response):
        '''
        :param response: the downloaded page to parse
        :return: yields items carrying the extracted text
        '''
        name = response.xpath(
            "/html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()"
        ).extract()
        item = ZimukuItem()
        item['text'] = name
        yield item
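Note that parse() is a generator: Scrapy iterates over it and hands each yielded item to the pipelines. The pattern can be sketched in plain Python (the titles below are placeholders, not real scraped data):

```python
def parse_like(titles):
    """Mimic the spider's parse(): yield one item dict per extracted title."""
    for title in titles:
        yield {"text": title}


# Scrapy performs this iteration itself, passing each item to the pipelines.
items = list(parse_like(["Sub.A.srt", "Sub.B.srt"]))
print(items)
```

Yielding instead of returning a list lets Scrapy process items one at a time as extraction proceeds.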
(3) Edit pipelines.py (this is what processes the scraped items):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZimukuPipeline(object):
    def process_item(self, item, spider):
        with open("F:\\python\\1.txt", 'a') as fp:
            fp.write(str(item['text']))
        print(item['text'])
        return item
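The pipeline above writes to a hard-coded Windows path. A portable, runnable stand-in with the same append-and-return shape (the temp-file path is illustrative only):

```python
import os
import tempfile


class FilePipeline(object):
    """Append each item's text to a file, then hand the item onward."""

    def __init__(self, path):
        self.path = path

    def process_item(self, item, spider=None):
        with open(self.path, "a", encoding="utf-8") as fp:
            fp.write(str(item["text"]) + "\n")
        return item  # always return the item so later pipelines still run


path = os.path.join(tempfile.gettempdir(), "zimuku_demo.txt")
if os.path.exists(path):
    os.remove(path)  # start fresh for this demo run

pipeline = FilePipeline(path)
pipeline.process_item({"text": "Sub.A.srt"})
with open(path, encoding="utf-8") as fp:
    content = fp.read()
print(content)
```

Returning the item from process_item matters: a pipeline that returns None silently drops the item for every pipeline after it.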
(4) Edit settings.py:
# tell Scrapy which pipeline handles the results
ITEM_PIPELINES = {'zimuku.pipelines.ZimukuPipeline': 300}
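The number 300 is a priority: Scrapy runs the enabled pipelines in ascending order of this value (conventionally 0-1000). A quick sketch of that ordering (the second pipeline name is hypothetical, added only to show the sort):

```python
ITEM_PIPELINES = {
    "zimuku.pipelines.ZimukuPipeline": 300,
    "zimuku.pipelines.SomeOtherPipeline": 100,  # hypothetical; lower value runs first
}

# Sort by priority to see the execution order Scrapy would use.
run_order = [name for name, prio in sorted(ITEM_PIPELINES.items(),
                                           key=lambda kv: kv[1])]
print(run_order)
```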
(5) Run it
# without log output
scrapy crawl zimu --nolog
# with log output
scrapy crawl zimu
I recommend keeping the log output; otherwise some errors never surface, and when something does go wrong you won't know where it came from.
(screenshot: snipaste_20181110_094710.png)
The run succeeded. Now let's check whether the file was saved correctly:
(screenshot: snipaste_20181110_094749.png)
OK! It finally works.
This Scrapy walkthrough was built step by step; it should be a solid reference both for me and for anyone just getting started.