4.1 The Scrapy framework
The "5+2" structure: five components plus two middleware layers
Components: ENGINE, SCHEDULER, SPIDERS, ITEM PIPELINE, DOWNLOADER
Middleware: SPIDER MIDDLEWARE, DOWNLOADER MIDDLEWARE
Parts the user writes or configures:
SPIDERS
ITEM PIPELINE
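Of the 5+2 parts, only these two normally contain custom code: Spiders generate requests and parse responses, while Item Pipelines post-process the scraped items. A minimal pipeline sketch (the class name DemoPipeline and the drop rule are illustrative assumptions, not from these notes), using Scrapy's documented process_item hook:

    # pipelines.py: a minimal, illustrative item pipeline
    from scrapy.exceptions import DropItem

    class DemoPipeline:
        def process_item(self, item, spider):
            # Called once for each item the spider yields.
            if not item:
                raise DropItem('empty item')  # discard unusable items
            return item  # hand the item to the next enabled pipeline

A pipeline only runs after it is registered in settings.py under the ITEM_PIPELINES setting.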
4.2 Common Scrapy commands
scrapy <command> [options] [args]
startproject: create a new project. Usage: scrapy startproject <name> [dir]
genspider: create a spider. Usage: scrapy genspider [options] <name> <domain>
settings: get the crawler's configuration values. Usage: scrapy settings [options]
crawl: run a spider. Usage: scrapy crawl <spider>
list: list all spiders in the project. Usage: scrapy list
shell: start the interactive URL-debugging shell. Usage: scrapy shell [url]
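A spider can also be launched from a plain Python script instead of the CLI; a minimal sketch using scrapy.crawler.CrawlerProcess (the import path python123demo.spiders.demo is an assumption based on the project layout in section 4.3):

    # run_demo.py: run a spider without the scrapy CLI (illustrative)
    from scrapy.crawler import CrawlerProcess
    from python123demo.spiders.demo import DemoSpider  # assumed module path

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(DemoSpider)  # schedule the spider class
    process.start()            # block until the crawl finishes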
4.3 A Scrapy example
cd pycodes
scrapy startproject python123demo
Generated files and directories:
scrapy.cfg: deployment configuration file for the Scrapy project
__init__.py
items.py: Items code template
middlewares.py: Middlewares code template
pipelines.py: Pipelines code template
settings.py: the Scrapy project's settings file
spiders/: directory holding the Spiders code templates
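items.py declares the structure of the scraped data; a minimal sketch (the DemoItem class and its two fields are illustrative assumptions, not from these notes):

    # items.py: declare the fields a scraped item may carry (illustrative)
    import scrapy

    class DemoItem(scrapy.Item):
        title = scrapy.Field()  # assumed field; one Field() per attribute
        url = scrapy.Field()    # assumed field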
The spider code, spiders/demo.py (simplified form):

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['http://python123.io/ws/demo.html']

        def parse(self, response):
            # Save the response body under the last segment of the URL.
            fname = response.url.split('/')[-1]
            with open(fname, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s.' % fname)
The equivalent full form implements start_requests() explicitly instead of using start_urls:

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = 'demo'

        def start_requests(self):
            # Equivalent to start_urls: yield one Request per seed URL.
            urls = [
                'http://python123.io/ws/demo.html',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            fname = response.url.split('/')[-1]
            with open(fname, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s.' % fname)
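In practice parse() usually extracts structured data with Scrapy's selectors rather than saving raw bytes; a hedged sketch of such a callback (the CSS selector and the yielded keys are assumptions, not taken from these notes):

    def parse(self, response):
        # Illustrative: pull the page title and yield it as an item dict.
        title = response.css('title::text').get()
        yield {'url': response.url, 'title': title}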
Generate the spider inside the project, then run it:
cd python123demo
scrapy genspider demo python123.io
scrapy crawl demo