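What is Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. With only a small amount of code, we can get a working crawler quickly.
Scrapy is built on Twisted (pronounced ['twistid']), an asynchronous networking framework, which helps speed up downloads.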
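To get a feel for why an asynchronous framework speeds up downloading, here is a minimal sketch. Note that it uses Python's standard asyncio library rather than Twisted, and fake_download is a hypothetical stand-in that only sleeps instead of making a real request. It illustrates the idea of overlapping waits, not how Scrapy is implemented.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    """Hypothetical stand-in for a network request: waits, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def download_all(urls):
    """Start every download at once and wait for all of them together."""
    return await asyncio.gather(*(fake_download(u) for u in urls))


def timed_download(urls):
    """Run download_all and return (pages, elapsed_seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(download_all(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(5)]
    pages, elapsed = timed_download(urls)
    # The five 0.1 s waits overlap, so the total stays well below 0.5 s.
    print(len(pages), f"{elapsed:.2f}s")
```

Run one after another, the same five waits would take about half a second. Overlapping them is where the speed-up comes from.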
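Installation guide
Supported Python versions
Scrapy requires Python 3.6+, either the CPython implementation (the default) or PyPy 7.2.0+ (see "Alternate Implementations" in the Scrapy documentation).
Scrapy is written in pure Python and depends on a few key Python packages (among others):
- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs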
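If you want to check your environment against the requirements above, something like the following sketch may help. check_environment and REQUIRED_MODULES are names I made up for illustration. The only assumption is the import names of the packages (pyOpenSSL is imported as OpenSSL). It uses nothing but the standard library, so it runs whether or not Scrapy is installed.

```python
import sys
from importlib.util import find_spec

# Import names of the packages listed above (pyOpenSSL is imported as "OpenSSL").
REQUIRED_MODULES = ["lxml", "parsel", "w3lib", "twisted", "cryptography", "OpenSSL"]


def check_environment(min_version=(3, 6)):
    """Return (python_ok, missing_modules) for the current interpreter."""
    python_ok = sys.version_info >= min_version
    # find_spec returns None when a top-level module cannot be found.
    missing = [name for name in REQUIRED_MODULES if find_spec(name) is None]
    return python_ok, missing


if __name__ == "__main__":
    python_ok, missing = check_environment()
    print("Python version OK:", python_ok)
    print("Missing packages:", missing or "none")
```

In practice pip install Scrapy pulls in these dependencies for you, so a non-empty list usually just means Scrapy has not been installed in the current environment yet.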
Installing Scrapy
pip install Scrapy
Creating a project
scrapy startproject <project name>
$ scrapy startproject tutorial
Output
# The Scrapy project is stored in D:\project\python\tutorial
New Scrapy project 'tutorial', using template directory 'd:\tool\templates\project', created in:
    D:\project\python\tutorial
# Tip: you can create a spider with scrapy genspider example example.com
You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
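Directory structure
tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module; you import your code from here
        __init__.py
        items.py          # project item definitions
        middlewares.py    # project middlewares, including your own custom ones
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # the spiders you create are stored in this directory
            __init__.py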
scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html
# Points to the project's settings module (settings.py)
[settings]
default = tutorial.settings
# Deployment settings, used when publishing the project to a server or to the local machine.
[deploy]
#url = http://localhost:6800/
project = tutorial
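scrapy.cfg is an ordinary INI file, so you can inspect it with Python's standard configparser. The sketch below is only an illustration: read_scrapy_cfg and SAMPLE_CFG are names I made up, and this is not necessarily how Scrapy itself loads the file.

```python
import configparser

# The same content as the scrapy.cfg shown above.
SAMPLE_CFG = """
[settings]
default = tutorial.settings

[deploy]
#url = http://localhost:6800/
project = tutorial
"""


def read_scrapy_cfg(text):
    """Parse scrapy.cfg text and return the values discussed above."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "settings_module": parser.get("settings", "default"),
        "deploy_project": parser.get("deploy", "project"),
        # The url line is commented out, so it falls back to None.
        "deploy_url": parser.get("deploy", "url", fallback=None),
    }


if __name__ == "__main__":
    print(read_scrapy_cfg(SAMPLE_CFG))
```

The value under [settings] is a Python module path, which is why it reads tutorial.settings rather than a file path.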
items.py
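Generating a spider
scrapy genspider <spider name> <crawl scope>
- Spider name: usually named after the site being crawled, e.g. jingdong, taobao, dangdang.
- Crawl scope: restricts the spider so that it does not wander off to other sites. It is usually a domain name, e.g. taobao.com.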
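To make the idea of a crawl scope concrete, here is a rough sketch of a domain check. is_in_scope is a hypothetical helper of mine that approximates the rule "the host is the allowed domain or one of its subdomains". It is not Scrapy's actual filtering code.

```python
from urllib.parse import urlparse


def is_in_scope(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    allowed = ["taobao.com"]
    for url in (
        "https://www.taobao.com/item",  # subdomain: in scope
        "https://taobao.com/",          # exact match: in scope
        "https://nottaobao.com/",       # different domain: out of scope
    ):
        print(url, is_in_scope(url, allowed))
```

Comparing against "." + domain rather than using a plain suffix check is what keeps look-alike hosts such as nottaobao.com out of scope.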
Enter the project directory
$ cd tutorial/
Generate the spider
$ scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
tutorial.spiders.quotes
This creates a Python file named quotes.py in the spiders/ directory, with the following content:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # spider name
    allowed_domains = ['quotes.toscrape.com']  # domains the spider is allowed to crawl
    # URL the spider starts from. Scrapy generates this by default; you will usually need to change it.
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
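Scraping the current page
You need to know XPath syntax.
Locating elements
Find the div that holds each quote. An XPath tool in Chrome can help with this.
In the end, each quote turns out to be stored in a div with class='quote'.
XPath for these nodes: //div[@class='col-md-8']/div[@class='quote']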
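If you want to try this kind of expression without running a spider, the sketch below applies it to a small hand-written fragment using the standard library's xml.etree.ElementTree. SAMPLE_PAGE and find_quote_texts are made up for illustration, and the fragment is only a simplified guess at the page's shape. ElementTree supports just a subset of XPath: the path has to start with .// instead of //, and there is no text() step, so the text is read from the .text attribute. Scrapy's response.xpath does not have these limits.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the real page (assumed structure).
SAMPLE_PAGE = """
<html><body>
  <div class="col-md-8">
    <div class="quote"><span class="text">First quote</span></div>
    <div class="quote"><span class="text">Second quote</span></div>
  </div>
  <div class="col-md-4">
    <div class="quote"><span class="text">Sidebar, not matched</span></div>
  </div>
</body></html>
"""


def find_quote_texts(page):
    """Return the text of every quote div directly inside the col-md-8 div."""
    root = ET.fromstring(page.strip())
    quotes = root.findall(".//div[@class='col-md-8']/div[@class='quote']")
    return [q.find("./span[@class='text']").text for q in quotes]


if __name__ == "__main__":
    print(find_quote_texts(SAMPLE_PAGE))
```

The div inside col-md-4 is skipped because the expression only accepts quote divs that are direct children of the col-md-8 div.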
The final code is shown below. The individual calls are not explained here (that is outside the scope of this post).
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote lives in its own div with class='quote'
        div_list = response.xpath("//div[@class='col-md-8']/div[@class='quote']")
        for div in div_list:
            item = {}
            # Quote text
            item["text"] = div.xpath("./span[@class='text']/text()").extract_first()
            # Author name shown after "by"
            item["by_text"] = div.xpath(".//small[@class='author']/text()").extract_first()
            # href of the <a> tag that follows the author name
            item["by_href"] = div.xpath("./span/a/@href").extract_first()
            # Collect all tags
            tags_list = div.xpath("./div[@class='tags']/a")
            tags_item_list = []
            for tags in tags_list:
                tags_item = {}
                tags_item["href"] = tags.xpath('./@href').extract_first()
                tags_item["text"] = tags.xpath('./text()').extract_first()
                tags_item_list.append(tags_item)
            # Attach the tag information to the item
            item["tags"] = tags_item_list
            print(item)
            # Print a row of dashes as a separator to keep the output readable
            print('-' * 20)
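The href values collected by the spider are relative paths such as /author/Albert-Einstein. A minimal sketch of turning them into absolute URLs with the standard library's urljoin follows. absolutize and BASE_URL are hypothetical names of mine. Inside a spider you could call response.urljoin(href) instead.

```python
from urllib.parse import urljoin

# Start URL of the spider above, used as the base for relative links.
BASE_URL = "http://quotes.toscrape.com/"


def absolutize(item, base_url=BASE_URL):
    """Return a copy of item with by_href and every tag href made absolute."""
    result = dict(item)
    result["by_href"] = urljoin(base_url, item["by_href"])
    result["tags"] = [
        {"href": urljoin(base_url, tag["href"]), "text": tag["text"]}
        for tag in item["tags"]
    ]
    return result


if __name__ == "__main__":
    sample = {
        "text": "quote text",
        "by_text": "Albert Einstein",
        "by_href": "/author/Albert-Einstein",
        "tags": [{"href": "/tag/change/page/1/", "text": "change"}],
    }
    print(absolutize(sample))
```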
Running the spider
$ scrapy crawl quotes
The scraped output looks like this:
$ scrapy crawl quotes
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/change/page/1/', 'text': 'change'}, {'href': '/tag/deep-thoughts/page/1/', 'text': 'deep-thoughts'}, {'href': '/tag/thinking/page/1/', 'text': 'thinking'}, {'href': '/tag/world/page/1/', 'text': 'world'}]}
--------------------
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'by_text': 'J.K. Rowling', 'by_href': '/author/J-K-Rowling', 'tags': [{'href': '/tag/abilities/page/1/', 'text': 'abilities'}, {'href': '/tag/choices/page/1/', 'text': 'choices'}]}
--------------------
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/live/page/1/', 'text': 'live'}, {'href': '/tag/miracle/page/1/', 'text': 'miracle'}, {'href': '/tag/miracles/page/1/', 'text': 'miracles'}]}
--------------------
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'by_text': 'Jane Austen', 'by_href': '/author/Jane-Austen', 'tags': [{'href': '/tag/aliteracy/page/1/', 'text': 'aliteracy'}, {'href': '/tag/books/page/1/', 'text': 'books'}, {'href': '/tag/classic/page/1/', 'text': 'classic'}, {'href': '/tag/humor/page/1/', 'text': 'humor'}]}
--------------------
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'by_text': 'Marilyn Monroe', 'by_href': '/author/Marilyn-Monroe', 'tags': [{'href': '/tag/be-yourself/page/1/', 'text': 'be-yourself'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}]}
--------------------
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'by_text': 'Albert Einstein', 'by_href': '/author/Albert-Einstein', 'tags': [{'href': '/tag/adulthood/page/1/', 'text': 'adulthood'}, {'href': '/tag/success/page/1/', 'text': 'success'}, {'href': '/tag/value/page/1/', 'text': 'value'}]}
--------------------
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'by_text': 'André Gide', 'by_href': '/author/Andre-Gide', 'tags': [{'href': '/tag/life/page/1/', 'text': 'life'}, {'href': '/tag/love/page/1/', 'text': 'love'}]}
--------------------
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'by_text': 'Thomas A. Edison', 'by_href': '/author/Thomas-A-Edison', 'tags': [{'href': '/tag/edison/page/1/', 'text': 'edison'}, {'href': '/tag/failure/page/1/', 'text': 'failure'}, {'href': '/tag/inspirational/page/1/', 'text': 'inspirational'}, {'href': '/tag/paraphrased/page/1/', 'text': 'paraphrased'}]}
--------------------
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'by_text': 'Eleanor Roosevelt', 'by_href': '/author/Eleanor-Roosevelt', 'tags': [{'href': '/tag/misattributed-eleanor-roosevelt/page/1/', 'text': 'misattributed-eleanor-roosevelt'}]}
--------------------
{'text': '“A day without sunshine is like, you know, night.”', 'by_text': 'Steve Martin', 'by_href': '/author/Steve-Martin', 'tags': [{'href': '/tag/humor/page/1/', 'text': 'humor'}, {'href': '/tag/obvious/page/1/', 'text': 'obvious'}, {'href': '/tag/simile/page/1/', 'text': 'simile'}]}
--------------------
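Conclusion
The example above scrapes the practice site used in the official Scrapy tutorial. This post is only an introductory summary based on my own experience with crawlers so far, and is meant for reference only. If you are a complete beginner, I recommend working through a systematic course on Bilibili and then writing up your own notes. I will cover other Scrapy topics in later posts.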