scrapy

Author: 星际探索者 | Published: 2020-04-14 15:53

1. Install Scrapy
pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com scrapy

import scrapy


class QuotesSpider(scrapy.Spider):

    # Name of the spider
    name = 'quotes'

    # URLs to crawl; Scrapy requests these first when the crawl starts
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # Default callback: Scrapy calls parse() once a request has completed
    def parse(self, response):

        # Loop over every div element with the class "quote"
        for quote in response.css('div.quote'):

            # Yield a dict holding the author and text of the quote
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
        # Check whether there is a next page
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            # Follow the next page, again with parse as the callback
            yield response.follow(next_page, self.parse)

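Run the spider (with the code above saved as quotes_spider.py):
scrapy runspider quotes_spider.py -o quotes.json
The output looks like this:

[Screenshot: output of the spider run]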

2. Create a project with Scrapy
scrapy startproject tutorial

The directory structure is as follows:

[Screenshot: project directory structure]

Create a file named quotes_spider.py under tutorial/tutorial/spiders:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # Scrapy calls start_requests() to obtain the initial requests. If the method
    # is not defined, the default implementation builds them from start_urls.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        # Take the second-to-last element of the split URL, i.e. the page number
        page = response.url.split("/")[-2]
        print('------------------------------')
        print(page)
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

# A more concise version: define start_urls and let Scrapy build the requests

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Run scrapy crawl quotes from the top-level tutorial directory, and the two HTML pages are downloaded:

[Screenshot: the two downloaded HTML files]
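To see why the two files end up named quotes-1.html and quotes-2.html, here is a minimal plain-Python sketch of the file-name logic used in parse. The helper name page_filename is an assumption introduced only for this illustration; it is not part of Scrapy.

```python
def page_filename(url):
    """Build the output file name from a page URL (hypothetical helper)."""
    # The URL ends with a slash, so split("/") ends with ['page', '1', ''];
    # the second-to-last element is therefore the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


for url in ('http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'):
    print(url, '->', page_filename(url))
```

If the URLs did not end with a slash, the index would have to be -1 instead, so this logic is tied to the URL shape of this particular site.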

3. We can also fetch data directly with scrapy shell; on Windows the URL must be wrapped in double quotes
scrapy shell "http://quotes.toscrape.com/page/1/"

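Running the command produces the following:

[Screenshot: scrapy shell start-up output]

We can then inspect the content of the response body in the local console:

[Screenshot: response content in the console]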

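Select the elements matching a tag (this returns a list of selectors):
response.css('title')

Get the text content of all matching elements:
response.css('title::text').getall()

Get all matching elements as HTML strings:
response.css('title').getall()

Get the text of the first matching element:
response.css('title::text').get()
response.css('title::text')[0].get()

Match data with a regular expression:
response.css('title::text').re(r'Quotes.*')

Besides CSS selectors, Scrapy also supports XPath for selecting data:

[Screenshot: XPath selection in the shell]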

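5. The spiders above saved the raw HTML pages; next we extract data from the pages into JSON

Example 2: crawl the pages and extract the data into a JSON file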

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Use Python's yield keyword to emit each extracted item
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

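6. The spider above crawls two pages, but we often need to crawl an entire site, which means following the pagination links
We again use http://quotes.toscrape.com/page/1/ for the analysis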

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

First we use scrapy shell to demonstrate extracting the next-page link

scrapy shell "http://quotes.toscrape.com/page/1/"
Get the a tag
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
Get the value of the href attribute; Scrapy offers two ways to do this
First way
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
Second way
>>> response.css('li.next a').attrib['href']
'/page/2/'

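Next we look at several ways to extract data recursively.
All of the cases below can be run from the project root and can produce either a JSON file or a JL (JSON Lines) file. The advantage of JL is that the file stays valid no matter how many times you run the spider, because each item is simply appended to the file as its own line.

[Screenshot: exporting to JSON and JL files]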
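The difference between the two formats can be illustrated with the standard json module. This is a plain-Python sketch, not Scrapy's exporter code; it assumes an export that appends to an existing file, as described above, and the two sample items are made up.

```python
import json

first_run = [{"author": "A", "text": "first"}]
second_run = [{"author": "B", "text": "second"}]

# JSON: each run writes one complete array. Appending a second array
# leaves two arrays back to back, which is not a valid JSON document.
json_content = json.dumps(first_run) + json.dumps(second_run)
try:
    json.loads(json_content)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON Lines: one object per line, so appended lines remain parseable.
jl_content = ""
for run in (first_run, second_run):
    for item in run:
        jl_content += json.dumps(item) + "\n"
jl_items = [json.loads(line) for line in jl_content.splitlines()]

print(json_is_valid)
print(len(jl_items))
```

With the spiders in this section, the corresponding commands would be scrapy crawl quotes -o quotes.json and scrapy crawl quotes -o quotes.jl.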

Case 1
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Look for the next-page link once per page, after the quotes loop
        next_page = response.css('li.next a::attr(href)').get()

        # When there is no next-page URL, no new request is generated
        if next_page is not None:
            # Turn the relative href into an absolute URL
            next_page = response.urljoin(next_page)
            # Create a new request for that URL, with parse as its callback
            yield scrapy.Request(next_page, callback=self.parse)
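What response.urljoin does with the relative href can be sketched with the standard library's urllib.parse.urljoin, which resolves a link against a base URL. The sketch assumes the base is simply the URL of the page being parsed.

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/page/1/'   # URL of the current response
next_page = '/page/2/'                            # href taken from li.next a

# Resolve the relative href against the page URL
absolute_url = urljoin(page_url, next_page)
print(absolute_url)  # http://quotes.toscrape.com/page/2/
```

scrapy.Request needs an absolute URL, which is why this step comes before the request is created in Case 1.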

"""
Case 2: use response.follow, which resolves relative URLs automatically, so urljoin is not needed
"""
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

"""
Case 3: loop directly over the href attributes of the link tags
"""

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):

        # Extract the data on the current page (the first page on the first call)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Loop over the href attributes of the pager links selected with CSS; if the
        # page has a next-page link, a new request is created with parse as its callback
        for href in response.css('ul.pager a::attr(href)'):
            yield response.follow(href, callback=self.parse)

"""
Case 4: a more concise way, looping directly over the a tags
"""
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):

        # Extract the data on the current page (the first page on the first call)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Pass the a element itself to follow(), which uses its href; each new request calls back parse
        for a in response.css('ul.pager a'):
            yield response.follow(a, callback=self.parse)

"""
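Case 5: use follow_all instead of follow. Unlike follow, follow_all can create several requests
at once, so there is no need to loop over the list of tags as in the cases above.
response.css('ul.pager a') returns a list, which is why those cases needed a for loop to reach
each a tag: follow builds only one request per call.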
"""

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):

        # Extract the data on the current page (the first page on the first call)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # follow_all builds one request per anchor; yield from passes them all on
        anchors = response.css('ul.pager a')
        yield from response.follow_all(anchors, callback=self.parse)
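The relationship between the for loop of the earlier cases and yield from with follow_all can be shown in plain Python. The functions follow and follow_all below are stand-ins written for this illustration (they return strings instead of requests) and are not Scrapy's implementation.

```python
def follow(url):
    """Stand-in for response.follow: builds a single 'request'."""
    return "Request(%s)" % url


def follow_all(urls):
    """Stand-in for response.follow_all: yields one 'request' per URL."""
    for url in urls:
        yield follow(url)


def parse_with_loop(urls):
    # Cases 3 and 4: loop over the links and yield one request at a time
    for url in urls:
        yield follow(url)


def parse_with_yield_from(urls):
    # Case 5: delegate to the generator that produces all the requests
    yield from follow_all(urls)


urls = ["/page/2/", "/page/3/"]
loop_result = list(parse_with_loop(urls))
yield_from_result = list(parse_with_yield_from(urls))
print(loop_result == yield_from_result)
```

Both generators produce the same sequence, so the choice between them is about brevity rather than behaviour.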

"""
Case 6: a shorter form of follow_all
"""
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):

        # Extract the data on the current page (the first page on the first call)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Shorter form: pass the CSS selector string straight to follow_all
        yield from response.follow_all(css='ul.pager a', callback=self.parse)

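All of the examples above extract their data from a single request. Often it takes several requests to get the data we want; for example, suppose we want the details of every author on every page of http://quotes.toscrape.com/.

The first request returns the list of all authors on the first page:

[Screenshot: author list on the first page]

The author's details only appear after following the author link:

[Screenshot: author detail page]
Let's look at how to collect this data.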

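First, a look at how the selector is used.
We want the URL of the a tag that follows the element with class="author". As the screenshot shows, the a tag has no class of its own and comes immediately after the author element.
First run scrapy shell "http://quotes.toscrape.com/"
Looking at the data field of the result, we can see that it holds exactly the element matched by the author selector.

[Screenshot: shell result for the author selector]
Next we use response.css('.author + a') to get the a tags.
The data field now holds the a tags we need, returned as a list, because one page contains several author-detail links.

[Screenshot: shell result for response.css('.author + a')]
Getting the next-page a tag from the home page returns a list with a single entry, because each page has only one link pointing to the next page.

[Screenshot: shell result for the next-page link]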

"""
Case 7
"""
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        
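        # Get all author-detail links on the page; the result is a list of selectors
        author_page_links = response.css('.author + a')
        """
        As mentioned above, follow_all can handle several requests at once. Note that the
        author-detail links are not sent back to parse; a separate callback, parse_author,
        is dedicated to parsing the author-detail pages.
        This parse function keeps collecting the listing pages (the "next" link of each page).
        Every listing page contains several author-detail links, which are handed to
        parse_author, so each follow_all call generates several author-detail requests.
        Internally they are delivered to parse_author one at a time: for every request that
        parse handles, parse_author runs several times, until the author details on all
        pages have been fetched.
        """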
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        print("URL:" + response.request.url)
        def extract_with_css(query):
            print("query:" + query)
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

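The name author in the command must match the name defined on the spider.
Run scrapy crawl author -o author.jl

The resulting data looks like this:

[Screenshot: contents of author.jl]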
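The request flow of Case 7 (listing pages handled by parse, author pages handled by parse_author) can be sketched as a small plain-Python simulation. Everything here is an assumption made for illustration: the SITE dictionary is an invented two-page site, crawl is a toy scheduler rather than Scrapy's engine, and requests are processed in simple first-in, first-out order, which is not necessarily the order Scrapy uses. The seen set imitates the way Scrapy filters duplicate requests by default.

```python
from collections import deque

# Invented miniature site: each listing page has author links and maybe a next page.
SITE = {
    "/page/1/": {"authors": ["/author/A", "/author/B"], "next": "/page/2/"},
    "/page/2/": {"authors": ["/author/B", "/author/C"], "next": None},
}


def parse(url):
    """Listing callback: yields (url, callback) pairs standing in for requests."""
    page = SITE[url]
    for link in page["authors"]:
        yield (link, parse_author)
    if page["next"] is not None:
        yield (page["next"], parse)


def parse_author(url):
    """Detail callback: yields one item per author page."""
    yield {"author_url": url}


def crawl(start_url):
    """Toy scheduler: runs callbacks, collects items, skips URLs already seen."""
    queue = deque([(start_url, parse)])
    seen = {start_url}
    items = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)
                continue
            next_url, next_callback = result
            if next_url not in seen:
                seen.add(next_url)
                queue.append((next_url, next_callback))
    return items


items = crawl("/page/1/")
print(items)
```

Author B appears on both listing pages but is collected only once, because the second request for the same URL is dropped.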