Scrapy Learning Notes

Author: 随喜公子 | Published: 2019-02-19 14:35


title: Scrapy Crawler Project Notes
date: 2019-02-20 14:14
tags:
- Scrapy
- Python
- Web crawler

Goal

Learn Scrapy from scratch, from setting up the environment to finishing a working crawler for an image site.

Development environment

VSCode
Python3
Scrapy

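Installation notes

Installing on Windows

Installing Scrapy with pip fails with a message that the MS build framework is missing.

Fix: install MS Build Tools.

The next error says that win32api is not installed.

Install pywin32 with pip: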

pip install pywin32

Install command

pip install scrapy

Upgrade command (no sudo on Windows)

pip install --upgrade scrapy

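Installing on macOS

macOS ships with Python 2.7, which should not be upgraded in place because system features depend on it.
Use Homebrew to install a newer Python alongside it.

Install the Xcode command line tools:

xcode-select --install

Install Homebrew from https://brew.sh/
Add Homebrew to the PATH environment variable: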

echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc
source ~/.bashrc

Install Python:

brew install python

If it is already installed, upgrade it:

brew update; brew upgrade python

Install Scrapy:

pip3 install scrapy

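Learning notes

Generating the Scrapy project skeleton

Scrapy has to run inside a fixed project structure. Generate it automatically, then modify it.

scrapy startproject project_name

Hello World code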

import scrapy


class QuotesSpider(scrapy.Spider):  # every spider subclasses scrapy.Spider and overrides its methods

    name = "quotes"    # unique spider name, used when running the crawl

    def start_requests(self):    # overridden: yields the initial requests
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):       # overridden: process each downloaded response here
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Run command:

scrapy crawl quotes
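
The file name logic in parse can be tried outside Scrapy. This is a minimal sketch using only the standard library; the helper name page_filename is made up for this illustration.

```python
def page_filename(url):
    """Mirror the file-name logic of parse() for a quotes.toscrape.com page URL."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with [..., 'page', '1', ''],
    # so index -2 is the page number (this relies on the trailing slash).
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```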

Going deeper

Example 1: extracting content

# Extract each quote together with its author and tags

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Commands for exporting to JSON or JL (JSON Lines):

scrapy crawl quotes -o quotes.json

scrapy crawl quotes -o quotes.jl
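
A JSON Lines file holds one JSON object per line, so it can be read back line by line. The sketch below uses only the standard library; parse_json_lines is a hypothetical helper and the sample records are invented, not real crawl output.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Invented sample in the shape the spider above yields (not real crawl output).
sample = (
    '{"text": "quote one", "author": "Author A", "tags": ["life"]}\n'
    '{"text": "quote two", "author": "Author B", "tags": []}\n'
)

for record in parse_json_lines(sample):
    print(record["author"], record["tags"])
```

For a real export, the same helper could be fed the contents of quotes.jl opened with encoding='utf-8'.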

Example 2: following the next-page link


import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)     # build the absolute URL
            yield scrapy.Request(next_page, callback=self.parse)  # parse the next page with the same callback

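The last two lines (inside the if) can be replaced with the following. response.follow accepts relative URLs, so urljoin is no longer needed.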

 yield response.follow(next_page, callback=self.parse)

A further simplification:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

Simpler still:
for an <a> element, response.follow uses its href attribute automatically.

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
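
What urljoin and response.follow save you from doing by hand is resolving a relative href against the URL of the current page. As far as I know, response.urljoin builds on urllib.parse.urljoin, so the resolution can be illustrated without Scrapy. A small sketch:

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, like the one in the "next" link on the quotes site.
print(urljoin(page_url, "/page/2/"))   # http://quotes.toscrape.com/page/2/

# An href that is already absolute is returned unchanged.
print(urljoin(page_url, "http://quotes.toscrape.com/tag/humor/"))
```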

Advanced example


import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Command-line argument example


import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)    # set from the -a command-line argument
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Command

scrapy crawl quotes -o quotes-humor.json -a tag=humor

Resulting start URL

http://quotes.toscrape.com/tag/humor
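
Scrapy passes each -a name=value pair to the spider's constructor, and the default constructor stores the values as attributes, which is why getattr(self, 'tag', None) works. The sketch below imitates that with a plain class; FakeSpider and start_url are stand-ins invented for this illustration, not Scrapy classes.

```python
class FakeSpider:
    """Stand-in for scrapy.Spider: keeps keyword arguments as attributes."""

    def __init__(self, **kwargs):
        # Assumption: this mirrors what the default Spider constructor does
        # with the values given via -a on the command line.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no -a tag=... was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(FakeSpider(tag="humor").start_url())  # http://quotes.toscrape.com/tag/humor
print(FakeSpider().start_url())             # http://quotes.toscrape.com/
```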

Item

A data structure that you define yourself.
The format is as follows:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

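Item pipeline

This is where item data is processed: every item yielded from parse is passed to process_item of each enabled pipeline.
The format is as follows: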

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Enable the pipelines in settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,   # the number sets the order: lower values run first
    'myproject.pipelines.JsonWriterPipeline': 800,
}
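
How the order values and DropItem interact can be shown in plain Python. This is only a sketch of the idea, not Scrapy's implementation: DropItem here is a local stand-in exception, and CollectPipeline and run_pipelines are names invented for the illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PricePipeline:
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get("price"):
            if item.get("price_excludes_vat"):
                item["price"] = item["price"] * self.vat_factor
            return item
        raise DropItem("Missing price in %s" % item)


class CollectPipeline:
    """Stand-in for a writer pipeline: remembers every item it receives."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through the pipelines in ascending order value."""
    for pipeline in sorted(pipelines, key=pipelines.get):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None  # a dropped item never reaches later pipelines
    return item


collector = CollectPipeline()
pipelines = {collector: 800, PricePipeline(): 300}  # same shape as ITEM_PIPELINES

kept = run_pipelines({"name": "a", "price": 100, "price_excludes_vat": True}, pipelines)
dropped = run_pipelines({"name": "b"}, pipelines)

print(kept)
print(dropped, len(collector.seen))
```

The item without a price is dropped by the pipeline with order 300, so the collector with order 800 only ever sees the first item.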

Example:

from mySpider.items import ItcastItem

def parse(self, response):
    # open("teacher.html", "wb").write(response.body)

    # list holding the teacher records
    #items = []

    for each in response.xpath("//div[@class='li_txt']"):
        # wrap the extracted data in an `ItcastItem` object
        item = ItcastItem()
        # extract() returns a list of unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()

        # each list here contains a single element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]

        #items.append(item)

        # hand the item over to the pipelines
        yield item

    # alternative: collect the items in a list and return it at the end
    #return items

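Fixing garbled Chinese output (UTF-8)

Python 3 strings are Unicode, but json.dumps escapes non-ASCII characters by default, so Chinese text shows up in the output file as \uXXXX sequences. Pass ensure_ascii=False and write the file as UTF-8.
The code: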

import json
import codecs
import os

class Pipeline(object):
    def __init__(self):
        self.file = codecs.open(
            'items.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.seek(-1, os.SEEK_END)
        self.file.truncate()
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
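
The effect of ensure_ascii is easy to check in isolation. A minimal sketch with a made-up item:

```python
import json

item = {"name": "爬虫"}

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # {"name": "爬虫"}

# Both strings decode back to the same data; only the on-disk form differs.
assert json.loads(escaped) == json.loads(readable) == item
```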

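How the ImagesPipeline functions run

ImagesPipeline starts.
get_media_requests issues all download requests for an item at once.
item_completed runs once, after all of those downloads have finished.

Downloading several images and renaming them

Do this by overriding file_path.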

    def get_media_requests(self, item, info):
        """
        :param item: the item returned by the spider
        :param info:
        :return:
        """
        # pass a string or a list of image URLs here; a single shared object is easily overwritten
        yield scrapy.Request(item['pic_url'], meta={'item': item['pic_name']})

    def file_path(self, request, response=None, info=None):
        """
        :param request: one image download request issued by the pipeline
        :param response:
        :param info:
        :return: relative path of the image, inside a folder per image set
        """
        item = request.meta['item']
        folder = item
        # strip() removes characters that are illegal in Windows folder names
        folder_strip = strip(folder)
        # img_path = "%s%s" % (self.img_store, folder_strip)
        filename = folder_strip + '/' + folder_strip + '.jpg'
        return filename


# module-level helper in the same file; needs `import re` at the top
def strip(path):
    """
    :param path: folder name to clean
    :return: the name with characters illegal in Windows folder names removed
    """
    path = re.sub(r'[?？\\*|"“<>:/]', '', str(path))
    return path
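
The cleaning step can be tested on its own. The sketch below is a standalone variant, assuming the intent is to drop every character Windows rejects in folder names; clean_folder_name and ILLEGAL_CHARS are names invented for this example.

```python
import re

# Characters that Windows does not allow in file or folder names: \ / : * ? " < > |
# The full-width ？ and “ are included as well, since they appear in the notes' pattern.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|？“]')


def clean_folder_name(name):
    """Return name with the characters Windows rejects removed."""
    return ILLEGAL_CHARS.sub('', str(name))


print(clean_folder_name('Set 1: "best" <2019>?'))  # Set 1 best 2019
print(clean_folder_name('a/b\\c|d*e'))             # abcde
```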

Passing arguments to a Request callback

 scrapy.Request(next_page, callback=self.parse_imgs, meta={'item': item, 'param': name})
 
 Read the value back in the callback:
 item = response.meta['item']

Deduplicating results

Requests are deduplicated by default (the Request argument dont_filter defaults to False).
To make a crawl persistent, run:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

You can then stop the spider safely at any time (press Ctrl-C or send a signal).
To resume it, run the same command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

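This way, after an interruption the spider picks up from the URLs where it left off.

To reduce the console output, add the -L WARNING option
when running the spider, for example:

scrapy crawl spider1 -L WARNING

Debug messages are no longer printed, so the run is easier to follow.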
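
Both command-line options correspond to Scrapy settings (LOG_LEVEL and JOBDIR), so they could also be placed in settings.py. A sketch of such a fragment, assuming the project runs a single spider; a job directory should not be shared between different spiders.

```python
# settings.py (fragment)
LOG_LEVEL = 'WARNING'            # same effect as passing -L WARNING on every run
JOBDIR = 'crawls/somespider-1'   # same effect as -s JOBDIR=...; one directory per spider
```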

scrapy-redis

Error log

pipeline is not a full path

settings.py must contain the full dotted path of the pipeline, for example:

pic.pipelines.PicImagesDownloadPipeline

Entering only PicImagesDownloadPipeline triggers this error.

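Symbol not found: _PyInt_AsLong error

Delete the PIL and Pillow libraries from the system Python directory, then install Pillow with pip3 so that it lands in the Python 3 installation.
System Python package directory:

/Library/Python/2.7/site-packages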

Missing scheme in request url: h

The URL field must be a list, so the fix is to wrap the URL in a list.
For example:
start_urls = ['someurls']
The same applies to image URLs: store them in the item as a list.
item['images_urls'] = ['image_url']
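
The stray h in the error message hints at the cause: when a plain string is used where a list is expected, iterating over it yields single characters, so the first "URL" is just h. A plain-Python illustration (the URL is a placeholder):

```python
url = "http://example.com/a.jpg"  # placeholder URL

# Wrong: iterating over a string walks through its characters.
wrong = [u for u in url]
print(wrong[0])   # h

# Right: wrap the string in a list so iteration yields whole URLs.
right = [u for u in [url]]
print(right[0])   # http://example.com/a.jpg
```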

Request url must be str or unicode

The url argument of a Request must be a single string, not a list.

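Renaming multiple images in item_completed does not work

item_completed is not triggered right after each image requested in get_media_requests has been downloaded. It waits until all images of the item have finished and then fires once, so it cannot be used to rename the images one by one, and the fields of earlier items are not available there. To rename files, override file_path.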
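
For reference, the results argument of item_completed is a list of two-element tuples, one per requested image: a success flag and, on success, a dict describing the stored file. The sketch below uses invented sample data; url, path, and checksum are the keys I know ImagesPipeline reports (newer versions may add more), and the failure entry is simplified to a plain exception object.

```python
# Invented sample of what item_completed might receive for three images.
results = [
    (True, {"url": "http://example.com/1.jpg", "path": "set-a/1.jpg", "checksum": "abc"}),
    (False, Exception("download failed")),  # stand-in for the failure object
    (True, {"url": "http://example.com/3.jpg", "path": "set-a/3.jpg", "checksum": "def"}),
]

# Keep the stored path of every successful download.
image_paths = [info["path"] for ok, info in results if ok]
print(image_paths)  # ['set-a/1.jpg', 'set-a/3.jpg']
```

By the time this list exists, the files have already been written under the names chosen by file_path, which is why renaming belongs there.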

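Be careful with callback arguments in get_media_requests

Callback arguments can be placed in meta. Be very careful when passing an object: if the object changes later, every callback that received it sees the changed value. Passing a string carries no such risk.

    def get_media_requests(self, item, info):
        """
        :param item: the item returned by the spider
        :param info:
        :return:
        """
        yield scrapy.Request(item['pic_url'], meta={'item': item['pic_name']})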
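
The hazard is ordinary Python aliasing and can be reproduced without Scrapy. In this sketch a plain dict stands in for the item, and the three meta dicts are made up to contrast passing a reference, a string, and a copy.

```python
import copy

item = {"pic_name": "set-a"}  # a plain dict standing in for a Scrapy item

shared_meta = {"item": item}                    # passes a reference
string_meta = {"item": item["pic_name"]}        # passes an immutable string
copied_meta = {"item": copy.deepcopy(item)}     # passes an independent copy

# The spider moves on and reuses the same object for the next image set.
item["pic_name"] = "set-b"

print(shared_meta["item"]["pic_name"])  # set-b  (changed behind our back)
print(string_meta["item"])              # set-a
print(copied_meta["item"]["pic_name"])  # set-a
```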

Final code

piczz.py

import scrapy
from piczz.items import PiczzItem


class piczzSpider(scrapy.Spider):
    name = "piczz"
    allowed_domains = [""]  # left empty in these notes
    start_urls = [""]       # left empty in these notes

    def parse(self, response):

        for each in response.xpath(
                "//div[@class = 'post_box']"):
            item = PiczzItem()
            item['name'] = 'startpage'

            # extract() returns a list of unicode strings
            item['pic_name'] = each.xpath(
                "descendant::div[@class = 'tit']/h2[@class = 'h1']/a/text()").extract()[0]
            item['pic_url'] = each.xpath(
                "descendant::div[@class = 'tit']/h2[@class = 'h1']/a/@href").extract()[0]

            yield scrapy.Request(item['pic_url'],
                                 callback=self.parse_imgs, meta={'item': item})

        # follow the "next page" link of the listing, if there is one
        next_path = response.xpath(
            "descendant::div[@class = 'page_num']/a[last()]")
        next_con = next_path.xpath("text()").extract_first(default="").strip()
        if next_con == "下一頁 »":
            next_page = next_path.xpath("@href").extract_first()
            if next_page is not None:
                print(next_page)
                yield scrapy.Request(next_page, self.parse)

    # collect the images on one page of an image set
    def parse_imgs(self, response):
        item = response.meta['item']
        # a fresh list per response, so callbacks do not share one mutable list
        item['pic_paths'] = response.xpath(
            "descendant::div[@class = 'entry-content']/p/img/@src").extract()
        next_path = response.xpath(
            "descendant::div[@class = 'wp-pagenavi']/p/a[last()]")
        next_con = next_path.xpath("text()").extract_first(default="").strip()
        if next_con == "下一页":
            next_page = next_path.xpath("@href").extract_first()
            if next_page is not None:
                yield scrapy.Request(next_page, callback=self.parse_imgs, meta={'item': item})
        yield item

items.py

import scrapy

class PiczzItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    pic_name = scrapy.Field()  # folder name for the image set
    pic_url = scrapy.Field()  # URL of the first page of the image set
    pic_paths = scrapy.Field()  # list of image download URLs

pipelines.py

import re

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class PiczzImagesDownloadPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        """
        :param item: the item returned by the spider
        :param info:
        :return: one download request per image URL
        """
        for img_url in item['pic_paths']:
            # pass the folder name as a plain string, not the mutable item
            yield scrapy.Request(img_url, meta={'item': item['pic_name']})

    def file_path(self, request, response=None, info=None, *, item=None):
        """
        :param request: one image download request issued by the pipeline
        :param response:
        :param info:
        :param item: accepted because newer Scrapy versions pass it; unused here
        :return: relative path of the image, inside a folder per image set
        """
        folder = request.meta['item']
        # strip() removes characters that are illegal in Windows folder names
        folder_strip = strip(folder)
        image_guid = request.url.split('/')[-1]
        filename = folder_strip + '/' + image_guid + '.jpg'
        return filename

    def item_completed(self, results, item, info):
        # results holds one (success, info) tuple per requested image
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item


def strip(path):
    """
    :param path: folder name to clean
    :return: the name with characters illegal in Windows folder names removed
    """
    path = re.sub(r'[?？\\*|"“<>:/]', '', str(path))
    return path

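Summary

From setting up the environment onward, the on-and-off study took about five days at roughly two hours a day, and the goal I set was finally met.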

