Python - Installing and Deploying Scrapy

Author: coderfl | Published 2020-04-09 15:12


Deploying the project

Scrapy must be installed before the framework is used for the first time:
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

Create a Scrapy project:
scrapy startproject demo
cd demo
scrapy genspider index detail.1688.com
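If pip installs dependencies too slowly:
add the argument -i https://pypi.tuna.tsinghua.edu.cn/simple when running pip,
so that packages are downloaded from the Tsinghua mirror, which typically reaches more than 10 MB/s: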
pip install XXX -i https://pypi.tuna.tsinghua.edu.cn/simple
If exported Chinese text is garbled, set the following in settings.py:
FEED_EXPORT_ENCODING = 'utf-8-sig'
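
The `utf-8-sig` codec prepends a byte order mark (BOM) to the output, which spreadsheet programs such as Excel commonly rely on to recognise a file as UTF-8. A minimal standard-library sketch of what the codec does (the sample string is an illustrative assumption, not data from the project):

```python
import codecs

sample = "二手车"  # hypothetical scraped value

plain = sample.encode("utf-8")
with_bom = sample.encode("utf-8-sig")

# utf-8-sig output is the UTF-8 BOM followed by the ordinary UTF-8 bytes.
assert with_bom == codecs.BOM_UTF8 + plain

# Decoding with utf-8-sig strips the BOM again.
decoded = with_bom.decode("utf-8-sig")
print(decoded == sample)
```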
In IDEA, select the Python interpreter for the project.


Downloading images

Install the dependency:
pip install pillow -i https://pypi.tuna.tsinghua.edu.cn/simple
settings.py

import os

ITEM_PIPELINES = {
    'demo.pipelines.ImagesPipelinse': 1,  # "demo" is the name of the directory that contains pipelines.py
}
IMAGES_STORE = os.path.join(os.getcwd(), 'images')

COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'cookie': 'cna=Z/0SF6rBRAMCAXGPt8YH2PO...'
}

pipelines.py

import os
import shutil
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings

class ImagesPipelinse(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        for image_url in item['image']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Paths of the successfully downloaded images, relative to IMAGES_STORE (e.g. "full/<hash>.jpg")
        image_paths = [x['path'] for ok, x in results if ok]
        img_dir = os.path.join(self.IMAGES_STORE, item['tit'])
        # Create the target directory if it does not exist
        os.makedirs(img_dir, exist_ok=True)
        # Move each image from the default location to the target directory;
        # the item's image field must be a list of image URLs
        for path in image_paths:
            shutil.move(os.path.join(self.IMAGES_STORE, path), os.path.join(img_dir, os.path.basename(path)))
        # item['image_Path'] = os.path.join(img_dir, os.path.basename(path))
        return item
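
The move in `item_completed` relies on `ImagesPipeline` reporting each stored file as a path relative to `IMAGES_STORE`, of the form `full/<hash>.jpg`. The following standard-library sketch reproduces only that relocation step outside Scrapy. The `relocate` helper, the temporary directory, and the sample file name are illustrative assumptions.

```python
import os
import shutil
import tempfile


def relocate(store, relative_paths, title):
    """Move files from store/<relative path> into store/<title>/ (hypothetical helper)."""
    target_dir = os.path.join(store, title)
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for rel in relative_paths:
        # Keep only the file name, dropping the leading "full/" directory.
        dest = os.path.join(target_dir, os.path.basename(rel))
        shutil.move(os.path.join(store, rel), dest)
        moved.append(dest)
    return moved


# Simulate the layout ImagesPipeline leaves behind: <store>/full/<name>.jpg
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "full"))
with open(os.path.join(store, "full", "abc123.jpg"), "wb") as f:
    f.write(b"fake image bytes")

moved = relocate(store, ["full/abc123.jpg"], "some-title")
print(moved)
```

Using `os.path.join` and `os.path.basename` instead of hard-coded backslashes keeps the same logic working on Windows, Linux, and macOS.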

items.py

import scrapy


class DemoItem(scrapy.Item):
    image = scrapy.Field()
    tit = scrapy.Field()

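The 302 redirect problem

Sending cookies with Scrapy resolves the 302 redirect problem.

Some sites protect against crawlers by checking the cookie that holds the login information. In that case, giving Scrapy the cookie of a logged-in session lets the crawl continue.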

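1. Log in to Taobao with your account.
2. Paste the URL to be scraped directly into the browser and check that the page displays normally (without being redirected).
3. If step 2 works, press F12 and look up the cookie.
4. Set the cookie in DEFAULT_REQUEST_HEADERS in the spider's settings.py and start scraping.
5. If a slider CAPTCHA appears after only a few pages, complete it and then repeat steps 2-4.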

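The COOKIES_ENABLED line in settings.py (generated as #COOKIES_ENABLED = False) controls which cookie is used:
Commented out: Scrapy's built-in cookie handling is used.
Uncommented and set to False: the cookie in DEFAULT_REQUEST_HEADERS in settings.py is used.
Uncommented and set to True: custom cookies are used, for example those passed through the cookies argument of a request.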
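
When `COOKIES_ENABLED` is `True`, cookies are usually supplied per request as a dict through the `cookies` argument of `scrapy.Request` rather than as a raw header. A minimal sketch, using only the standard library, of converting a cookie string copied from the browser's developer tools into such a dict. The helper name and the sample cookie values are made up for illustration.

```python
def cookie_header_to_dict(header):
    """Split a raw 'k1=v1; k2=v2' Cookie header into a dict (hypothetical helper)."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because values may contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


raw = "cna=EXAMPLE; token=abc=123"  # made-up values
cookies = cookie_header_to_dict(raw)
print(cookies)
# In a spider this dict could be passed as scrapy.Request(url, cookies=cookies).
```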


Scraping JavaScript-rendered pages (Vue SPA, JD.com)

The following describes how to scrape JavaScript-rendered pages with scrapy-splash:

Install the scrapy-splash library with pip:
$ pip install scrapy-splash
scrapy-splash uses the Splash HTTP API, so a running Splash instance is required. Splash is usually run in Docker, so Docker must be installed.
Install Docker and start it.
Pull the image:
$ docker pull scrapinghub/splash
Run scrapinghub/splash with Docker:
$ docker run -p 8050:8050 scrapinghub/splash
Configure the Splash service (all of the following goes in settings.py):

     1) Add the Splash server address:
          SPLASH_URL = 'http://localhost:8050'
     2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
          DOWNLOADER_MIDDLEWARES = {
              'scrapy_splash.SplashCookiesMiddleware': 723,
              'scrapy_splash.SplashMiddleware': 725,
              'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
          }
     3) Enable SplashDeduplicateArgsMiddleware:
          SPIDER_MIDDLEWARES = {
              'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
          }
     4) Set a custom DUPEFILTER_CLASS:
          DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
     5) If the Scrapy HTTP cache is used, set a custom cache storage backend:
          HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Example
Fetching the HTML content:

import scrapy
from scrapy_splash import SplashRequest

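class MySpider(scrapy.Spider):
    name = "myspider"  # hypothetical name; Scrapy requires every spider to define one
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # Minimal body so the method is valid Python: yield the page title.
        yield {'title': response.css('title::text').get()}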
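
As the comment in `parse` notes, the response comes from Splash's `render.html` endpoint. To check that the Splash container is reachable before running the spider, the same endpoint can be queried directly over HTTP. The sketch below only builds the request URL, assuming the `SPLASH_URL` value from settings.py; the helper name is hypothetical.

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # same value as in settings.py


def render_html_url(page_url, wait=0.5):
    """Build a GET URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


url = render_html_url("http://example.com")
print(url)
```

If Splash is running, opening the printed URL in a browser or fetching it with curl should return the rendered HTML of the target page.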