Scraping Amazon Product Information with scrapy + xpath

Author: 小董不太懂 | Published 2019-07-22 15:00

    A small practice project: I have only just started with XPath and Scrapy, and I picked up some new things along the way. Comments and discussion are welcome.

    • Create the project
    • Check the status returned in response.text (a minimal sanity-check sketch follows the settings file below)
    • Tweak the settings:
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for amazon project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'amazon'
    
    SPIDER_MODULES = ['amazon.spiders']
    NEWSPIDER_MODULE = 'amazon.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    }
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonDownloaderMiddleware': 543,
    #}
    # LOG_LEVEL = 'WARN'
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'amazon.pipelines.AmazonPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
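
    For a quick sanity check of step 2 above, a minimal throwaway spider can log the response status before any real parsing; this is only a sketch, and the spider name and start URL here are my own placeholders:

    import scrapy

    class StatusCheckSpider(scrapy.Spider):
        name = 'status_check'  # placeholder name, not part of the original project
        start_urls = ['https://www.amazon.cn/']

        def parse(self, response):
            # A 200 status with a non-empty body suggests the headers above get through
            self.logger.info('status=%s body_bytes=%d',
                             response.status, len(response.body))

    Note that the commented-out DOWNLOAD_DELAY setting above is Scrapy's built-in way to pace requests; the spider below uses time.sleep() instead.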
    

    Then comes the spider itself.
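
    A convenient way to prototype its XPath selectors is Scrapy's interactive shell. A rough sketch, assuming scrapy shell has been started against the search-results URL (which predefines response):

    # Inside `scrapy shell "<search results URL>"`, response is already defined:
    response.xpath('//div[@class="sg-col-inner"]/div/h2/'
                   'a[@class="a-link-normal a-text-normal"]/span/text()').extract()
    # The pagination link comes back as something like '/s?k=...&page=2':
    response.xpath('//ul[@class="a-pagination"]/li[@class="a-last"]/a/@href').extract_first()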

    Inspecting the "next page" link shows that its href is a relative address. How can it be turned into a complete URL?
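
    The answer used below is urllib.parse.urljoin, which resolves a relative href against a base URL. A quick illustration (the relative path here is a made-up example):

    from urllib.parse import urljoin

    # urljoin keeps the scheme and host from the base and swaps in the new path
    print(urljoin('https://www.amazon.cn', '/s?k=phone&page=2'))
    # -> https://www.amazon.cn/s?k=phone&page=2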

    The main code is given below:
    # -*- coding: utf-8 -*-
    import time
    from urllib import parse

    import scrapy
    from lxml import etree
    from scrapy import Request


    class MobileSpider(scrapy.Spider):
        name = 'mobile'
        allowed_domains = ['amazon.cn']
        # Full search-results URL for the keyword 手机 (mobile phone); the bare
        # domain amazon.cn returns an empty list (see notes below)
        start_urls = ['https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_1']

        def parse(self, response):
            print(response.url)
            # Crude rate limiting: without a pause, Amazon blocks after ~4 pages
            time.sleep(2)
            html = etree.HTML(response.body)
            image = html.xpath('//div[@class="sg-col-inner"]'
                               '/div[@class="a-section a-spacing-none"]/span/a/div/img/@src')
            title = html.xpath('//div[@class="sg-col-inner"]/div/h2/'
                               'a[@class="a-link-normal a-text-normal"]/span/text()')
            price = html.xpath('//div[@class="sg-col-inner"]//div[@class="a-row"]'
                               '//span[@class="a-price"]/span[@class="a-offscreen"]/text()')
            # zip() pairs the three parallel lists into one record per product
            for image_url, product_title, product_price in zip(image, title, price):
                yield {
                    'image': image_url,
                    'title': product_title,
                    'price': product_price
                }

            # The "next page" href is relative; join it against the site root
            url_1 = 'https://www.amazon.cn'
            url_2 = response.xpath('//div[@class="a-text-center"]/ul[@class="a-pagination"]'
                                   '/li[@class="a-last"]/a/@href').extract_first()
            if url_2:  # None on the last page, so stop instead of crashing
                next_url = parse.urljoin(url_1, url_2)
                yield Request(next_url)
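
    The spider can then be run with scrapy crawl mobile -o products.json to dump the yielded dicts to a JSON file (the output filename here is my own choice).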
    
    A few things worth noting:
    • When the project was first generated, the start_urls line was just amazon.cn, and the spider kept returning an empty list; it finally turned out to be a URL problem, and the full search URL is required.
    • parse.urljoin() is a nice way to join URLs; I knew it once but forgot it after not using it for a long time.
    • The zip() function is also worth committing to memory (see the sketch at the end of this post).
    • Amazon's anti-scraping measures are quite annoying: without the time.sleep() call, only the first four pages can be fetched.
      This is just a small practice project, so I did not put much thought into anti-scraping; getting through was good enough.
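
    As a standalone illustration of the zip() point above, here is a tiny sketch with made-up values showing how the three parallel XPath result lists become one dict per product:

    # Made-up stand-ins for the image/title/price lists returned by xpath()
    images = ['a.jpg', 'b.jpg', 'c.jpg']
    titles = ['Phone A', 'Phone B', 'Phone C']
    prices = ['¥999', '¥1299']

    for image, title, price in zip(images, titles, prices):
        print({'image': image, 'title': title, 'price': price})
    # Note: zip() stops at the shortest list, so 'Phone C' is silently dropped
    # because its price is missing; the same applies inside parse() above.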
