Scraping Amazon Product Information with scrapy + xpath

Author: 小董不太懂 | Published 2019-07-22 15:00

    A small practice project: I have only just started with XPath and Scrapy, and I picked up some new things along the way. Comments and discussion are welcome.

    • Create the project
    • Check the status returned in response.text (a minimal sanity-check sketch follows the settings file below)
    • Tweak the settings:
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for amazon project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'amazon'
    
    SPIDER_MODULES = ['amazon.spiders']
    NEWSPIDER_MODULE = 'amazon.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    }
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonDownloaderMiddleware': 543,
    #}
    # LOG_LEVEL = 'WARN'
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'amazon.pipelines.AmazonPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
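
    For a quick sanity check of step 2 above, a minimal throwaway spider can log the response status before any real parsing; this is only a sketch, and the spider name and start URL here are my own placeholders:

    import scrapy

    class StatusCheckSpider(scrapy.Spider):
        name = 'status_check'  # placeholder name, not part of the original project
        start_urls = ['https://www.amazon.cn/']

        def parse(self, response):
            # A 200 status with a non-empty body suggests the headers above get through
            self.logger.info('status=%s body_bytes=%d',
                             response.status, len(response.body))

    Note that the commented-out DOWNLOAD_DELAY setting above is Scrapy's built-in way to pace requests; the spider below uses time.sleep() instead.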
    

    Then comes the spider itself.
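
    A convenient way to prototype its XPath selectors is Scrapy's interactive shell. A rough sketch, assuming scrapy shell has been started against the search-results URL (which predefines response):

    # Inside `scrapy shell "<search results URL>"`, response is already defined:
    response.xpath('//div[@class="sg-col-inner"]/div/h2/'
                   'a[@class="a-link-normal a-text-normal"]/span/text()').extract()
    # The pagination link comes back as something like '/s?k=...&page=2':
    response.xpath('//ul[@class="a-pagination"]/li[@class="a-last"]/a/@href').extract_first()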

    Inspecting the "next page" link shows that its href is a relative address. How can it be turned into a complete URL?
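
    The answer used below is urllib.parse.urljoin, which resolves a relative href against a base URL. A quick illustration (the relative path here is a made-up example):

    from urllib.parse import urljoin

    # urljoin keeps the scheme and host from the base and swaps in the new path
    print(urljoin('https://www.amazon.cn', '/s?k=phone&page=2'))
    # -> https://www.amazon.cn/s?k=phone&page=2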

    The main code is given below:
    # -*- coding: utf-8 -*-
    import time
    from urllib import parse

    import scrapy
    from lxml import etree
    from scrapy import Request


    class MobileSpider(scrapy.Spider):
        name = 'mobile'
        allowed_domains = ['amazon.cn']
        # Full search-results URL for the keyword 手机 (mobile phone); the bare
        # domain amazon.cn returns an empty list (see notes below)
        start_urls = ['https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_1']

        def parse(self, response):
            print(response.url)
            # Crude rate limiting: without a pause, Amazon blocks after ~4 pages
            time.sleep(2)
            html = etree.HTML(response.body)
            image = html.xpath('//div[@class="sg-col-inner"]'
                               '/div[@class="a-section a-spacing-none"]/span/a/div/img/@src')
            title = html.xpath('//div[@class="sg-col-inner"]/div/h2/'
                               'a[@class="a-link-normal a-text-normal"]/span/text()')
            price = html.xpath('//div[@class="sg-col-inner"]//div[@class="a-row"]'
                               '//span[@class="a-price"]/span[@class="a-offscreen"]/text()')
            # zip() pairs the three parallel lists into one record per product
            for image_url, product_title, product_price in zip(image, title, price):
                yield {
                    'image': image_url,
                    'title': product_title,
                    'price': product_price
                }

            # The "next page" href is relative; join it against the site root
            url_1 = 'https://www.amazon.cn'
            url_2 = response.xpath('//div[@class="a-text-center"]/ul[@class="a-pagination"]'
                                   '/li[@class="a-last"]/a/@href').extract_first()
            if url_2:  # None on the last page, so stop instead of crashing
                next_url = parse.urljoin(url_1, url_2)
                yield Request(next_url)
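
    The spider can then be run with scrapy crawl mobile -o products.json to dump the yielded dicts to a JSON file (the output filename here is my own choice).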
    
    A few things worth noting:
    • When the project was first generated, the start_urls line was just amazon.cn, and the spider kept returning an empty list; it finally turned out to be a URL problem, and the full search URL is required.
    • parse.urljoin() is a nice way to join URLs; I knew it once but forgot it after not using it for a long time.
    • The zip() function is also worth committing to memory (see the sketch at the end of this post).
    • Amazon's anti-scraping measures are quite annoying: without the time.sleep() call, only the first four pages can be fetched.
      This is just a small practice project, so I did not put much thought into anti-scraping; getting through was good enough.
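
    As a standalone illustration of the zip() point above, here is a tiny sketch with made-up values showing how the three parallel XPath result lists become one dict per product:

    # Made-up stand-ins for the image/title/price lists returned by xpath()
    images = ['a.jpg', 'b.jpg', 'c.jpg']
    titles = ['Phone A', 'Phone B', 'Phone C']
    prices = ['¥999', '¥1299']

    for image, title, price in zip(images, titles, prices):
        print({'image': image, 'title': title, 'price': price})
    # Note: zip() stops at the shortest list, so 'Phone C' is silently dropped
    # because its price is missing; the same applies inside parse() above.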
