Python Web Crawler Notes (3)

Author: 肖一二三四 | Published 2018-01-24 13:40

    一、Commonly Used Scrapy Commands

    Command        Description                                      Format
    startproject   create a new project                             scrapy startproject <name> [dir]
    genspider      generate a new spider                            scrapy genspider [options] <name> <domain>
    settings       get the project's settings values                scrapy settings [options]
    crawl          run a spider                                     scrapy crawl <spider>
    list           list all spiders in the project                  scrapy list
    shell          start the interactive shell for URL debugging    scrapy shell [url]

    二、Using Scrapy

    1. scrapy startproject demo: create a new project named demo
    2. scrapy genspider spiderdemo www.baidu.com: generate a spider for baidu (the generated skeleton is sketched below)
    3. scrapy crawl spiderdemo: run the spider
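
    The genspider step writes a spider module under the project's spiders/ directory. A minimal sketch of what that skeleton typically looks like (the exact template varies by Scrapy version):

    # spiders/spiderdemo.py -- approximate skeleton produced by `scrapy genspider spiderdemo www.baidu.com`
    import scrapy

    class SpiderdemoSpider(scrapy.Spider):
        name = 'spiderdemo'                      # the name used by `scrapy crawl spiderdemo`
        allowed_domains = ['www.baidu.com']      # requests to other domains are filtered out
        start_urls = ['http://www.baidu.com/']   # responses to these URLs are passed to parse()

        def parse(self, response):
            pass                                 # extraction logic goes here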

    三、The Request Class

    Attribute/Method  Description
    .url              the URL of this Request
    .method           the HTTP method, e.g. 'GET' or 'POST'
    .headers          dictionary-like request headers
    .body             the request body, as a string
    .meta             user-supplied extra data, passed between Scrapy components
    .copy()           copy the request
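
    A minimal illustration of these attributes, not taken from the original post (the stock URL is just an example in the sh/sz + 6-digit format used later):

    # Building a Request by hand to inspect its attributes
    import scrapy

    req = scrapy.Request(
        url='https://gupiao.baidu.com/stock/sz002439.html',   # example stock URL
        method='GET',
        headers={'User-Agent': 'Mozilla/5.0'},
        meta={'retry': 0},        # .meta: user data carried along with the request
    )
    print(req.url, req.method)    # https://gupiao.baidu.com/stock/sz002439.html GET
    print(req.headers)            # dictionary-like request headers
    print(req.meta['retry'])      # 0
    req2 = req.copy()             # .copy() returns an independent copy of the request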

    四、The Response Class

    Attribute/Method  Description
    .url              the URL of this Response
    .status           the HTTP status code
    .headers          the response headers
    .body             the response body, as a string
    .flags            a set of flags
    .request          the Request object that produced this Response
    .copy()           copy the response
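
    A matching illustration for Response, using a hand-built HtmlResponse (normally Scrapy constructs the response and passes it to your callback); the HTML body here is invented for the example:

    # Inspecting Response attributes on a hand-built response object
    from scrapy.http import HtmlResponse, Request

    req = Request(url='http://quote.eastmoney.com/stocklist.html')
    resp = HtmlResponse(
        url=req.url,
        status=200,
        body=b'<a href="http://quote.eastmoney.com/sz002439.html">example</a>',
        encoding='utf-8',
        request=req,
    )
    print(resp.status)        # 200
    print(resp.headers)       # response headers (empty in this hand-built example)
    print(resp.body[:20])     # raw body, as bytes
    print(resp.request.url)   # the Request that produced this Response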

    五、A Targeted Stock-Data Spider Built with Scrapy

    # pipelines.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class Python123DemoPipeline(object):
        def process_item(self, item, spider):
            return item
    
    class BaidustocksPipeline(object):
        def open_spider(self, spider):
            # Called when the spider is opened: open the output file once.
            self.f = open('Stock.txt', 'w', encoding='utf-8')

        def close_spider(self, spider):
            # Called when the spider is closed: release the file handle.
            self.f.close()

        def process_item(self, item, spider):
            # Write each scraped item as one dict per line of text.
            try:
                line = str(dict(item)) + '\n'
                self.f.write(line)
            except Exception:
                pass
            return item
    
    # demo.py
    # -*- coding: utf-8 -*-
    import scrapy
    import re
    
    class DemoSpider(scrapy.Spider):
        name = 'demo'
        # allowed_domains = ['python123.io']
        start_urls = ['http://quote.eastmoney.com/stocklist.html']
        stock_info_url = 'https://gupiao.baidu.com/stock/'
    
        def parse(self, response):
            # Extract every stock code (sh/sz + 6 digits) found in the links of the
            # listing page and request the corresponding detail page.
            for href in response.css('a::attr(href)').extract():
                try:
                    stock = re.findall(r"[s][hz]\d{6}", href)[0]
                    url = self.stock_info_url + stock + '.html'
                    yield scrapy.Request(url, callback=self.parse_stock)
                except IndexError:
                    continue  # this href does not contain a stock code
    
        def parse_stock(self, response):
            # Parse one stock detail page into a dict of field name -> value.
            infoDict = {}
            stockInfo = response.css('.stock-bets')
            name = stockInfo.css('.bets-name').extract()[0]
            keyList = stockInfo.css('dt').extract()
            valueList = stockInfo.css('dd').extract()
            for i in range(len(keyList)):
                key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
                try:
                    val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
                except IndexError:
                    val = "--"
                infoDict[key] = val  # store each <dt>/<dd> pair
            infoDict.update({'股票名称': re.findall(r'\s.*\(', name)[0].split()[0]
                            + re.findall(r'\>.*\<', name)[0][1:-1]}
                            )
            yield infoDict
    
    # settings.py
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for python123demo project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'python123demo'
    
    SPIDER_MODULES = ['python123demo.spiders']
    NEWSPIDER_MODULE = 'python123demo.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'python123demo (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'python123demo.middlewares.Python123DemoSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'python123demo.middlewares.Python123DemoDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'python123demo.pipelines.BaidustocksPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
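
    With the three files above in place, running `scrapy crawl demo` from the project root executes the spider, and the pipeline writes the scraped dicts to Stock.txt. The spider can also be started from a script; a minimal sketch, assuming it is run from the project root (run.py is not part of the original post):

    # run.py -- optional programmatic launcher
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())  # loads settings.py, including ITEM_PIPELINES
    process.crawl('demo')                             # equivalent to `scrapy crawl demo`
    process.start()                                   # blocks until the crawl finishes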
    
    
