1. First Look at the Scrapy Framework

Author: 思绪太重_飘不动 | Published 2019-06-10 19:10

    Using the Scrapy framework

    1. Creating a crawler project

    1. Create a Scrapy project (a sketch of the generated layout appears at the end of this section):
        scrapy startproject project_name  (project_name is the name of the project)
    
    2. Create a spider:
        cd project_name  (switch into the project directory, then create the spider)
        scrapy genspider spider_name spider.com  (spider_name is the spider's name, spider.com the domain of the site to crawl)
    
    3. Write the spider code in spider_name.py
    
    4. Start the project:
        Option 1: cd into the folder containing the spider and run: scrapy runspider spider_name.py
        Option 2: scrapy crawl spider_name
        Option 3: create a start.py file and write the following code:
        import scrapy.cmdline
    
        # Run a scrapy command programmatically
        def main():
            # Start the spider with logging enabled
            # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
            # scrapy.cmdline.execute("scrapy crawl movie".split())
            # Start the spider without logging
            scrapy.cmdline.execute("scrapy crawl movie --nolog".split())
    
    
        if __name__ == '__main__':
            main()
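
    As referenced in step 1 above, the commands typically generate a layout like the following
    (a sketch assuming the project is named meiju and the spider movie, matching the example
    later in this article; exact files can vary slightly by Scrapy version):

        meiju/
        ├── scrapy.cfg            # deploy configuration
        └── meiju/                # the project's Python package
            ├── __init__.py
            ├── items.py          # item models
            ├── middlewares.py    # spider / downloader middlewares
            ├── pipelines.py      # item pipelines
            ├── settings.py       # project settings
            └── spiders/
                ├── __init__.py
                └── movie.py      # created by `scrapy genspider movie www.meijutt.com`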
    
    

    2. How to extract text content in the spider file

    print(type(response))   # show the response type
    print(response.text)    # show the body as a string
    print(response.body)    # show the body as bytes
    The extract() function returns the text content of all matched nodes
    The extract_first() function returns the text content of the first matched node
    
    1. Scrapy has XPath support built in, so XPath is normally used to parse the content.
    2. Prefer extract_first() when extracting a single value: it does not raise an error when nothing matches (it returns None). See the scrapy shell sketch below.
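
    A quick way to experiment with these calls is the scrapy shell; a minimal sketch
    (the URL and selector are taken from the example below):

        scrapy shell "https://www.meijutt.com/new100.html"
        >>> titles = response.xpath('//ul[@class="top-list  fn-clear"]/li/h5/a/text()')
        >>> titles.extract()         # list of every matched string
        >>> titles.extract_first()   # first match only, or None if nothing matched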
    

    3. Example: crawling shows from a US TV-series site

    Crawl the latest shows from url = 'https://www.meijutt.com/new100.html'
      
    
    Data to collect: { show name: name, category: mjjp, broadcasting TV station: mjtv, update time: data_time }
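
    The XPath expressions in the spider below assume that each show is listed in markup
    roughly like the following (a sketch inferred from those selectors; the real page may
    differ in detail):

        <ul class="top-list  fn-clear">
            <li>
                <h5><a href="...">show name</a></h5>
                <span class="mjjq">category</span>
                <span class="mjtv">TV station</span>
                <div class="lasted-time new100time fn-right">update time</div>
            </li>
            ...
        </ul>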
    
    

    4. Full code

    # 1. start.py, created manually in the project root
    import scrapy.cmdline
    
    
    # Run a scrapy command programmatically
    def main():
        # Start the spider with logging enabled
        # scrapy.cmdline.execute(['scrapy', 'crawl', 'movie'])
        # scrapy.cmdline.execute("scrapy crawl movie".split())
        # Start the spider without logging
        scrapy.cmdline.execute("scrapy crawl movie --nolog".split())
        # Save the output as a JSON file
        # scrapy.cmdline.execute("scrapy crawl movie -o movie.json --nolog".split())
        # Save the output as an XML file
        # scrapy.cmdline.execute("scrapy crawl movie -o movie.xml --nolog".split())
        # Save the output as a CSV file
        # scrapy.cmdline.execute("scrapy crawl movie -o movie.csv --nolog".split())
    
    
    if __name__ == '__main__':
        main()
    
    # 2. movie.py (the spider file you created)
    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import MeijuItem
    
    
    # Inherits from the base class scrapy.Spider
    class MovieSpider(scrapy.Spider):
        name = 'movie'  # name of the spider (used by scrapy crawl)
        allowed_domains = ['www.meijutt.com']   # domains the spider is allowed to crawl
        start_urls = ['https://www.meijutt.com/new100.html']    # list of URLs to start crawling from
    
        # parse() is defined to extract the data we want
        # response: the server's response, which contains that data
        def parse(self, response):
            movie_list = response.xpath('//ul[@class="top-list  fn-clear"]/li')
            for movie in movie_list:
                name = movie.xpath('./h5/a/text()').extract_first()
                mjjp = movie.xpath('./span[@class="mjjq"]/text()').extract_first()
                mjtv = movie.xpath('./span[@class="mjtv"]/text()').extract_first()
                data_time = movie.xpath('./div[@class="lasted-time new100time fn-right"]/text()').extract_first()
                # print(name, mjjp, mjtv, data_time)
    
                item = MeijuItem()
                item['name'] = name
                item['mjjp'] = mjjp
                item['mjtv'] = mjtv
                item['data_time'] = data_time
    
                # yield passes the item on to pipelines.py
                yield item
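
    # Note (version assumption): in Scrapy 1.x and later, .get() and .getall() are the
    # preferred spellings of extract_first() and extract(); the older names still work.
    #   name = movie.xpath('./h5/a/text()').get()       # same as extract_first()
    #   names = movie.xpath('./h5/a/text()').getall()   # same as extract()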
    
    # 3. items.py (the data model for the scraped items)
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    # Define the model, comparable to a Model in Django (a short usage sketch follows this listing)
    class MeijuItem(scrapy.Item):
        # Define the fields for the scraped data
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()  # show name
        mjjp = scrapy.Field()  # category
        mjtv = scrapy.Field()  # TV station
        data_time = scrapy.Field()  # update time
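
    # A MeijuItem behaves like a dict, which is what the spider above relies on.
    # A minimal usage sketch (the values are made up for illustration):
    #   item = MeijuItem(name='Westworld')
    #   item['mjtv'] = 'HBO'
    #   print(dict(item))   # {'name': 'Westworld', 'mjtv': 'HBO'}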
    
    # 4. pipelines.py (pipelines handle storing the scraped items)
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # Item pipelines define what happens to each scraped item (here, writing it to a file)
    # Pipelines are also the natural place to drop duplicates, but that logic has to be
    # written explicitly (see the sketch after this listing)
    class MeijuPipeline(object):
        def __init__(self):
            pass
        
        # Called when the spider is opened; this hook is not generated for you, add it if needed
        def open_spider(self, spider):
            print('Crawling started......')
            self.fp = open('movie.txt', 'a', encoding='utf-8')
    
        # Processes each incoming item; called once for every item yielded
        # item: an item yielded by parse() in the spider file
        # spider: the spider object
        def process_item(self, item, spider):
            string = str((item['name'], item['mjjp'], item['mjtv'], item['data_time'])) + '\n'
            self.fp.write(string)
            self.fp.flush()
            return item
    
        # Called when the spider is closed; this hook is not generated for you, add it if needed
        def close_spider(self, spider):
            print('Crawling finished......')
            self.fp.close()
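
    # Note: pipelines do not deduplicate automatically; if deduplication is wanted it is
    # written as its own pipeline. A minimal sketch (DedupPipeline and the use of 'name'
    # as the unique key are assumptions, not part of the original project):
    from scrapy.exceptions import DropItem
    
    
    class DedupPipeline(object):
        def __init__(self):
            self.seen_names = set()
    
        # Drop any item whose name has already been seen, pass the rest through
        def process_item(self, item, spider):
            if item['name'] in self.seen_names:
                raise DropItem('duplicate item: %s' % item['name'])
            self.seen_names.add(item['name'])
            return item
    
    # To enable it, add 'meiju.pipelines.DedupPipeline': 200 to ITEM_PIPELINES in
    # settings.py (lower numbers run earlier).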
    
    # 5. settings.py (most settings in this file are commented out by default; uncomment a setting to enable it)
    # -*- coding: utf-8 -*-
    
    # Configuration file for the crawler
    # Scrapy settings for meiju project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    # Name of the project
    BOT_NAME = 'meiju'
    
    # Where the spider modules are located
    SPIDER_MODULES = ['meiju.spiders']
    # Where newly created spiders are placed
    NEWSPIDER_MODULE = 'meiju.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # Set the User-Agent; unused by default
    #USER_AGENT = 'meiju (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    # robots.txt is obeyed by default; change to False if you do not want to obey it
    # ROBOTSTXT_OBEY = True
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    # Concurrent requests per domain; the default is 16
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    # Spider middlewares, unused by default
    #SPIDER_MIDDLEWARES = {
    #    'meiju.middlewares.MeijuSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    # Downloader middlewares, unused by default
    #DOWNLOADER_MIDDLEWARES = {
    #    'meiju.middlewares.MeijuDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    # Enable the item pipeline; it is disabled by default and must be uncommented to be used
    ITEM_PIPELINES = {
       'meiju.pipelines.MeijuPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
