Python Web Scraping with Scrapy


Author: _羊羽_ | Published 2018-07-16 00:54

    Installation

    macOS environment

    Install the C compiler toolchain first:

    xcode-select --install
    

    Install Scrapy

    pip3 install Scrapy
    

    Create a project

    scrapy startproject xxx (project name)

    scrapy startproject firstProject
    New Scrapy project 'firstProject', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
        /Users/baxiang/Documents/Python/Scrapy/firstProject
    
    You can start your first spider with:
        cd firstProject
        scrapy genspider example example.com
    

    Project structure

    .
    ├── firstProject
    │   ├── __init__.py
    │   ├── __pycache__
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       ├── __init__.py
    │       └── __pycache__
    └── scrapy.cfg
    
    4 directories, 7 files
    

    items.py

    Defines the data to be scraped and processed later.

    settings.py

    Configures the Scrapy project.
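A few commonly tuned settings, shown as an illustrative fragment (the values below are examples, not taken from the original project):

```python
# settings.py (fragment)
BOT_NAME = 'firstProject'

# Respect robots.txt rules on the target site
ROBOTSTXT_OBEY = True

# Throttle requests to avoid overloading the target site
DOWNLOAD_DELAY = 1
```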

    pipelines.py

    Holds the post-processing logic for scraped data.
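A pipeline is simply a class with a process_item method; Scrapy passes every scraped item through it. This minimal sketch (hypothetical, not part of the original project) normalizes a title field:

```python
class CleanTitlePipeline:
    """Strips surrounding whitespace from the 'title' field of every item."""

    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item
```

A pipeline only runs after it is registered in the ITEM_PIPELINES dictionary in settings.py.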

    Common commands

    $ scrapy -h
    Scrapy 1.5.0 - project: firstProject
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      bench         Run quick benchmark test
      check         Check spider contracts
      crawl         Run a spider
      edit          Edit spider
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      list          List available spiders
      parse         Parse URL (using its spider) and print the results
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
    Use "scrapy <command> -h" to see more info about a command
    
    

    Create a spider

    Create the project:

    scrapy startproject Toscrape
    

    Generate the spider file:

    scrapy genspider news www.163.com
    

    The generated news.py contains:

    import scrapy

    class NewsSpider(scrapy.Spider):
        name = 'news'
        allowed_domains = ['www.163.com']
        start_urls = ['http://www.163.com/']

        def parse(self, response):
            pass
    
    
    1. name: the unique identifier of a spider; one project can contain multiple spiders.
    2. allowed_domains: the domains the spider is allowed to crawl.
    3. start_urls: the initial URL(s) from which the spider starts crawling.
    4. parse: the callback invoked to parse a page once the engine has downloaded it.


    Original link: https://www.haomeiwen.com/subject/lmpfyftx.html