
Author: hobit | Published 2016-12-12 09:40

    <h3>Architecture overview</h3>
    This document describes the architecture of Scrapy and how its components interact.


    <h3>Data flow</h3>

    The data flow in Scrapy is controlled by the execution engine, and goes like this:

    1. The Engine gets the initial requests to crawl from the Spider.
    2. The Engine schedules the requests in the Scheduler and asks for the next requests to crawl.
    3. The Scheduler returns the next requests to the Engine.
    4. The Engine sends the requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
    5. Once the page finishes downloading, the Downloader generates a response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
    6. The Engine receives the response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
    7. The Spider processes the response and returns scraped items and new requests to the Engine, passing through the Spider Middleware (see process_spider_output()).
    8. The Engine sends the processed items to the Item Pipelines, then sends the processed requests to the Scheduler and asks for possible next requests to crawl.
    9. The process repeats (from step 1) until there are no more requests from the Scheduler.
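    The nine steps above can be sketched as a toy, synchronous event loop. This is a simplified model for illustration only (real Scrapy is asynchronous, built on Twisted, and runs middlewares between each hop); the `download` and `parse` stubs below are hypothetical stand-ins for the Downloader and the Spider:

    ```python
    from collections import deque

    def crawl(start_requests, download, parse):
        """Toy model of the Scrapy data flow: the 'engine' moves requests
        through a FIFO 'scheduler', a stub 'downloader', and a 'spider'
        callback, routing items and follow-up requests as in steps 1-9."""
        scheduler = deque(start_requests)   # steps 1-2: engine schedules the initial requests
        items = []
        while scheduler:                    # step 9: repeat until the scheduler is empty
            request = scheduler.popleft()   # step 3: scheduler returns the next request
            response = download(request)    # steps 4-5: downloader fetches the page
            for result in parse(response):  # steps 6-7: spider processes the response
                if isinstance(result, dict):
                    items.append(result)      # step 8: items go to the item pipelines
                else:
                    scheduler.append(result)  # step 8: new requests go back to the scheduler
        return items

    # Stub downloader and spider over a tiny in-memory "site":
    pages = {"/": ["/a", "/b"], "/a": [], "/b": []}

    def download(url):
        return (url, pages[url])

    def parse(response):
        url, links = response
        yield {"url": url}   # a scraped item
        yield from links     # follow-up requests

    print(crawl(["/"], download, parse))
    # → [{'url': '/'}, {'url': '/a'}, {'url': '/b'}]
    ```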

    <h3>Components</h3>
    <h4>Scrapy Engine</h4>
    The Engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur. See the data flow section above for more details.
    <h4>Scheduler</h4>
    The Scheduler receives requests from the Engine and enqueues them, feeding them back later (also to the Engine) when the Engine requests them.
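    A minimal in-memory scheduler can be sketched as follows. This is a hypothetical simplification, not Scrapy's actual scheduler class, which additionally supports priorities and disk-backed queues; it does illustrate the two duties named above (enqueue, feed back on demand) plus duplicate filtering:

    ```python
    from collections import deque

    class ToyScheduler:
        """Hypothetical scheduler sketch: enqueues requests from the
        engine and feeds them back when asked, skipping duplicates."""

        def __init__(self, lifo=True):
            self.queue = deque()
            self.lifo = lifo   # LIFO popping gives a depth-first crawl order
            self.seen = set()

        def enqueue_request(self, request):
            if request in self.seen:   # duplicate filtering
                return False
            self.seen.add(request)
            self.queue.append(request)
            return True

        def next_request(self):
            if not self.queue:
                return None
            return self.queue.pop() if self.lifo else self.queue.popleft()

    s = ToyScheduler()
    s.enqueue_request("/a")
    s.enqueue_request("/b")
    s.enqueue_request("/a")   # ignored: already seen
    print(s.next_request())   # → /b  (LIFO order)
    ```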
    <h4>Downloader</h4>
    The Downloader is responsible for fetching web pages and feeding them to the Engine which, in turn, feeds them to the Spiders.
    <h4>Spiders</h4>
    Spiders are custom classes written by Scrapy users to parse responses and extract items from them or additional requests to follow.
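    A spider's parse callback can be pictured schematically like this. The snippet is plain Python rather than Scrapy's real API (a real spider subclasses scrapy.Spider and yields Request objects), kept dependency-free so it runs as-is; the tokenizing logic is an invented example:

    ```python
    def parse(response):
        """Schematic spider callback: given a (url, body) response,
        yield scraped items (dicts) and follow-up request URLs (strings),
        which the engine would route to the pipelines and the scheduler."""
        url, body = response
        for word in body.split():
            if word.startswith("http"):
                yield word                          # a new request to follow
            else:
                yield {"page": url, "word": word}   # a scraped item

    results = list(parse(("/index", "hello http://example.com world")))
    # → [{'page': '/index', 'word': 'hello'}, 'http://example.com',
    #    {'page': '/index', 'word': 'world'}]
    ```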
    <h4>Item Pipeline</h4>
    The Item Pipeline is responsible for processing the items once they have been extracted by the Spiders. Typical tasks include cleansing, validation, and persistence (such as storing the item in a database).
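    A pipeline component follows a simple convention: a process_item() method that returns the (possibly cleaned) item to pass it along, or raises to drop it. The class below is a hypothetical sketch using a plain ValueError so it runs standalone; in real Scrapy you would raise scrapy.exceptions.DropItem and register the class in the ITEM_PIPELINES setting:

    ```python
    class PricePipeline:
        """Hypothetical item pipeline sketch: validates that a price is
        present (dropping the item otherwise) and normalizes it."""

        def process_item(self, item, spider=None):
            if item.get("price") is None:
                raise ValueError("missing price")           # validation: drop bad items
            item["price"] = round(float(item["price"]), 2)  # cleansing: normalize the value
            return item

    pipe = PricePipeline()
    print(pipe.process_item({"name": "book", "price": "9.999"}))
    # → {'name': 'book', 'price': 10.0}
    ```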
