<h3>Architecture overview</h3>
This document describes the architecture of Scrapy and how its components interact.
<h3>Data flow</h3>
The data flow in Scrapy is controlled by the execution engine, and goes like this:
1. The Engine gets the initial requests to crawl from the Spider.
2. The Engine schedules the requests in the Scheduler and asks for the next requests to crawl.
3. The Scheduler returns the next requests to the Engine.
4. The Engine sends the requests to the Downloader, passing through the Downloader Middlewares (see process_request(); a minimal middleware sketch follows this list).
5. Once the page finishes downloading, the Downloader generates a response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
6. The Engine receives the response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
7. The Spider processes the response and returns scraped items and new requests to the Engine, passing through the Spider Middleware (see process_spider_output()).
8. The Engine sends the processed items to the Item Pipelines, then sends the processed requests to the Scheduler and asks for possible next requests to crawl.
9. The process repeats (from step 1) until there are no more requests from the Scheduler.
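The process_request(), process_response(), process_spider_input(), and process_spider_output() hooks referenced above are methods on middleware classes. The sketch below shows where each hook sits in the flow; the class names and the logging are illustrative, not part of Scrapy itself.

```python
import logging

logger = logging.getLogger(__name__)


class LoggingDownloaderMiddleware:
    """Hypothetical downloader middleware: sits between the Engine and the Downloader."""

    def process_request(self, request, spider):
        # Called for each request on its way to the Downloader (step 4).
        # Returning None lets Scrapy keep processing the request normally.
        logger.debug("Downloading %s", request.url)
        return None

    def process_response(self, request, response, spider):
        # Called for each response coming back from the Downloader (step 5).
        logger.debug("Got %s for %s", response.status, request.url)
        return response


class LoggingSpiderMiddleware:
    """Hypothetical spider middleware: sits between the Engine and the Spider."""

    def process_spider_input(self, response, spider):
        # Called for each response passed to the Spider (step 6).
        # Must return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items and requests the Spider yields (step 7).
        for item_or_request in result:
            yield item_or_request
```

Middlewares like these are enabled through the DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES settings.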
<h3>Components</h3>
<h4>Scrapy Engine</h4>
The Engine is responsible for controlling the data flow between all components of the system, and triggering events when certain actions occur. See the data flow above for more details.
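Users do not instantiate the Engine directly; it is created when a crawl is started. As a rough sketch (ExampleSpider is a hypothetical spider defined only so the crawl has something to run), a crawl can be started from a script with CrawlerProcess, which sets up the Engine behind the scenes:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    """Hypothetical spider, just enough for the crawl to have work to do."""

    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


# CrawlerProcess creates the execution engine, which then drives the
# data flow described in the previous section.
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl is finished
```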
<h4>Scheduler</h4>
The Scheduler receives requests from the Engine and enqueues them, feeding them back later (also to the Engine) when the Engine asks for them.
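The Scheduler is not called directly from spider code, but the order in which it returns requests can be influenced per request. A small sketch, with a hypothetical spider, of the request attributes the Scheduler takes into account:

```python
import scrapy


class PrioritySpider(scrapy.Spider):
    """Hypothetical spider showing per-request hints the Scheduler honours."""

    name = "priority_example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Higher-priority requests are dequeued by the Scheduler earlier;
        # dont_filter=True skips the duplicate filter for this request.
        yield scrapy.Request(
            "https://example.com/important",
            callback=self.parse,
            priority=10,
            dont_filter=True,
        )
```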
<h4>Downloader</h4>
The Downloader is responsible for fetching web pages and feeding them to the Engine which, in turn, feeds them to the Spiders.
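The Downloader's behaviour is controlled mostly through settings rather than code. A small sketch of a project's settings.py tuning how aggressively pages are fetched (the values are arbitrary examples):

```python
# settings.py (excerpt) -- example values only

# How many requests the Downloader performs in parallel.
CONCURRENT_REQUESTS = 8

# Delay between requests to the same site, to be polite.
DOWNLOAD_DELAY = 0.5

# Give up on a page that takes too long to fetch.
DOWNLOAD_TIMEOUT = 30
```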
<h4>Spiders</h4>
Spiders are custom classes written by Scrapy users to parse responses and extract items from them or additional requests to follow.
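For example, a minimal spider might look like the following sketch (the target site, selectors, and field names are chosen for illustration; quotes.toscrape.com is the demo site used in the Scrapy tutorial):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Illustrative spider: parses responses into items and follow-up requests."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract items from the response.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Return additional requests to follow.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```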
<h4>Item Pipeline</h4>
The Item Pipeline is responsible for processing the items once they have been extracted by the Spiders, performing tasks such as cleansing, validation, and persistence (for example, storing the item in a database).
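A minimal pipeline sketch performing a simple validation step (the field name and pipeline class are hypothetical):

```python
from scrapy.exceptions import DropItem


class RequirePricePipeline:
    """Hypothetical pipeline: drops items without a price and normalises the rest."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price in %r" % item)
        item["price"] = float(item["price"])
        return item
```

Pipelines are enabled through the ITEM_PIPELINES setting.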