Simple Crawler with Scrapy

Author: Zihowe | Published: 2017-07-30 05:04

    Installation (Python 3.4)
    We need the Scrapy library (v1.3.3) and PyMongo (v3.4.0), the latest versions when this post was written; PyMongo is used to store the data in MongoDB. You also need to install MongoDB itself (not covered here).

    $ pip install Scrapy==1.3.3
    $ pip freeze > requirements.txt
    
    $ pip install pymongo==3.4.0
    $ pip freeze > requirements.txt
    

    Start the Project

    $ scrapy startproject stack
    

    Specify Data
    Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler because there is no concept of different field types.

    In the items.py file:

    #stack/items.py
    from scrapy.item import Item, Field
    
    class StackItem(Item):
        title = Field()
        url = Field()
    

    Create the Spider
    Create a file called stack_spider.py in the “spiders” directory.
    Use Chrome's Inspect tool to copy the XPath of the element you want to scrape.

    # stack/spiders/stack_spider.py
    from scrapy import Spider
    from scrapy.selector import Selector
    
    from stack.items import StackItem
    
    
    class StackSpider(Spider):
        name = "stack"
        allowed_domains = ["stackoverflow.com"]
        start_urls = [
            "http://stackoverflow.com/questions?pagesize=50&sort=newest",
        ]
    
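        def parse(self, response):
            # Each question title sits in an <h3> inside a div with class "summary"
            questions = Selector(response).xpath('//div[@class="summary"]/h3')
    
            for question in questions:
                item = StackItem()
                # extract_first() returns None instead of raising IndexError
                # when the expression matches nothing
                item['title'] = question.xpath(
                    'a[@class="question-hyperlink"]/text()').extract_first()
                item['url'] = question.xpath(
                    'a[@class="question-hyperlink"]/@href').extract_first()
                yield item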
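
    To see what the two XPath expressions in the spider select without running a crawl, here is a minimal sketch that uses only the standard library. It is a stand-in, not Scrapy's selector API: xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup. SAMPLE_HTML, BASE_URL and extract_questions are hypothetical names, and the markup is a simplified guess at the page structure the spider expects. The sketch also joins each href with the start URL, since the extracted links are likely to be relative.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup modelled on the structure the spider's XPath expects
SAMPLE_HTML = """
<html><body>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/1/first-question">First question</a></h3>
  </div>
  <div class="summary">
    <h3><a class="question-hyperlink" href="/questions/2/second-question">Second question</a></h3>
  </div>
  <div class="sidebar">
    <h3><a class="question-hyperlink" href="/ignored">Ignored</a></h3>
  </div>
</body></html>
"""

# Same URL as the spider's start_urls entry
BASE_URL = "http://stackoverflow.com/questions?pagesize=50&sort=newest"


def extract_questions(markup, base_url):
    """Return a list of {'title', 'url'} dicts for every summary heading."""
    root = ET.fromstring(markup.strip())
    results = []
    # Equivalent of //div[@class="summary"]/h3 in ElementTree's XPath subset
    for heading in root.findall(".//div[@class='summary']/h3"):
        link = heading.find("a[@class='question-hyperlink']")
        if link is None:
            continue  # heading without a question link
        results.append({
            "title": link.text,
            # Turn the relative href into an absolute URL
            "url": urljoin(base_url, link.get("href")),
        })
    return results


questions = extract_questions(SAMPLE_HTML, BASE_URL)
for q in questions:
    print(q["title"], q["url"])
```

    Only the two headings inside class="summary" divs are returned; the sidebar heading is skipped because the first expression never matches it. Inside a real spider, response.urljoin(href) does the same URL joining.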
    

    Test

    $ scrapy crawl stack -o items.json -t json
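
    Once the crawl finishes, items.json should hold a JSON array with one object per item. The following is a rough sketch for checking that output, assuming each object carries the title and url fields declared in StackItem. load_items and summarize are hypothetical helpers, and the two records are made-up sample data written to a temporary file so the snippet runs without a crawl.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a JSON array of scraped items and return it as a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(items):
    """Return one 'title -> url' line per item, skipping incomplete records."""
    return [
        "{} -> {}".format(item["title"], item["url"])
        for item in items
        if item.get("title") and item.get("url")
    ]


# Stand-in for the crawl output: two made-up records in a temporary items.json
sample = [
    {"title": "First question", "url": "/questions/1/first-question"},
    {"title": "Second question", "url": "/questions/2/second-question"},
]
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "items.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    lines = summarize(load_items(sample_path))

for line in lines:
    print(line)
```

    To inspect a real crawl, call load_items("items.json") from the project directory instead of using the temporary file.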
    
