Implementing a crawler with scrapy + mongodb + redis


Author: a十二_4765 | Published 2017-03-16 19:30

    1. Install scrapy: pip install scrapy

        Install scrapy-redis: pip install scrapy-redis

    2. Install MongoDB

    mongod.exe is the server; mongo.exe is the client shell

    Install the MongoDB service, with the files placed under F:\php\mongodb

    F:\php\mongodb\bin>dir  to list the directory contents

    mongod --dbpath F:/php/mongodb, where F:\php\mongodb is where the data is stored

    Start the MongoDB server:

    mongod.exe  --dbpath F:/php/mongodb/bin/

    Then open another cmd window, cd into the bin directory, and run mongo.exe to get a client shell

    Install pymongo for Python:

    pip install pymongo
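    A quick sanity check that the server is reachable (a minimal sketch, assuming mongod is running locally on the default port 27017):

    import pymongo

    # connect to the local mongod started above and print its version
    client = pymongo.MongoClient('127.0.0.1', 27017)
    print(client.server_info()['version'])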

    3. Install Redis
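    After installing, start the Redis server and confirm it responds (assuming a default localhost install; on the Windows builds the binary is redis-server.exe):

    redis-server        start the server
    redis-cli ping      should answer PONG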

    Crawl target: lottery draw results from http://www.bwlc.net/

    First, create the project:

    scrapy startproject fucai 

    cd into the fucai directory

    Then generate the spider: scrapy genspider ff bwlc.net
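    This produces the standard scrapy scaffolding; the files edited in the steps below are:

    fucai/
        scrapy.cfg
        fucai/
            items.py        <- item fields
            pipelines.py    <- MongoDB pipeline
            settings.py     <- scheduler / redis / mongo settings
            spiders/
                ff.py       <- the spider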

    Edit items.py:

    import scrapy

    class FucaiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        qihao = scrapy.Field()     # draw number
        kaijiang = scrapy.Field()  # winning numbers
        riqi = scrapy.Field()      # draw date

    Then go into the spiders directory and edit ff.py:

    import scrapy
    from scrapy.http import Request
    from fucai.items import FucaiItem
    from scrapy_redis.spiders import RedisSpider

    class FfSpider(RedisSpider):
        name = "ff"
        # RedisSpider pulls its start URLs from this redis key,
        # so start_requests() does not need to be overridden
        redis_key = 'ff:start_urls'
        allowed_domains = ["bwlc.net"]

        def parse(self, response):
            # page count shown in the pager, printed for reference;
            # only the first two result pages are crawled below
            url = response.xpath('//div[@class="fc_fanye"]/span[2]/b[@class="col_red"]/text()').extract()
            print(url)
            for j in range(1, 3):
                page = "http://www.bwlc.net/bulletin/prevqck3.html?page=" + str(j)
                yield Request(url=page, callback=self.next2)

        def next2(self, response):
            # each <tr class=...> row holds one draw record
            for i in response.xpath('//tr[@class]'):
                item = FucaiItem()
                cells = i.xpath('td/text()').extract()
                item["qihao"] = cells[0]     # draw number
                item["kaijiang"] = cells[1]  # winning numbers
                item["riqi"] = cells[2]      # draw date
                yield item

    Configure settings.py:

    # route scheduling and dedup through redis so the queue can be shared and persisted
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'

    ITEM_PIPELINES = {
        'fucai.pipelines.FucaiPipeline': 300,
    }

    # custom settings read by the MongoDB pipeline
    MONGODB_HOST = '127.0.0.1'
    MONGODB_PORT = 27017
    MONGODB_DBNAME = 'jike'
    MONGODB_DOCNAME = 'reada'

    Edit pipelines.py, which writes each item into MongoDB.
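    A minimal sketch of that pipeline, reading the setting names defined above (insert_one assumes pymongo 3.x):

    import pymongo

    class FucaiPipeline(object):
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)

        def __init__(self, settings):
            # read the MongoDB settings defined in settings.py
            client = pymongo.MongoClient(settings['MONGODB_HOST'], settings['MONGODB_PORT'])
            self.collection = client[settings['MONGODB_DBNAME']][settings['MONGODB_DOCNAME']]

        def process_item(self, item, spider):
            # one document per draw record
            self.collection.insert_one(dict(item))
            return item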

    Push the seed URL to crawl into redis, under the key the spider watches
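    For example, from redis-cli (the list-page URL here is taken from the spider code above):

    redis-cli lpush ff:start_urls http://www.bwlc.net/bulletin/prevqck3.html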

    Then run scrapy crawl ff to start crawling
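    Once it finishes, the stored records can be checked from the mongo.exe client shell, using the db and collection names configured above:

    > use jike
    > db.reada.find()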

    For questions, contact QQ: 1158219108
