python3 scrapy_redis: distributed crawling of 房天下 (Fang.com), stored to mo

Author: 简书用户9527 | Source: published 2018-05-02 09:29, read 53 times

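(I) A brief introduction to scrapy_redis

scrapy_redis builds on the Scrapy framework and integrates Redis. Request deduplication is done through Redis, so spiders running on several servers can crawl the same site as one distributed job.

(II) Basic scrapy_redis configuration

(1) Add two lines to settings.py:

# Use the Redis-backed scheduler so the request queue is stored in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make every spider deduplicate requests through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"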

(2) In the spider file, change the base class from scrapy.Spider to RedisSpider and add a redis_key attribute.

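That is the basic scrapy_redis configuration inside a Scrapy project. For more configuration options, see my earlier blog posts.

(III) Writing the 房天下 (Fang.com) spider code

(I) The content to collect is the featured listings (优选房源)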

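Start page: https://m.fang.com/zf/bj/?jhtype=zf

This page loads more results as you scroll down. There is no "next page" link to click; the extra results are fetched dynamically by JavaScript. The requests it makes can be seen in the Network panel of the browser's developer tools.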

One of the request URLs:

https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&jhtype=zf&city=%B1%B1%BE%A9&renttype=cz&c=zf&a=ajaxGetList&city=bj&r=0.7838634037101673&page=3

The URL ends with page=3. Controlling that one variable is all we need in order to walk through every list page.
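
As a minimal sketch of that idea, the helper below rebuilds the captured URL with a different `page` value and then decodes the query string to show what the other parameters contain. The name `list_page_url` is made up for illustration, the `r` value is simply copied from the captured request, and treating the percent-encoded values as GBK is an assumption based on how they decode.

```python
from urllib.parse import parse_qs, urlsplit

# Captured AJAX URL from the Network panel, with the trailing page number removed.
BASE_URL = (
    "https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&jhtype=zf"
    "&city=%B1%B1%BE%A9&renttype=cz&c=zf&a=ajaxGetList"
    "&city=bj&r=0.7838634037101673&page="
)


def list_page_url(page: int) -> str:
    """Return the list-page URL for the given 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL + str(page)


if __name__ == "__main__":
    url = list_page_url(3)
    print(url)
    # Assumption: the percent-encoded values are GBK rather than UTF-8.
    params = parse_qs(urlsplit(url).query, encoding="gbk")
    print(params["city"], params["page"])
```

Note that `city` appears twice in the captured query string, so `parse_qs` returns two values for it.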

However, when we open the link above, the response shows up as a pile of garbled text.

(II) Decoding the response body inside parse() makes it display correctly.

    def parse(self, response):
        print(response.body.decode('utf-8'))
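
The body looks garbled because it is a byte string, and it only becomes readable once it is decoded with the charset it was encoded in. Here is a small illustration with a made-up payload (the sample text is hypothetical, not real site output), showing that the wrong codec produces mojibake while the right one restores the text:

```python
# Hypothetical payload standing in for response.body (bytes, not str).
sample_text = "朝阳 劲松 整租"
body = sample_text.encode("utf-8")


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(body)                          # escaped bytes, unreadable
    print(decode_body(body, "latin-1"))  # wrong codec: mojibake
    print(decode_body(body))             # right codec: original text
```

Inside a Scrapy callback, `response.text` returns the body of a text response already decoded with the encoding Scrapy detected, which may make the explicit call unnecessary.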

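Important!

Because this is a distributed setup, my approach is to have one machine that only collects URLs (it walks the list pages and gathers the listing URLs found on them) and another machine that only parses the pages behind those URLs.

At the moment I only have one computer, so I run two spiders on it instead: one crawls the URLs and the other parses the pages.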

(1) URL crawling: [image]

(2) Page parsing: [image]
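
The hand-off between the two spiders can be pictured with an in-memory simulation. This is only a sketch: the two `deque` objects stand in for the Redis lists `fangtianxia:start_urls` and `fangtianxia:house_urls`, the page contents are invented, and the function names are hypothetical. No real crawling or Redis access happens here.

```python
from collections import deque


def url_stage(start_urls: deque, house_urls: deque, fake_list_pages: dict) -> None:
    """Stage 1: take list-page URLs and push the detail URLs found on each."""
    while start_urls:
        list_url = start_urls.popleft()
        for detail_url in fake_list_pages.get(list_url, []):
            house_urls.append(detail_url)


def parse_stage(house_urls: deque) -> list:
    """Stage 2: take detail URLs and turn each into a placeholder item."""
    items = []
    while house_urls:
        items.append({"url": house_urls.popleft()})
    return items


if __name__ == "__main__":
    # Hypothetical list pages and the detail links they contain.
    pages = {
        "list?page=1": ["detail/a", "detail/b"],
        "list?page=2": ["detail/c"],
    }
    start_queue = deque(pages)  # stands in for fangtianxia:start_urls
    house_queue = deque()       # stands in for fangtianxia:house_urls
    url_stage(start_queue, house_queue, pages)
    print(parse_stage(house_queue))
```

Because the two stages only share the queues, either one can be moved to another machine or run in several copies without changing the other.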

(1) Spider code for crawling the URLs:

# -*- coding: utf-8 -*-
# @Time    : 2018/4/30 14:14
# @Author  : 蛇崽
# @Email   : 643435675@QQ.com
# @File    : fangtianxia.py (房天下)
import redis
from scrapy_redis.spiders import RedisSpider

from zhilianspider.settings import REDIS_HOST, REDIS_PWD


class FangtianxiaSpider(RedisSpider):

    name = 'fangtianxia'

    allowed_domains = ['m.fang.com']

    # 44684 p:16  index 3192

    # start_urls = ['https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&jhtype=zf&city=%B1%B1%BE%A9&renttype=cz&c=zf&a=ajaxGetList&city=bj&r=0.7782449595236586&page=1']

    base_url = 'https://m.fang.com/zf/?purpose=%D7%A1%D5%AC&jhtype=zf&city=%B1%B1%BE%A9&renttype=cz&c=zf&a=ajaxGetList&city=bj&r=0.7782449595236586&page='

    # Redis list this spider reads its list-page URLs from
    redis_key = 'fangtianxia:start_urls'

    # Redis connection; parse() also uses it to hand detail URLs to the parsing spider
    pool = redis.ConnectionPool(host=REDIS_HOST, port=6379, db=0, password=REDIS_PWD)
    redis_client = redis.StrictRedis(connection_pool=pool)

    # Seed the queue with list pages 1..3191.
    # Note: this loop runs whenever the class is defined, i.e. on every import of this module.
    for index in range(1, 3192):
        start_url = base_url + str(index)
        redis_client.lpush(redis_key, start_url)

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        urls = response.xpath("//*[@class='tongjihref']/@href").extract()
        for v_url in urls:
            # The original prepended 'https:' to each href (protocol-relative links);
            # urljoin gives the same result and also copes with other href forms.
            n_v_url = response.urljoin(v_url)
            self.logger.debug('detail url: %s', n_v_url)
            self.redis_client.rpush('fangtianxia:house_urls', n_v_url)
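
One thing to be aware of in the spider above: the seeding loop sits in the class body, so it runs every time the module is imported, and each extra worker started from this file would push the same list-page URLs again. A possible alternative, sketched below and not taken from the original project, is to seed once from a separate script. `seed_start_urls` and `FakeRedis` are hypothetical names. The function only relies on the client having `llen` and `lpush`, which a `redis.StrictRedis` instance provides.

```python
def seed_start_urls(client, key: str, base_url: str, first: int, last: int) -> int:
    """Push list-page URLs for pages first..last (inclusive) unless the key already has entries.

    Returns the number of URLs pushed.
    """
    if client.llen(key) > 0:
        return 0  # already seeded and not yet drained; do nothing
    urls = [base_url + str(page) for page in range(first, last + 1)]
    if urls:
        client.lpush(key, *urls)
    return len(urls)


class FakeRedis:
    """Tiny in-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def llen(self, key):
        return len(self.lists.get(key, []))

    def lpush(self, key, *values):
        lst = self.lists.setdefault(key, [])
        for value in values:
            lst.insert(0, value)  # LPUSH prepends each value in turn
        return len(lst)


if __name__ == "__main__":
    client = FakeRedis()
    key = "fangtianxia:start_urls"
    print(seed_start_urls(client, key, "https://example.com/?page=", 1, 5))
    print(seed_start_urls(client, key, "https://example.com/?page=", 1, 5))
```

The emptiness check is a crude guard: it stops helping once the workers have drained the list, so a long-running setup would need some separate marker to record that seeding already happened.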

(2) Page-parsing code:

# -*- coding: utf-8 -*-
# @Time    : 2018/4/30 14:14
# @Author  : 蛇崽
# @Email   : 643435675@QQ.com
# @File    : fangtianxia.py (房天下)
from scrapy_redis.spiders import RedisSpider

from zhilianspider.items import FanItem


class FangtianxiaSpider(RedisSpider):

    name = 'fangtianxia_down'

    allowed_domains = ['m.fang.com']

    # Detail-page URLs pushed by the URL-crawling spider
    redis_key = 'fangtianxia:house_urls'

    # start_urls = ['https://m.fang.com/zf/bj/JHAGT_404572021_11444434x1010063105_163711602.html']

    def parse(self, response):
        # extract_first() returns None when a node is missing,
        # whereas [0].extract() would raise IndexError and lose the whole item.
        item = FanItem()
        item["title"] = response.xpath('//*[@class="xqCaption mb8"]/h1/text()').extract_first()
        item["area"] = response.xpath('//*[@class="xqCaption mb8"]/p/a[2]/text()').extract_first()
        item["location"] = response.xpath('//*[@class="xqCaption mb8"]/p/a[3]/text()').extract_first()
        item["housing_estate"] = response.xpath('//*[@class="xqCaption mb8"]/p/a[1]/text()').extract_first()
        item["rent"] = response.xpath('//*[@class="f18 red-df"]/text()').extract_first()
        item["rent_type"] = response.xpath('//*[@class="f12 gray-8"]/text()').extract_first()
        item["floor_area"] = response.xpath('//*[@class="flextable"]/li[3]/p/text()').extract_first()
        item["house_type"] = response.xpath('//*[@class="flextable"]/li[2]/p/text()').extract_first()
        item["floor"] = response.xpath('//*[@class="flextable"]/li[4]/p/text()').extract_first()
        item["orientations"] = response.xpath('//*[@class="flextable"]/li[5]/p/text()').extract_first()
        item["decoration"] = response.xpath('//*[@class="flextable"]/li[6]/p/text()').extract_first()
        item["house_info"] = response.xpath('//*[@class="xqIntro"]/p/text()').extract_first()
        item["house_tags"] = ",".join(response.xpath('//*[@class="stag"]/span/text()').extract())
        yield item

(III) items.py code:

import scrapy


class FanItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # District (e.g. 朝阳)
    area = scrapy.Field()
    # Neighborhood (e.g. 劲松)
    location = scrapy.Field()
    # Residential compound (e.g. 劲松五区)
    housing_estate = scrapy.Field()
    # Rent
    rent = scrapy.Field()
    # Floor area
    floor_area = scrapy.Field()
    # Layout (户型)
    house_type = scrapy.Field()
    # Floor
    floor = scrapy.Field()
    # Orientation
    orientations = scrapy.Field()
    # Decoration
    decoration = scrapy.Field()
    # Listing description
    house_info = scrapy.Field()
    # Tags
    house_tags = scrapy.Field()
    # Rental terms (e.g. 押一付三: one month's deposit, three months paid up front)
    rent_type = scrapy.Field()

(IV) The data so far

The crawl has not finished yet. Redis already holds about 600,000 detail-page URLs, and I am worried it will run out of memory.
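
A rough back-of-the-envelope check can help judge that worry. The sketch below multiplies the average byte length of a few sample URLs by the number of queued entries. It counts only the URL text; Redis adds per-entry overhead that is not modeled here, so the real footprint will be higher. The sample is the detail-page URL quoted in the parsing spider, and the function name is made up.

```python
def estimate_payload_bytes(sample_urls, total_entries: int) -> int:
    """Estimate raw payload size: mean UTF-8 length of the samples times the entry count."""
    if not sample_urls:
        raise ValueError("need at least one sample URL")
    mean_len = sum(len(u.encode("utf-8")) for u in sample_urls) / len(sample_urls)
    return round(mean_len * total_entries)


if __name__ == "__main__":
    samples = [
        "https://m.fang.com/zf/bj/JHAGT_404572021_11444434x1010063105_163711602.html",
    ]
    size = estimate_payload_bytes(samples, 600_000)
    print(f"{size / 1024 / 1024:.1f} MiB of URL text (overhead not included)")
```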

The MongoDB database holds about 30,000 records so far.

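To sum up: this is the first time since I started using the framework that I have had scrapy_redis crawl URLs this way. The split probably suits distributed operation better: one spider collects URLs and several spiders parse the pages behind them, since parsing pages is the more time-consuming part.

The rest of the code is almost the same as in my earlier 智联招聘 (Zhaopin) crawler, so I won't paste it here. I will upload the complete code.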
