Bulk-scraping 5i5j (我爱我家) rental listings with Python

Author: 不存在的一角 | Published: 2018-12-21 23:19

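    Background

    I'm about to graduate and am looking for an internship, so I need to rent a place. This is my first time renting, and I don't know the local market. Should I go for a single room with a private bathroom or a shared flat? What is the typical rent? Which districts are more expensive?

    So I decided to scrape the rental listings for one city from 5i5j (我爱我家) and analyze them. This post covers only the scraping.

    https://sh.5i5j.com/zufang/

    The target city is Shanghai.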

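    I. Detailed requirements

    The site shows 13,234 rental listings for Shanghai. The fields we want are the listing title, layout (bedrooms and living rooms), floor area, orientation, floor, address, publish time, tags, monthly rent, rental mode, district, and so on.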

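    II. How the data is loaded

    Start with the page source.

    Almost all the data we need is right there, which is convenient.

    A first look shows 30 listings per page, so there are well over 430 pages in total.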
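
    As a quick check of that arithmetic, here is a minimal sketch. The 30-per-page figure and the 13,234 total come from the text above, and 13,014 is the count recorded in the docstring of the script at the end of this post. The helper name page_count is made up for illustration.

```python
import math


def page_count(total_listings: int, per_page: int = 30) -> int:
    """Number of result pages needed to show every listing."""
    return math.ceil(total_listings / per_page)


print(page_count(13234))  # 442
print(page_count(13014))  # 434
```

    The second figure matches the range(1, 435) loop used in the script.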

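    For a crawl this size the usual choice is scrapy. It is built on twisted, so its asynchronous requests make a crawler much more efficient, and it is one of the standard tools for the job.

    For this task, though, scrapy falls short. Here is what I found. In a scrapy project we would normally set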

    ROBOTSTXT_OBEY = False
    

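    but in my tests it had to be set to True before the site would respond correctly. That means obeying robots.txt, which restricts what the crawler may fetch and is a disadvantage for a crawler.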
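
    For reference, a scrapy settings.py fragment along these lines would obey robots.txt and slow the crawl down. The setting names are standard scrapy settings; the values are illustrative assumptions, not ones this post reports using, and whether throttling like this would avoid the ban described next is untested here.

```python
# settings.py (fragment): illustrative values only, not taken from the post
ROBOTSTXT_OBEY = True               # respect the site's robots.txt
DOWNLOAD_DELAY = 2                  # base delay in seconds between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
```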

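    Worse, after only a few pages every subsequent request comes back without data. Why?

    Print the response body and you will find this message: 该请求已被网宿云WAF拦截...... ("This request was blocked by the Wangsu Cloud WAF")

    In short, the crawler has been detected and the IP has been banned.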
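
    A crawler can check for that message instead of silently storing empty pages. The sketch below assumes the block page always contains the substring quoted above; the function name is_waf_blocked is hypothetical.

```python
WAF_MARKER = "网宿云WAF"  # substring of the block message quoted above


def is_waf_blocked(body: str) -> bool:
    """Return True if the response body looks like the WAF block page."""
    return WAF_MARKER in body


print(is_waf_blocked("该请求已被网宿云WAF拦截"))  # True
print(is_waf_blocked('{"houses": []}'))          # False
```

    When the check returns True, the sensible reaction is to stop or back off rather than keep requesting.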

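    Fine, what about switching to another IP? That works, but before long the new one is banned too.

    In other words, we still want to crawl with a script, but the main site's anti-scraping measures are strict.

    So what now, pick a different site? Of course not. A self-respecting scraper developer doesn't walk away from an obstacle this small.

    Here is the key idea: is there another way to get the same data? As mentioned in an earlier post, Weibo exposes its data through three different sites. Does 5i5j have an m (mobile) site, or a wap site?

    There is no wap site, but fortunately there is an m site: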

    https://m.5i5j.com/sh/zufang/

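    Routine analysis shows that the m site's page source also contains the data, but scraping the page source still gets you banned. The safest route is the interface the page uses to load data via ajax, which returns JSON. The JSON holds every field we need, and more.

    The URL is the same as the page URL. To get JSON back you must send the right request headers; otherwise the response is the HTML page.

    For convenience, copy the entire set of request headers from the browser. On the response, response.json() gives access to every value.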
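
    The request setup can be sketched with three small helpers. The URL pattern, the header values, and the 'houses' key are taken from the script at the end of this post; the function names are hypothetical. The script's comment warns that 'x-requested-with' alone may not be enough, so treat this reduced header set as a starting point only.

```python
import json

BASE_URL = "https://m.5i5j.com/sh/zufang/index-n{}"  # pattern used in the script


def page_url(page: int) -> str:
    """URL of one result page."""
    return BASE_URL.format(page)


def ajax_headers(cookie: str = "") -> dict:
    """Headers that mark the request as an XMLHttpRequest asking for JSON."""
    headers = {
        "accept": "application/json, text/javascript, */*; q=0.01",
        "referer": "https://m.5i5j.com/sh/zufang/index",
        "x-requested-with": "XMLHttpRequest",
    }
    if cookie:  # only send a cookie header when one was supplied
        headers["cookie"] = cookie
    return headers


def parse_houses(body: str) -> list:
    """Decode a JSON object body and return its 'houses' list, or [] if absent."""
    return json.loads(body).get("houses", [])


print(page_url(1))
print(parse_houses('{"houses": [{"_source": {"housesid": 38153363}}]}'))
```

    With requests installed, requests.get(page_url(1), headers=ajax_headers(), timeout=10).text would supply the body for parse_houses.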

    {
                "_index": "shanghaiv1_shexchangehouse",
                "_type": "shexchangehouse",
                "_id": "sale_9_9_38153363_0",
                "_score": 5,
                "_source": {
                    "qsdy": "产权清晰,无抵押",
                    "heattypeid": null,
                    "memo": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
                    "location": [
                        121.253125,
                        31.10956
                    ],
                    "buildage": 7,
                    "memo4": "产权清晰,无抵押",
                    "housetype": "普通住宅",
                    "housetypeid": 1,
                    "sqname": "泗泾",
                    "memo1": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
                    "buildarea": 76,
                    "pricechangetimelong": null,
                    "buildingfloor": 1,
                    "premisespermit": "",
                    "tag": [
                        4,
                        8
                    ],
                    "loopline": null,
                    "searchphrase": "一手动迁精装修两房边套有钥匙随时看房家具家电全送,泽悦路325弄1-30号,新凯家园四期茉莉雅苑,松江区,泗泾,songjiangqu,sijing",
                    "uptime": 20181219203748864,
                    "flag3d": 0,
                    "house_quality": "优质房源",
                    "unitprice": 26316,
                    "contacttime": "随时看房",
                    "houseallfloor": 14,
                    "housetitle": "一手动迁精装修两房边套有钥匙随时看房家具家电全送",
                    "downtime": null,
                    "traffic": null,
                    "floorPositionStr": "底层",
                    "livingroom_cn": "一厅",
                    "communityid": 325090,
                    "firstuptime": 1521939038968,
                    "tags": [
                        "jdbc_logstash_sale_sh"
                    ],
                    "jtcx": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
                    "communityname": "新凯家园四期茉莉雅苑",
                    "cityid": 9,
                    "gptime": "2017-07-07",
                    "memo3": "",
                    "subwaystationids": [],
                    "heading": "南",
                    "x": 121.253125,
                    "pre_price": 0,
                    "pricechangeflag": 0,
                    "y": 31.10956,
                    "subway": null,
                    "img3d": null,
                    "decoratelevel": "精装",
                    "headingid": 3,
                    "bedroom_cn": "二室",
                    "qyspell": "songjiangqu",
                    "sqid": 40000067,
                    "dkqk": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
                    "isdeleted": 0,
                    "istop": 0,
                    "government_qr": "",
                    "joins": 0,
                    "rim": "191.45.1845路直达泗泾站",
                    "price": 200,
                    "hasimg": 1,
                    "othertypeid": 1,
                    "hxjs": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
                    "checkintime": "",
                    "toilet_cn": "一卫",
                    "pricetrend": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
                    "decorate_time": "",
                    "cjdatestr": "2018-12-13 22:41:52",
                    "housesid": 38153363,
                    "floorPositionId": -1,
                    "qyname": "松江区",
                    "isnew": 0,
                    "imgs": [
                        "https://image18.5i5j.com/erp/house/3815/38153363/shinei/mgdamgkfe0d02fed.jpg_P5.jpg",
                        "https://image17.5i5j.com/erp/house/3815/38153363/shinei/elgaegjpe0dcedce.jpg_P5.jpg",
                        "https://image17.5i5j.com/erp/house/3815/38153363/shinei/lnooanoke0de2c1d.jpg_P5.jpg",
                        "https://image18.5i5j.com/erp/house/3815/38153363/shinei/dogeeiane0de3926.jpg_P5.jpg",
                        "https://image17.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P5.jpg",
                        "https://image16.5i5j.com/erp/house/3815/38153363/shinei/nhnfhjnje0d03998.jpg_P5.jpg",
                        "https://image16.5i5j.com/erp/house/3815/38153363/shinei/cfapbkkce0d054c5.jpg_P5.jpg",
                        "https://image17.5i5j.com/erp/house/3815/38153363/shinei/idahicfae0d2837c.jpg_P5.jpg",
                        "https://image18.5i5j.com/erp/house/3815/38153363/shinei/pkkjjomne0d27ad9.jpg_P5.jpg",
                        "https://image16.5i5j.com/erp/house/3815/38153363/huxing/knhhpcmne0c95f57.jpg_P5.jpg"
                    ],
                    "hxmd": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
                    "buildage_cn": "七年",
                    "sqspell": "sijing",
                    "livingroom": 1,
                    "house_quality_id": 2,
                    "bedroom": 2,
                    "memo2": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
                    "updown": 1,
                    "parking": "",
                    "sfjx": "",
                    "bookin_time": "2017-07-07",
                    "tagwall": [
                        "随时看",
                        "满二年"
                    ],
                    "memo5": "小区2012年交房,适合居住",
                    "rightprop": "使用权房",
                    "toilet": 1,
                    "buildyear": 2012,
                    "housesexchangescore": 17.9,
                    "subwaystations": [],
                    "cjflag": 0,
                    "subwaylineids": [],
                    "updatetimelong": 1545223070844,
                    "floortypeid": null,
                    "sectionname": "泽悦路325弄1-30号",
                    "qyid": 73,
                    "decoratelevelid": 3,
                    "floortype": "",
                    "esid": "sale_9_9_38153363_0",
                    "hximg": "",
                    "pricechangetime": null,
                    "xqxx": "小区2012年交房,适合居住",
                    "cjdate": 1544712112000,
                    "citycode": null,
                    "payment": "70.00",
                    "zbpt": "191.45.1845路直达泗泾站",
                    "heattype": "",
                    "firstuptimestr": "2018-03-25 08:50:38",
                    "updatetimestr": "2018-12-19 20:37:50",
                    "government_code": "",
                    "img3durl": null,
                    "imgurl": "https://image16.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P7.jpg",
                    "subwaylines": []
                }
            },
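
    Each record keeps its fields under _source, and several values are null. The sample above appears to be a for-sale record (its _id starts with sale_), so rental-only keys read by the script, such as area, rentmodename, and pay, do not appear in it. The sketch below therefore sticks to keys visible in the sample. The function names and the output keys are hypothetical, and reading firstuptime as a millisecond Unix timestamp in UTC+8 is an inference from the matching firstuptimestr value.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, UTC+8


def ms_to_cst(ms: int) -> str:
    """Convert a millisecond Unix timestamp to 'YYYY-MM-DD HH:MM:SS' in UTC+8."""
    return datetime.fromtimestamp(ms // 1000, tz=CST).strftime("%Y-%m-%d %H:%M:%S")


def flatten_listing(src: dict) -> dict:
    """Pick a few fields out of one `_source` object, tolerating missing or null values."""
    layout_keys = ("bedroom_cn", "livingroom_cn", "toilet_cn")
    return {
        "id": src.get("housesid"),
        "title": src.get("housetitle") or "",
        "layout": "".join(src.get(k) or "" for k in layout_keys),
        "floor": "{}/{}".format(src.get("floorPositionStr") or "", src.get("houseallfloor") or ""),
        "district": src.get("qyname") or "",
        "tags": ",".join(src.get("tagwall") or []),
        "traffic": src.get("traffic") or "",
        "first_listed": ms_to_cst(src["firstuptime"]) if src.get("firstuptime") else "",
    }


# Values copied from the sample record above.
sample = {
    "housesid": 38153363,
    "bedroom_cn": "二室",
    "livingroom_cn": "一厅",
    "toilet_cn": "一卫",
    "floorPositionStr": "底层",
    "houseallfloor": 14,
    "qyname": "松江区",
    "tagwall": ["随时看", "满二年"],
    "traffic": None,
    "firstuptime": 1521939038968,
}
row = flatten_listing(sample)
print(row["layout"])        # 二室一厅一卫
print(row["floor"])         # 底层/14
print(row["first_listed"])  # 2018-03-25 08:50:38
```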
    

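    After that, it's just a matter of saving the records to the database. Crawling through this interface, I never had an IP banned. Still, throttle your requests so you don't overload the server. If 5i5j later restricts this interface, collecting data at this scale will become very hard.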

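    III. Points to note when extracting the data

    1. The target is 5i5j's m site, and requests to the interface must carry the request headers.

    2. Throttle the requests.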
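
    One way to throttle is a random pause between pages. The sketch below is an assumption about reasonable values (1 to 3 seconds), not a limit published by the site; polite_delay and crawl are hypothetical names, and fetch stands in for whatever function downloads one page.

```python
import random
import time


def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Pick a random pause, in seconds, between min_s and max_s."""
    return random.uniform(min_s, max_s)


def crawl(pages, fetch, sleep=time.sleep):
    """Call fetch(page) for each page, pausing between consecutive pages."""
    results = []
    for i, page in enumerate(pages):
        if i:  # no pause before the first request
            sleep(polite_delay())
        results.append(fetch(page))
    return results


# Demo with a fake fetch and a no-op sleep so it finishes instantly.
print(crawl([1, 2, 3], fetch=lambda p: "page {}".format(p), sleep=lambda s: None))
```

    Passing sleep as a parameter keeps the loop testable; in a real crawl the default time.sleep applies.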

    IV. Results

    The crawl collected all the Shanghai rental listings on 5i5j, a little over 13,000 records, ready for later analysis.

    Here is the code:

    #!/usr/bin/python
    # -*- coding:utf-8 -*-
    # author:joel 18-6-5
    
    import random
    import time

    import pymysql
    import requests
    
    # CREATE TABLE `wiwj_sh_zufang` (
    #   `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    #   `house_id` char(16) NOT NULL,
    #   `house_url` varchar(127) NOT NULL,
    #   `house_jpg` varchar(512) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT '封面图',
    #   `house_title` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_type` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_buildarea` varchar(64) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_heading` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_floor` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_decoratelevel` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_place` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_firstuptime` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_price` int(16) NOT NULL,
    #   `house_renttype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_paytype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_area` varchar(32) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_tags` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_subwaylines` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_traffic` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
    #   `house_quality` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
    #   PRIMARY KEY (`id`),
    #   KEY `houseid` (`house_id`) USING BTREE
    # ) ENGINE=InnoDB AUTO_INCREMENT=2671 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
    
    
    class Wiwj(object):
        def __init__(self):
            """
            13014 listings at 30 per page gives 434 pages.
            """
            self.start_url = 'https://m.5i5j.com/sh/zufang/index-n{}'
            # Sending only 'x-requested-with' may not return JSON; sending the full set of request headers does.
            self.headers = {
                'accept': 'application/json, text/javascript, */*; q=0.01',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'zh-CN,zh;q=0.9',
                'cache-control': 'no-cache',
                'cookie': '',
                'pragma': 'no-cache',
                'referer': 'https://m.5i5j.com/sh/zufang/index',
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
                'x-requested-with': 'XMLHttpRequest',
            }
    
        def gethouselist(self):
            """Crawl the 5i5j Shanghai rental listings page by page."""
            for page in range(1, 435):
                print("-----------------------" + str(page) + "------------------")
                # timeout added so a stalled connection cannot hang the crawl
                r = requests.get(self.start_url.format(page), headers=self.headers, timeout=10)
                result = r.json()
                for house in result['houses']:
                    src = house['_source']  # the listing's fields live under '_source'
                    house_id = src['housesid']
                    house_url = 'https://m.5i5j.com/sh/zufang/{}.html'.format(house_id)
                    house_jpg = src['imgurl']
                    house_title = src['housetitle']
                    house_type = src['bedroom_cn'] + src['livingroom_cn'] + src['toilet_cn']
                    house_buildarea = src['area']
                    house_heading = src['heading']
                    house_floor = src['floorPositionStr'] + '/' + str(src['houseallfloor'])
                    house_decoratelevel = src['decoratelevel']
                    house_place = str(src['sqname']) + ' ' + str(src['communityname'])
                    house_firstuptime = src['firstuptimestr']
                    house_price = src['price']
                    house_renttype = src['rentmodename']
                    house_paytype = src['pay']
                    house_area = src['qyname']
                    house_tags = ','.join(src['tagwall'])
                    house_subwaylines = ','.join(src['subwaylines'])
                    house_traffic = src['traffic'] or ''  # can be null, as in the sample record
                    house_quality = src['house_quality']
                    self.insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                          house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                          house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
                # Pause 1-3 s between pages; the original randint(0, 2) could pick 0 and skip the pause.
                time.sleep(random.randint(1, 3))
    
        @staticmethod
        def insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                          house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                          house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality):
            # Fill in your own connection details. port=3306 is an assumption (the MySQL
            # default); the original left host, port, user, and password blank.
            conn = pymysql.connect(host='', port=3306, user='', passwd='', db='wiwj', charset='utf8mb4')
            cursor = conn.cursor()
    
            # Parameterized queries: the driver escapes the values, so a quote in a
            # listing title cannot break the statement.
            insert_sql = (
                "insert into `wiwj_sh_zufang` (`house_id`, `house_url`, `house_jpg`, `house_title`, "
                "`house_type`, `house_buildarea`, `house_heading`, `house_floor`, `house_decoratelevel`, "
                "`house_place`, `house_firstuptime`, `house_price`, `house_renttype`, `house_paytype`, "
                "`house_area`, `house_tags`, `house_subwaylines`, `house_traffic`, `house_quality`) "
                "values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            )
            params = (house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                      house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                      house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
            select_sql = "select `house_id` from `wiwj_sh_zufang` where `house_id`=%s"
    
            try:
                # execute() returns the number of matching rows
                found = cursor.execute(select_sql, (house_id,))
                if found:
                    print('Listing already exists...')
                else:
                    try:
                        cursor.execute(insert_sql, params)
                        conn.commit()
                        print('Listing inserted...')
                    except Exception as e:
                        print('Insert failed...', e)
                        conn.rollback()
            except Exception as e:
                print('Query failed...', e)
                conn.rollback()
            finally:
                cursor.close()
                conn.close()
    
    
    if __name__ == '__main__':
        # Shanghai
        wiwj = Wiwj()
        wiwj.gethouselist()
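
    The script checks whether a house_id already exists before each insert. A unique key lets the database do that check in one statement. The sketch below demonstrates the idea with Python's built-in sqlite3 so that it runs without a server; this is a stand-in, not the script's MySQL setup, and the listings table with its three columns is a simplified assumption. In MySQL the counterpart would be a UNIQUE key on house_id (the KEY in the table definition above is not unique) together with INSERT IGNORE.

```python
import sqlite3


def save_listing(conn, house_id, title, price):
    """Insert one listing; return True if a row was added, False if the id already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO listings (house_id, title, price) VALUES (?, ?, ?)",
        (house_id, title, price),  # bound parameters, so quotes in the title are safe
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (house_id TEXT PRIMARY KEY, title TEXT, price INTEGER)")

print(save_listing(conn, "38153363", "it's a test listing", 200))  # True
print(save_listing(conn, "38153363", "it's a test listing", 200))  # False
```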
    
