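Requirements
I'm about to graduate and am looking for an internship, so I need to rent a place. This is my first time renting, and I don't know the local market: should I take a single room with a private bathroom or a shared room? What is the typical rent? Which areas cost more?
So I decided to scrape the rental listings for one area from 5i5j (我爱我家) and run some analysis on them. This post only covers the scraping.
The target area is Shanghai.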
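I. Detailed requirements
5i5j lists 13234 rentals in Shanghai. For each one we want the title, layout (bedrooms and living rooms), area, orientation, floor, address, publish time, tags, monthly rent, rental type, district, and so on.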
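II. How the data is loaded
A look at the page source shows that almost all the data we need is already there, which is convenient.
A first pass shows 30 listings per page, which works out to more than 430 pages.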
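The page count follows from a ceiling division. A quick sketch of the arithmetic, using the two listing totals that appear in this post (13234 here, 13014 in the script's docstring); `page_count` is just an illustrative helper name:

```python
import math

PER_PAGE = 30  # listings shown on each result page


def page_count(total, per_page=PER_PAGE):
    """Number of pages needed to show `total` listings."""
    return math.ceil(total / per_page)


print(page_count(13234))  # 442
print(page_count(13014))  # 434, the figure used in the script at the end
```

The total drifts as listings are added and removed, so it is worth recomputing the page range before each run rather than hard-coding it.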
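For a data set of this size, Scrapy is the usual first thought. It is built on Twisted, so its asynchronous requests make a crawler considerably faster, and it is one of the standard tools for scraper development.
For this job, though, Scrapy seems to fall short. Here is what I found: when pointing Scrapy at the 5i5j main site, we would normally set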
ROBOTSTXT_OBEY = False
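Here, however, it has to be set to True before the main site can be reached at all. That means obeying robots.txt, which restricts what the crawler may fetch and is a handicap for a scraper.
That is not the worst of it. After scraping just a few pages, nothing more comes back, whatever you try. Why?
Print the response body and you will find this line: 该请求已被网宿云WAF拦截...... ("This request has been blocked by the Wangsu Cloud WAF")
Put simply, the crawler has been detected and the site has banned your IP.
Fine, so switch to another IP? That works, but before long you are banned again.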
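Because a blocked request still returns a page, it helps to check each response body for the WAF message before parsing it. A minimal sketch, assuming the block page always contains the exact phrase quoted above; `is_blocked` is a hypothetical helper, not part of any library:

```python
# Phrase seen in the response body once the crawler was blocked (quoted above).
WAF_MARKER = "该请求已被网宿云WAF拦截"


def is_blocked(body):
    """Return True if the response body looks like the WAF block page."""
    return WAF_MARKER in body


if __name__ == "__main__":
    sample = "<html><body>该请求已被网宿云WAF拦截</body></html>"
    print(is_blocked(sample))  # True
```

Stopping the crawl as soon as this returns True avoids hammering the site with requests that can no longer succeed.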
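So we will stick with a plain script, but the main site's anti-scraping defenses are still a serious obstacle.
What now, switch to a different site? Of course not. A self-respecting scraper does not walk away from a small obstacle like this.
Here is the key idea: is there another way to get the same data? As mentioned in an earlier post, Weibo has three sites that serve its data. Could 5i5j have an m site (mobile), or a wap site?
There is no wap site, but luckily there is a 5i5j m site!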
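After that pleasant surprise, the usual analysis shows that the m site also carries the data in its page source, but scraping the page source still gets you banned. The most reliable route is the endpoint that loads data via AJAX and returns JSON, which contains everything we need and more.
The URL is the same as the page's own URL, but to get JSON back you must send the request headers; otherwise the response is the page's HTML again.
For convenience, simply copy the entire set of request headers. On the response, response.json() gives access to every value.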
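The request described above can be sketched as follows. The URL pattern, the header values and the `houses` key are taken from the script at the end of this post; `build_request` and `fetch_houses` are illustrative names, and the cookie is left empty as in the original. Treat it as a sketch under those assumptions, since the site may have changed since this was written.

```python
M_SITE_URL = "https://m.5i5j.com/sh/zufang/index-n{}"


def build_request(page, cookie=""):
    """Return (url, headers) for one page of the m-site JSON endpoint."""
    headers = {
        "accept": "application/json, text/javascript, */*; q=0.01",
        "accept-language": "zh-CN,zh;q=0.9",
        "cookie": cookie,
        "referer": "https://m.5i5j.com/sh/zufang/index",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/70.0.3538.102 Safari/537.36",
        # Marks the request as AJAX; without it the site answers with HTML.
        "x-requested-with": "XMLHttpRequest",
    }
    return M_SITE_URL.format(page), headers


def fetch_houses(page, cookie=""):
    """Fetch one page and return its list of listings."""
    import requests  # third-party; imported here so build_request has no dependencies

    url, headers = build_request(page, cookie)
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    try:
        return resp.json()["houses"]
    except ValueError:
        # The body was not JSON, most likely the HTML page: check the headers.
        raise RuntimeError("expected JSON but got something else from " + url)
```

Keeping the header construction separate from the network call makes it easy to inspect exactly what is sent before any request goes out.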
{
"_index": "shanghaiv1_shexchangehouse",
"_type": "shexchangehouse",
"_id": "sale_9_9_38153363_0",
"_score": 5,
"_source": {
"qsdy": "产权清晰,无抵押",
"heattypeid": null,
"memo": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
"location": [
121.253125,
31.10956
],
"buildage": 7,
"memo4": "产权清晰,无抵押",
"housetype": "普通住宅",
"housetypeid": 1,
"sqname": "泗泾",
"memo1": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
"buildarea": 76,
"pricechangetimelong": null,
"buildingfloor": 1,
"premisespermit": "",
"tag": [
4,
8
],
"loopline": null,
"searchphrase": "一手动迁精装修两房边套有钥匙随时看房家具家电全送,泽悦路325弄1-30号,新凯家园四期茉莉雅苑,松江区,泗泾,songjiangqu,sijing",
"uptime": 20181219203748864,
"flag3d": 0,
"house_quality": "优质房源",
"unitprice": 26316,
"contacttime": "随时看房",
"houseallfloor": 14,
"housetitle": "一手动迁精装修两房边套有钥匙随时看房家具家电全送",
"downtime": null,
"traffic": null,
"floorPositionStr": "底层",
"livingroom_cn": "一厅",
"communityid": 325090,
"firstuptime": 1521939038968,
"tags": [
"jdbc_logstash_sale_sh"
],
"jtcx": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
"communityname": "新凯家园四期茉莉雅苑",
"cityid": 9,
"gptime": "2017-07-07",
"memo3": "",
"subwaystationids": [],
"heading": "南",
"x": 121.253125,
"pre_price": 0,
"pricechangeflag": 0,
"y": 31.10956,
"subway": null,
"img3d": null,
"decoratelevel": "精装",
"headingid": 3,
"bedroom_cn": "二室",
"qyspell": "songjiangqu",
"sqid": 40000067,
"dkqk": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
"isdeleted": 0,
"istop": 0,
"government_qr": "",
"joins": 0,
"rim": "191.45.1845路直达泗泾站",
"price": 200,
"hasimg": 1,
"othertypeid": 1,
"hxjs": "此房为南向二室一厅一卫,建筑面积为76平米,南北通透户型,透风性好,采光佳。",
"checkintime": "",
"toilet_cn": "一卫",
"pricetrend": "业主接受:商贷、公积金贷款、组合贷、现金。房东接受正常首付,贷款贷款情况仅供参考,最终以实际情况为准。",
"decorate_time": "",
"cjdatestr": "2018-12-13 22:41:52",
"housesid": 38153363,
"floorPositionId": -1,
"qyname": "松江区",
"isnew": 0,
"imgs": [
"https://image18.5i5j.com/erp/house/3815/38153363/shinei/mgdamgkfe0d02fed.jpg_P5.jpg",
"https://image17.5i5j.com/erp/house/3815/38153363/shinei/elgaegjpe0dcedce.jpg_P5.jpg",
"https://image17.5i5j.com/erp/house/3815/38153363/shinei/lnooanoke0de2c1d.jpg_P5.jpg",
"https://image18.5i5j.com/erp/house/3815/38153363/shinei/dogeeiane0de3926.jpg_P5.jpg",
"https://image17.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P5.jpg",
"https://image16.5i5j.com/erp/house/3815/38153363/shinei/nhnfhjnje0d03998.jpg_P5.jpg",
"https://image16.5i5j.com/erp/house/3815/38153363/shinei/cfapbkkce0d054c5.jpg_P5.jpg",
"https://image17.5i5j.com/erp/house/3815/38153363/shinei/idahicfae0d2837c.jpg_P5.jpg",
"https://image18.5i5j.com/erp/house/3815/38153363/shinei/pkkjjomne0d27ad9.jpg_P5.jpg",
"https://image16.5i5j.com/erp/house/3815/38153363/huxing/knhhpcmne0c95f57.jpg_P5.jpg"
],
"hxmd": "房子已装修可拎包入住 没有户口产权一人 房子干净明了",
"buildage_cn": "七年",
"sqspell": "sijing",
"livingroom": 1,
"house_quality_id": 2,
"bedroom": 2,
"memo2": "小区门口公交总站,可坐191路公交车,直达泗泾站,十分钟车程。",
"updown": 1,
"parking": "",
"sfjx": "",
"bookin_time": "2017-07-07",
"tagwall": [
"随时看",
"满二年"
],
"memo5": "小区2012年交房,适合居住",
"rightprop": "使用权房",
"toilet": 1,
"buildyear": 2012,
"housesexchangescore": 17.9,
"subwaystations": [],
"cjflag": 0,
"subwaylineids": [],
"updatetimelong": 1545223070844,
"floortypeid": null,
"sectionname": "泽悦路325弄1-30号",
"qyid": 73,
"decoratelevelid": 3,
"floortype": "",
"esid": "sale_9_9_38153363_0",
"hximg": "",
"pricechangetime": null,
"xqxx": "小区2012年交房,适合居住",
"cjdate": 1544712112000,
"citycode": null,
"payment": "70.00",
"zbpt": "191.45.1845路直达泗泾站",
"heattype": "",
"firstuptimestr": "2018-03-25 08:50:38",
"updatetimestr": "2018-12-19 20:37:50",
"government_code": "",
"img3durl": null,
"imgurl": "https://image16.5i5j.com/erp/house/3815/38153363/shinei/bjojeagce0dc3ab4.jpg_P7.jpg",
"subwaylines": []
}
},
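A record like the one above carries far more fields than the analysis needs. Below is a minimal sketch of flattening one record into the handful of values we care about, assuming the field names shown in this sample. The function name and output keys are hypothetical, and the rental feed may differ: the script at the end reads `area`, `rentmodename` and `pay`, none of which appear in this sample.

```python
def flatten_house(hit):
    """Pick the fields of interest out of one record's `_source`."""
    src = hit.get("_source", {})
    layout_keys = ("bedroom_cn", "livingroom_cn", "toilet_cn")
    return {
        "id": src.get("housesid"),
        "title": src.get("housetitle"),
        # e.g. "二室" + "一厅" + "一卫"
        "layout": "".join(src.get(k) or "" for k in layout_keys),
        "area_m2": src.get("buildarea"),
        "heading": src.get("heading"),
        "floor": "{}/{}".format(src.get("floorPositionStr") or "",
                                src.get("houseallfloor") or ""),
        "district": src.get("qyname"),
        # Lists may be empty or null, so fall back to an empty list.
        "tags": ",".join(str(t) for t in (src.get("tagwall") or [])),
        "subway": ",".join(str(s) for s in (src.get("subwaylines") or [])),
    }


if __name__ == "__main__":
    sample = {"_source": {"housesid": 38153363, "bedroom_cn": "二室",
                          "livingroom_cn": "一厅", "toilet_cn": "一卫",
                          "floorPositionStr": "底层", "houseallfloor": 14,
                          "tagwall": ["随时看", "满二年"], "subwaylines": []}}
    print(flatten_house(sample))
```

Using `.get` with fallbacks means a null or missing field yields an empty value instead of an exception halfway through a long crawl.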
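After that, it is just a matter of saving the records to the database. Going through this endpoint, I did not run into any IP bans. Even so, throttle the requests so the crawl does not put too much load on the server; if 5i5j later restricts this endpoint, collecting data in bulk will become much harder.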
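III. Points to note when extracting the data
1. The target is the 5i5j m site, and requests to the endpoint must carry the request headers
2. Throttle the requests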
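One way to throttle is to sleep for a randomized interval after every page. The sketch below is an illustration only: the 1 to 2 second range is an assumed value rather than anything the site documents, and `polite_delay` and `crawl` are hypothetical names. The fetch and sleep functions are passed in so the loop can be tried without touching the network.

```python
import random
import time


def polite_delay(base=1.0, jitter=1.0):
    """Seconds to wait: `base` plus a random extra of up to `jitter`."""
    return base + random.uniform(0.0, jitter)


def crawl(pages, fetch, sleep=time.sleep):
    """Call fetch(page) for each page, pausing after every request."""
    results = []
    for page in pages:
        results.append(fetch(page))
        sleep(polite_delay())
    return results


if __name__ == "__main__":
    # Dry run with a stand-in fetch function; no requests are sent.
    print(crawl([1, 2], fetch=lambda p: "page {}".format(p)))
```

Unlike `random.randint(0, 2)`, which can return 0 and skip the pause entirely, this always waits at least `base` seconds.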
IV. Results
I scraped all the 5i5j rental listings for Shanghai, a little over 13,000 records, ready for the analysis to come.
Here is the code!
#!/usr/bin/python
# -*- coding:utf-8 -*-
# author:joel 18-6-5
import random
import time

import pymysql
import requests


# CREATE TABLE `wiwj_sh_zufang` (
# `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
# `house_id` char(16) NOT NULL,
# `house_url` varchar(127) NOT NULL,
# `house_jpg` varchar(512) CHARACTER SET utf8mb4 DEFAULT NULL COMMENT 'cover image',
# `house_title` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
# `house_type` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
# `house_buildarea` varchar(64) CHARACTER SET utf8mb4 NOT NULL,
# `house_heading` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
# `house_floor` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
# `house_decoratelevel` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
# `house_place` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
# `house_firstuptime` varchar(128) CHARACTER SET utf8mb4 NOT NULL,
# `house_price` int(16) NOT NULL,
# `house_renttype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
# `house_paytype` varchar(16) CHARACTER SET utf8mb4 NOT NULL,
# `house_area` varchar(32) CHARACTER SET utf8mb4 NOT NULL,
# `house_tags` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
# `house_subwaylines` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
# `house_traffic` varchar(512) CHARACTER SET utf8mb4 NOT NULL,
# `house_quality` varchar(256) CHARACTER SET utf8mb4 NOT NULL,
# PRIMARY KEY (`id`),
# KEY `houseid` (`house_id`) USING BTREE
# ) ENGINE=InnoDB AUTO_INCREMENT=2671 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;


class Wiwj(object):
    def __init__(self):
        """
        13014 listings at 30 per page: 434 pages in total
        """
        self.start_url = 'https://m.5i5j.com/sh/zufang/index-n{}'
        # Sending only 'x-requested-with' may not return JSON, so include the full set of request headers
        self.headers = {
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cache-control': 'no-cache',
            'cookie': '',
            'pragma': 'no-cache',
            'referer': 'https://m.5i5j.com/sh/zufang/index',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }

    def gethouselist(self):
        """ 5i5j Shanghai rentals """
        for page in range(1, 435):
            print("-----------------------" + str(page) + "------------------")
            r = requests.get(self.start_url.format(page), headers=self.headers)
            result = r.json()
            houses = result['houses']
            for house in houses:
                src = house['_source']
                house_id = src['housesid']
                house_url = 'https://m.5i5j.com/sh/zufang/{}.html'.format(house_id)
                house_jpg = src['imgurl']
                house_title = src['housetitle']
                house_type = src['bedroom_cn'] + src['livingroom_cn'] + src['toilet_cn']
                house_buildarea = src['area']
                house_heading = src['heading']
                house_floor = src['floorPositionStr'] + '/' + str(src['houseallfloor'])
                house_decoratelevel = src['decoratelevel']
                house_place = str(src['sqname']) + ' ' + str(src['communityname'])
                house_firstuptime = src['firstuptimestr']
                house_price = src['price']
                house_renttype = src['rentmodename']
                house_paytype = src['pay']
                house_area = src['qyname']
                # These lists can be empty or null, so fall back to an empty list
                house_tags = ','.join(src['tagwall'] or [])
                house_subwaylines = ','.join(src['subwaylines'] or [])
                house_traffic = src['traffic']
                house_quality = src['house_quality']
                self.insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                                 house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                                 house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
            # Pause 0-2 seconds between pages to go easy on the server
            time.sleep(random.randint(0, 2))

    @staticmethod
    def insertmysql(house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                    house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                    house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality):
        # Connection details are left blank on purpose; fill in your own.
        # port=3306 is only MySQL's default, used here as a placeholder.
        conn = pymysql.connect(host='', port=3306, user='', passwd='', db='wiwj')
        cursor = conn.cursor()
        # Parameterized queries let the driver do the quoting, so a title containing quotes cannot break the SQL
        insert_sql = ("insert into `wiwj_sh_zufang` (`house_id`, `house_url`, `house_jpg`, `house_title`, "
                      "`house_type`, `house_buildarea`, `house_heading`, `house_floor`, `house_decoratelevel`, "
                      "`house_place`, `house_firstuptime`, `house_price`, `house_renttype`, `house_paytype`, "
                      "`house_area`, `house_tags`, `house_subwaylines`, `house_traffic`, `house_quality`) "
                      "values (" + ", ".join(["%s"] * 19) + ")")
        values = (house_id, house_url, house_jpg, house_title, house_type, house_buildarea, house_heading,
                  house_floor, house_decoratelevel, house_place, house_firstuptime, house_price, house_renttype,
                  house_paytype, house_area, house_tags, house_subwaylines, house_traffic, house_quality)
        # Nulls from the feed (e.g. traffic) become empty strings, since most columns are NOT NULL
        values = tuple('' if v is None else v for v in values)
        select_sql = "select `house_id` from `wiwj_sh_zufang` where `house_id`=%s"
        try:
            # execute() returns the number of matching rows
            response = cursor.execute(select_sql, (house_id,))
            if response:
                print('Listing already exists...')
            else:
                try:
                    cursor.execute(insert_sql, values)
                    conn.commit()
                    print('Listing inserted...')
                except Exception as e:
                    print('Insert failed...', e)
                    conn.rollback()
        except Exception as e:
            print('Query failed...', e)
            conn.rollback()
        finally:
            cursor.close()
            conn.close()


if __name__ == '__main__':
    # Shanghai
    wiwj = Wiwj()
    wiwj.gethouselist()
    print('Search WeChat official accounts for "猿狮的单身日常": Java upgrades and scraper training, see you there!')
    print('You can also scan the QR code below~')