This is a summary of what I learned while scraping second-hand housing listings for Baoshan District, Shanghai, from Lianjia.
Preparation
Python modules used:
- requests
- bs4
- pymongo
- datetime
- time
- random
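Of these, requests, bs4, and pymongo are third-party packages (bs4 is published on PyPI as beautifulsoup4), while datetime, time, and random ship with Python. For reference, these are the imports assumed across the snippets below:

import datetime
import random
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient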
Analyzing the page
Open http://sh.lianjia.com/ershoufang/baoshan in Chrome and bring up the developer tools.
Each listing sits inside an li element. Next, let's look at the pagination links.
Clicking through to the next pages, the URL shown in the browser follows a clear pattern:
http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
...
http://sh.lianjia.com/ershoufang/baoshan/d100
Now let's try fetching the URLs of the first 10 pages:
import requests

for i in range(1, 11):
    # build each page URL from the d1, d2, ... pattern and request it
    r = requests.get('http://sh.lianjia.com/ershoufang/baoshan/d' + str(i))
    print(r.url)
The result:
http://sh.lianjia.com/ershoufang/baoshan/d1
http://sh.lianjia.com/ershoufang/baoshan/d2
http://sh.lianjia.com/ershoufang/baoshan/d3
http://sh.lianjia.com/ershoufang/baoshan/d4
http://sh.lianjia.com/ershoufang/baoshan/d5
http://sh.lianjia.com/ershoufang/baoshan/d6
http://sh.lianjia.com/ershoufang/baoshan/d7
http://sh.lianjia.com/ershoufang/baoshan/d8
http://sh.lianjia.com/ershoufang/baoshan/d9
http://sh.lianjia.com/ershoufang/baoshan/d10
Parsing the page
The fields to extract are as follows:
- Title: room_title = room.find('div', attrs={'class': 'prop-title'})
- Basic info: room_info = room.find('span', attrs={'class': 'info-col row1-text'})
- Location: room_location = room.find('span', attrs={'class': 'info-col row2-text'})
- Extra info: extra_info = room.find('div', attrs={'class': 'property-tag-container'})
- Total price: room_price = room.find('span', attrs={'class': 'total-price strong-num'})
- Unit price: room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'})
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
# all listings on the page sit inside the ul with class js_fang_list
rooms = soup.find('ul', attrs={'class': 'js_fang_list'})
for room in rooms.find_all('li'):
    room_title = room.find('div', attrs={'class': 'prop-title'}).get_text()
    room_info = room.find('span', attrs={'class': 'info-col row1-text'}).get_text()
    room_location = room.find('span', attrs={'class': 'info-col row2-text'}).find('a').get_text()
    room_price = room.find('span', attrs={'class': 'total-price strong-num'}).get_text()
    room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'}).get_text()
    extra_info = room.find('div', attrs={'class': 'property-tag-container'}).get_text()
    print(room_title, room_info, room_location, room_price, room_unit_price, extra_info)
Here is one listing as parsed from the page:
厨卫全明,卧室带阳台,地铁房,高区采光好
1室1厅 | 44.73平
| 高区/6层
| 朝南
葑润华庭 255
单价57008元/平
距离7号线祁华路站698米
满二
有钥匙
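One thing to watch out for: BeautifulSoup's find() returns None when an element is missing, so calling get_text() directly raises an AttributeError for any listing that lacks a field. A small helper like the one below (my own addition, assuming the same class names as above) makes the loop more forgiving:

def safe_text(parent, name, class_name):
    # return the element's text, or an empty string if the element is missing
    tag = parent.find(name, attrs={'class': class_name})
    return tag.get_text(strip=True) if tag else ''

# e.g. inside the loop above:
# extra_info = safe_text(room, 'div', 'property-tag-container')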
Storing the data in MongoDB
MongoDB stores data as documents made up of {key: value} pairs, much like JSON.
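For example, the sample listing above ends up stored as a document that looks roughly like this (the _id field is added automatically by MongoDB, the time field is the timestamp we attach below, and whitespace in the values depends on the page markup):

{
    '_id': ObjectId('...'),
    'title': '厨卫全明,卧室带阳台,地铁房,高区采光好',
    'info': '1室1厅 | 44.73平 | 高区/6层 | 朝南',
    'location': '葑润华庭',
    'price': '255',
    'unit_price': '单价57008元/平',
    'message': '距离7号线祁华路站698米 满二 有钥匙',
    'time': datetime.datetime(...)
}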
import datetime
from pymongo import MongoClient

# connect to the local MongoDB server
client = MongoClient('localhost', 27017)
# select the database
db = client.tests
# select the collection
homes = db.homes

rooms_list = []
# first pack the scraped fields into a dict
rooms_info = {
    'title': room_title,
    'info': room_info,
    'location': room_location,
    'price': room_price,
    'unit_price': room_unit_price,
    'message': extra_info,
    'time': datetime.datetime.now()
}
rooms_list.append(rooms_info)
# write the collected documents into the collection
result = homes.insert_many(rooms_list)
print(result)
Running the code, we can see the data going into MongoDB:
<pymongo.results.InsertManyResult object at 0x00000260C536AB8>
<pymongo.results.InsertManyResult object at 0x00000260C536AAC>
<pymongo.results.InsertManyResult object at 0x00000260C536AA0>
<pymongo.results.InsertManyResult object at 0x00000260C536AB4>
<pymongo.results.InsertManyResult object at 0x00000260C536AB0>
<pymongo.results.InsertManyResult object at 0x00000260C536A28>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
<pymongo.results.InsertManyResult object at 0x00000260C536888>
<pymongo.results.InsertManyResult object at 0x00000260C536A08>
<pymongo.results.InsertManyResult object at 0x00000260C536AC8>
<pymongo.results.InsertManyResult object at 0x00000260C536A48>
<pymongo.results.InsertManyResult object at 0x00000260C536A88>
To browse the data, you can install a MongoDB GUI; I use Robo 3T, and the inserted documents show up there as expected.
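Besides a GUI, a quick query from Python also confirms that the inserts worked (a small check of my own, reusing the homes collection object from above):

# how many documents have been stored so far
print(homes.count_documents({}))
# inspect one stored listing
print(homes.find_one())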
There are 100 pages in total. I use time.sleep() to slow the requests down and avoid getting blocked, but the crawl is still quite slow, so over the next couple of days I plan to learn pandas.
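Putting everything together, the full crawl over all 100 pages looks roughly like this. The random pause between pages is the time.sleep() throttling mentioned above; the User-Agent header is an extra precaution I added and its value is just an example. For brevity it uses the same direct get_text() calls as the snippet above; the safe_text helper from earlier could be substituted for more robustness:

import datetime
import random
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
homes = client.tests.homes

headers = {'User-Agent': 'Mozilla/5.0'}  # look like a normal browser

for i in range(1, 101):
    r = requests.get('http://sh.lianjia.com/ershoufang/baoshan/d' + str(i), headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    rooms = soup.find('ul', attrs={'class': 'js_fang_list'})
    if rooms is None:
        # the page did not load as expected (e.g. blocked); skip it
        continue
    rooms_list = []
    for room in rooms.find_all('li'):
        rooms_list.append({
            'title': room.find('div', attrs={'class': 'prop-title'}).get_text(),
            'info': room.find('span', attrs={'class': 'info-col row1-text'}).get_text(),
            'location': room.find('span', attrs={'class': 'info-col row2-text'}).find('a').get_text(),
            'price': room.find('span', attrs={'class': 'total-price strong-num'}).get_text(),
            'unit_price': room.find('span', attrs={'class': 'info-col price-item minor'}).get_text(),
            'message': room.find('div', attrs={'class': 'property-tag-container'}).get_text(),
            'time': datetime.datetime.now(),
        })
    if rooms_list:
        print(homes.insert_many(rooms_list))
    # pause 1-3 seconds between pages to keep the request rate low
    time.sleep(random.uniform(1, 3))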