Scraping Shanghai Lianjia Housing Listings with Python and Storing Them in MongoDB

Author: Treehl | Published 2017-11-29 21:29

    These are my study notes from scraping listings for the Baoshan district from the Shanghai Lianjia site.

    Preparation

    Python modules used:

    • requests
    • bs4
    • pymongo
    • datetime
    • time
    • random

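    Analyzing the page

    Open http://sh.lianjia.com/ershoufang/baoshan and launch the Chrome developer tools.

    Each listing sits in its own li element. Next, look at the pagination links.
    Click through to the next page and the URL in the browser follows a regular pattern: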

    http://sh.lianjia.com/ershoufang/baoshan/d1
    http://sh.lianjia.com/ershoufang/baoshan/d2
    http://sh.lianjia.com/ershoufang/baoshan/d3
    .........
    http://sh.lianjia.com/ershoufang/baoshan/d100

    Now let's try requesting the first 10 pages and printing their URLs:

    import requests
    for i in range(1, 11):
        r = requests.get('http://sh.lianjia.com/ershoufang/baoshan/d' + str(i))
        print(r.url)
      
    

    Result

    http://sh.lianjia.com/ershoufang/baoshan/d1
    http://sh.lianjia.com/ershoufang/baoshan/d2
    http://sh.lianjia.com/ershoufang/baoshan/d3
    http://sh.lianjia.com/ershoufang/baoshan/d4
    http://sh.lianjia.com/ershoufang/baoshan/d5
    http://sh.lianjia.com/ershoufang/baoshan/d6
    http://sh.lianjia.com/ershoufang/baoshan/d7
    http://sh.lianjia.com/ershoufang/baoshan/d8
    http://sh.lianjia.com/ershoufang/baoshan/d9
    http://sh.lianjia.com/ershoufang/baoshan/d10
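The URL pattern above can also be generated without sending any requests, which makes it easy to check the page range before crawling. This is a minimal sketch; `BASE_URL` and `page_urls` are names made up for illustration and are not part of the original script.

```python
# Hypothetical helper: builds the listing-page URLs from the "d<page>" pattern.
BASE_URL = "http://sh.lianjia.com/ershoufang/baoshan/d{page}"


def page_urls(first=1, last=10):
    """Return the listing-page URLs for pages first..last (inclusive)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return [BASE_URL.format(page=n) for n in range(first, last + 1)]


urls = page_urls(1, 10)
print(urls[0])   # first page
print(urls[-1])  # tenth page
```

Each URL in the list can then be passed to `requests.get` as in the loop above.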
    
    

    Parsing the page

    The fields to extract are:

    • Title: room_title = room.find('div', attrs={'class': 'prop-title'})
    • Room details: room_info = room.find('span', attrs={'class': 'info-col row1-text'})
    • Location: room_location = room.find('span', attrs={'class': 'info-col row2-text'})
    • Extra information: extra_info = room.find('div', attrs={'class': 'property-tag-container'})
    • Total price: room_price = room.find('span', attrs={'class': 'total-price strong-num'})
    • Unit price: room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'})

    from bs4 import BeautifulSoup

    # r is the response for one listing page, as fetched in the loop above
    soup = BeautifulSoup(r.text, 'html.parser')
    # the <ul> that holds every listing on the page
    rooms = soup.find('ul', attrs={'class': 'js_fang_list'})
    for room in rooms.find_all('li'):
        room_title = room.find('div', attrs={'class': 'prop-title'}).get_text()
        room_info = room.find('span', attrs={'class': 'info-col row1-text'}).get_text()
        room_location = room.find('span', attrs={'class': 'info-col row2-text'}).find('a').get_text()
        room_price = room.find('span', attrs={'class': 'total-price strong-num'}).get_text()
        room_unit_price = room.find('span', attrs={'class': 'info-col price-item minor'}).get_text()
        extra_info = room.find('div', attrs={'class': 'property-tag-container'}).get_text()

        print(room_title, room_info, room_location, room_price, room_unit_price, extra_info)
    
    

    Here is one listing as parsed from the page:

    
    厨卫全明,卧室带阳台,地铁房,高区采光好
     
                                1室1厅 | 44.73平
                                
                                    | 高区/6层
                                
                                
                                    | 朝南
                                
                             葑润华庭 255 
                                单价57008元/平
                             
    距离7号线祁华路站698米
    满二
    有钥匙
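The text returned by get_text() keeps the page's line breaks and indentation, as the output above shows. Below is a minimal sketch of how the strings could be tidied before storage, using only the standard library. The helper names `clean`, `split_info` and `first_number` are hypothetical and do not appear in the original code.

```python
import re


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def split_info(room_info):
    """Split a '|'-separated info string into its non-empty parts."""
    return [part.strip() for part in clean(room_info).split("|") if part.strip()]


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# Sample input copied from the parsed listing above.
raw_info = """
    1室1厅 | 44.73平

        | 高区/6层


        | 朝南
"""

print(clean(raw_info))
print(split_info(raw_info))
print(first_number("单价57008元/平"))
```

Storing prices as numbers rather than strings would make later sorting and averaging simpler, though the original script stores the raw text.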
    
    

    Storing the data in MongoDB

    MongoDB stores data as documents made of {key: value} pairs, somewhat like JSON.
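To make the comparison with JSON concrete, here is a small sketch of one listing written as a Python dict, which is the form pymongo accepts as a document. The field values are taken from the sample listing above and the timestamp is the post's publication time, used purely as an illustration. One difference worth noting: a MongoDB document can hold a datetime value directly, while the standard json module cannot serialize one unless it is converted, here with `default=str`.

```python
import datetime
import json

# One listing as a Python dict; the values are illustrative.
room = {
    "title": "厨卫全明,卧室带阳台,地铁房,高区采光好",
    "price": "255",
    "time": datetime.datetime(2017, 11, 29, 21, 29),
}

# json.dumps raises TypeError on datetime by default, so convert it with str().
as_json = json.dumps(room, default=str, ensure_ascii=False)
print(as_json)
```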


    import datetime
    from pymongo import MongoClient

    # Connect to the MongoDB server
    client = MongoClient('localhost', 27017)
    # Select the database (MongoDB creates it on the first write)
    db = client.tests
    # Select the collection
    homes = db.homes

    rooms_list = []

    # Put the scraped values into a dict first
    rooms_info = {
        'title': room_title,
        'info': room_info,
        'location': room_location,
        'price': room_price,
        'unit_price': room_unit_price,
        'message': extra_info,
        'time': datetime.datetime.now()
    }

    rooms_list.append(rooms_info)
    # Insert the documents into the database
    result = homes.insert_many(rooms_list)
    print(result)
    
    

    Run the code and each insert prints a result object, showing that the data has been written to MongoDB:

    <pymongo.results.InsertManyResult object at 0x00000260C536AB8>
    <pymongo.results.InsertManyResult object at 0x00000260C536AAC>
    <pymongo.results.InsertManyResult object at 0x00000260C536AA0>
    <pymongo.results.InsertManyResult object at 0x00000260C536AB4>
    <pymongo.results.InsertManyResult object at 0x00000260C536AB0>
    <pymongo.results.InsertManyResult object at 0x00000260C536A28>
    <pymongo.results.InsertManyResult object at 0x00000260C536AC8>
    <pymongo.results.InsertManyResult object at 0x00000260C536A08>
    <pymongo.results.InsertManyResult object at 0x00000260C536A88>
    <pymongo.results.InsertManyResult object at 0x00000260C536A88>
    <pymongo.results.InsertManyResult object at 0x00000260C536888>
    <pymongo.results.InsertManyResult object at 0x00000260C536A08>
    <pymongo.results.InsertManyResult object at 0x00000260C536AC8>
    <pymongo.results.InsertManyResult object at 0x00000260C536A48>
    <pymongo.results.InsertManyResult object at 0x00000260C536A88>
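Printing the InsertManyResult object only shows its memory address. Its `inserted_ids` attribute lists the ids of the documents that were written, so its length gives a more useful confirmation. The sketch below assumes a MongoDB server on localhost:27017 and the same `tests` database and `homes` collection as above; `build_document` and `save_rooms` are hypothetical helper names, not part of the original script.

```python
import datetime


def build_document(title, info, location, price, unit_price, message):
    """Assemble one listing as a dict ready to be inserted."""
    return {
        "title": title,
        "info": info,
        "location": location,
        "price": price,
        "unit_price": unit_price,
        "message": message,
        "time": datetime.datetime.now(),
    }


def save_rooms(rooms_list, host="localhost", port=27017):
    """Insert all documents in one call and return how many were written."""
    if not rooms_list:
        return 0  # insert_many rejects an empty list, so skip the call
    # Imported here so build_document can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    try:
        result = client.tests.homes.insert_many(rooms_list)
        return len(result.inserted_ids)
    finally:
        client.close()


doc = build_document("title", "info", "location", "255", "57008", "extra")
print(sorted(doc))
# With a MongoDB server running: print(save_rooms([doc]))
```

Collecting a whole page of listings into rooms_list and calling insert_many once per page keeps the number of round trips to the database low.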
    
    

    You can download a GUI tool for MongoDB to browse the stored data. I use Robo3T, and the listings show up there once they are saved.

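    There are 100 pages of data in total. I use time.sleep() to throttle the requests so the scraper does not get blocked, but this makes crawling very slow. I plan to learn pandas over the next couple of days.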
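The throttling mentioned above could look roughly like the sketch below, which combines time.sleep() with the random module from the module list so that the pauses are not perfectly regular. The function names, the delay bounds and the injectable `fetch` parameter are all assumptions made for illustration. In real use `fetch` would wrap requests.get; here a stand-in is passed so the example runs without network access.

```python
import random
import time


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random time between the two bounds and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def crawl(urls, fetch, min_seconds=1.0, max_seconds=3.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            polite_delay(min_seconds, max_seconds)
        pages.append(fetch(url))
    return pages


# Stand-in fetch function and tiny delays, so the example finishes quickly.
pages = crawl(["d1", "d2", "d3"], fetch=lambda url: url.upper(),
              min_seconds=0.01, max_seconds=0.02)
print(pages)
```

With 100 pages and a pause of one to three seconds between requests, a full crawl takes a few minutes, which matches the slowness noted above.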

    The full code is on GitHub.
    Jianshu
    You are welcome to visit Treehl's blog.

        Article link: https://www.haomeiwen.com/subject/hdapbxtx.html