美文网首页
【爬虫】-005-MongoDB数据库操作-练习

【爬虫】-005-MongoDB数据库操作-练习

作者: 9756a8680596 | 来源:发表于2019-01-03 20:15 被阅读12次

    目标

    1. 爬取小猪短租房源信息,暂定爬取前三页
    2. 列表页爬取信息为
      • 房源的URL
      • 房源的价格
    3. 将爬取的信息存入到数据库
    4. 数据库检索房源
      • 根据价格搜索价格大于某个数值的房源并展示

    代码示例

    # coding: UTF-8
    
    '''
    标题 div.result_btm_con.lodgeunitname > div.result_intro > a.sTitle > span
    
    价格 div.result_btm_con.lodgeunitname > div> span.result_price > i
    
    详情页链接 ul > li > a.resule_img_a
    '''
    
    from bs4 import BeautifulSoup
    from pymongo import MongoClient
    import requests, time
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        'Cookie': 'TY_SESSION_ID=76a72fb7-6086-43eb-8d89-e940bc6e0b7c; xz_guid_4se=23919729-5f3c-490e-a3bd-eaab133bf8a7; _ga=GA1.2.548704816.1530023215; gr_user_id=481ee4b7-fbbb-4780-878f-d557e50f43cd; __utma=29082403.548704816.1530023215.1530104445.1530107679.4; 59a81cc7d8c04307ba183d331c373ef6_gr_last_sent_cs1=N%2FA; grwng_uid=4810f5c6-a7ea-4916-a2dd-0f6eca0735ae; _uab_collina=154600556723767889767416; abtest_ABTest4SearchDate=b; xzuuid=85d7e34e; 59a81cc7d8c04307ba183d331c373ef6_gr_session_id=4aecb865-3f10-45fa-907a-97bb62ec1c73; 59a81cc7d8c04307ba183d331c373ef6_gr_last_sent_sid_with_cs1=4aecb865-3f10-45fa-907a-97bb62ec1c73; 59a81cc7d8c04307ba183d331c373ef6_gr_session_id_4aecb865-3f10-45fa-907a-97bb62ec1c73=true',
    }
    
    host = 'localhost'
    port = 27017
    
    start_page, end_page = 1, 4
    max_payment = 500
    
    # 连接上Mongo DB
    client = MongoClient(host, port)
    
    # 创建数据库
    xz_houses = client['xz_houses']
    
    # 数据库中创建表
    links_prices = xz_houses['links_prices']
    
    def get_links_prices(url):
        time.sleep(3)
        # 发起请求,获取页码数据
        wb_data = requests.get(url, headers=headers)
        # 将页码放入BeautifulSoup进行解析
        soup = BeautifulSoup(wb_data.text, 'lxml')
    
        # 根据页面元素的样式获取需要抓取的内容html标签
        titles = soup.select('div.result_btm_con.lodgeunitname > div.result_intro > a.sTitle > span')
        prices = soup.select('div.result_btm_con.lodgeunitname > div > span.result_price > i')
        links = soup.select('ul > li > a.resule_img_a')
    
        # 标准化数据,存入数据库对应的表中
        for title, price, link in zip(titles, prices, links):
            data = {
                'title': title.get_text(),
                'price': price.get_text(),
                'link': link.get('href'),
            }
            # 插入数据
            links_prices.insert_one(data)
    
    if __name__ == '__main__':
        for i in range(start_page, end_page):
            url = 'http://hz.xiaozhu.com/search-duanzufang-p{page}-0/'.format(page = str(i))
            get_links_prices(url)
    
        # 查找数据库表中价格大于max_payment的房源信息
        print('价格大于{fee}的房源有:'.format(fee=max_payment))
        for item in links_prices.find({'price':{'$gt': str(max_payment)}}):
            print('标题:', item['title'], '\t\t 价格:', item['price'], '\t\t 详情:', item['link'])
    
    

    相关文章

      网友评论

          本文标题:【爬虫】-005-MongoDB数据库操作-练习

          本文链接:https://www.haomeiwen.com/subject/wdbqrqtx.html