Python web scraper: second-hand housing listings in Qingdao

Author: Taodede | Published 2018-12-07 19:15

    I have recently been planning a house-price analysis project, partly to get back into practice with the web-scraping skills I picked up earlier and partly to apply the Tableau charting techniques I have been learning. This project is for learning and exchange only and has no commercial purpose.
    To make sure the data genuinely reflects housing prices in a region, this project scrapes second-hand housing listings from Lianjia (链家网), starting with the Qingdao area as an example.

    Step 1: import the libraries and modules we need. The scraper uses the urllib library and parses pages with XPath via lxml; since I prefer to work with data as a DataFrame, pandas is imported as well.
    import urllib.request   # fetch pages over HTTP
    from lxml import etree   # parse HTML and run XPath queries
    import pandas as pd      # assemble the results into a DataFrame
    
    Step 2: to make the later conversion to a DataFrame go more smoothly, the page-parsing code below is written in a rather fine-grained way; if you are not comfortable with DataFrames, you can use a different data structure instead.
    house_info = []
    for page in range(1,101):    # the second-hand listing index runs to 100 pages
        url = 'https://qd.lianjia.com/ershoufang/pg'+str(page)
        html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        selector = etree.HTML(html)
        # each listing sits in an <li class="clear LOGCLICKDATA"> element
        page_info = selector.xpath('//li[@class="clear LOGCLICKDATA"]')
        print('Scraping page ' + str(page))
        for i in range(len(page_info)):
            house_infor_one = []
            # for every field, fall back to '.' when the XPath matches nothing,
            # so each row keeps the same column order
            title = page_info[i].xpath('div[@class="info clear"]/div[@class="title"]/a/text()')
            house_infor_one.extend(title if title else ['.'])
            way = page_info[i].xpath('div[@class="info clear"]/div[@class="title"]/span/text()')
            house_infor_one.extend(way if way else ['.'])
            road = page_info[i].xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()')
            house_infor_one.extend(road if road else ['.'])
            community = page_info[i].xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()')
            house_infor_one.extend(community if community else ['.'])
            house_des = page_info[i].xpath('div[@class="info clear"]/div[@class="address"]/div/text()')
            house_infor_one.extend(house_des if house_des else ['.'])
            floor = page_info[i].xpath('div[@class="info clear"]/div[@class="flood"]/div/text()')
            house_infor_one.extend(floor if floor else ['.'])
            popularity = page_info[i].xpath('div[@class="info clear"]/div[@class="followInfo"]/text()')
            house_infor_one.extend(popularity if popularity else ['.'])
            subway = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="subway"]/text()')
            house_infor_one.extend(subway if subway else ['.'])
            taxfree = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()')
            house_infor_one.extend(taxfree if taxfree else ['.'])
            haskey = page_info[i].xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()')
            house_infor_one.extend(haskey if haskey else ['.'])
            total_price = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
            house_infor_one.extend(total_price if total_price else ['.'])
            price_unit = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
            house_infor_one.extend(price_unit if price_unit else ['.'])
            per_price = page_info[i].xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')
            house_infor_one.extend(per_price if per_price else ['.'])
            house_info.append(house_infor_one)    # one row per listing
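    Note that urlopen is called above without any request headers. If Lianjia starts rejecting bare requests, one possible workaround (a minimal sketch, not part of the original script; the User-Agent string and the one-second pause are arbitrary choices) is to wrap the URL in a urllib.request.Request with a browser-like User-Agent and to wait briefly between pages:
    import time
    import urllib.request

    def fetch_page(url):
        # attach a browser-like User-Agent; the exact string is only a placeholder
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        return urllib.request.urlopen(req).read().decode('utf-8', 'ignore')

    # inside the page loop, replace the plain urlopen call with:
    # html = fetch_page(url)
    # time.sleep(1)    # be polite and pause between pages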
    
    Step 3: convert the collected rows into a DataFrame, name the columns, and save the result to a local file. With that, the scraping is finished. Note that the column list ends with a 备注 (remarks) entry: some of the XPath queries above can return more than one text node, so the widest rows end up one field longer than the thirteen fields extracted.
    house_df = pd.DataFrame(house_info)
    # column names (in Chinese, in extraction order): title, source tag, road, community, layout,
    # floor, follow info, subway, tax-free, key availability, total price, price unit, unit price, remarks
    house_df.columns = ['房源描述', '房源来源', '房源地址(路)', '小区名称', '户型信息', '楼层', '人气', '距离地铁', '房本情况(个税)', '看房时间(钥匙)', '房源总价', '房源总价单位', '房源单价(平)', '备注']
    house_df.to_excel('D:/Tsingtao.xls')    # .xls output relies on the xlwt engine; newer pandas may need .xlsx instead
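    As a quick optional sanity check (the path simply mirrors the one used above), the saved file can be read back and inspected before moving on to the Tableau work:
    check = pd.read_excel('D:/Tsingtao.xls')   # reading .xls may require the xlrd package
    print(check.shape)    # number of listings and columns actually captured
    print(check.head())   # first few rows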
    
