作者: 四咸一声如裂帛 | 来源:发表于2018-04-02 20:30 被阅读0次

    Data Mining Exp 1 Report: Rental Listing Inquiries


    Data Preprocessing

    1. 获取训练集(train.json)中的所有属性
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    train_df = pd.read_json("train.json")
    train_df.head(3).T
    

    output:

    10 10000 100004
    bathrooms 1.5 1 1
    bedrooms 3 2 1
    building_id 53a5b119ba8f7b61d4e010512e0dfc85 c5c8a357cba207596b04d1afd1e4f130 c3ba40552e2120b0acfc3cb5730bb2aa
    created 2016-06-24 07:54:24 2016-06-12 12:19:27 2016-04-17 03:26:41
    description A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Top Top West Village location, beautiful Pre-w...
    display_address Metropolitan Avenue Columbus Avenue W 13 Street
    features [] [Doorman, Elevator, Fitness Center, Cats Allow... [Laundry In Building, Dishwasher, Hardwood Flo...
    interest_level medium low high
    latitude 40.7145 40.7947 40.7388
    listing_id 7211212 7150865 6887163
    longitude -73.9425 -73.9667 -74.0018
    manager_id 5ba989232d0489da1b5f2c45f6688adc 7533621a882f71e25173b27e3139d83d d9039c43983f6e564b1482b273bd7b01
    photos [https://photos.renthop.com/2/7211212_1ed4542e... [https://photos.renthop.com/2/7150865_be3306c5... [https://photos.renthop.com/2/6887163_de85c427...
    price 3000 5465 2850
    street_address 792 Metropolitan Avenue 808 Columbus Avenue 241 W 13 Street

    查看训练集大小

    train_df.shape
    

    output: (49352, 15)

    2. 获取测试集(test.json)中的所有属性
    test_df = pd.read_json("test.json")
    test_df.head(3).T
    

    output:

    0 1 100
    bathrooms 1 1 1
    bedrooms 1 2 1
    building_id 79780be1514f645d7e6be99a3de696c5 0 3dbbb69fd52e0d25131aa1cd459c87eb
    created 2016-06-11 05:29:41 2016-06-24 06:36:34 2016-06-03 04:29:40
    description Large with awesome terrace--accessible via bed... Prime Soho - between Bleecker and Houston - Ne... New York chic has reached a new level ...
    display_address Suffolk Street Thompson Street 101 East 10th Street
    features [Elevator, Laundry in Building, Laundry in Uni... [Pre-War, Dogs Allowed, Cats Allowed] [Doorman, Elevator, No Fee]
    latitude 40.7185 40.7278 40.7306
    listing_id 7142618 7210040 7103890
    longitude -73.9865 -74 -73.989
    manager_id b1b1852c416d78d7765d746cb1b8921f d0b5648017832b2427eeb9956d966a14 9ca6f3baa475c37a3b3521a394d65467
    photos [https://photos.renthop.com/2/7142618_1c45a2c8... [https://photos.renthop.com/2/7210040_d824cc71... [https://photos.renthop.com/2/7103890_85b33077...
    price 2950 2850 3758
    street_address 99 Suffolk Street 176 Thompson Street 101 East 10th Street

    查看测试集大小

    test_df.shape
    

    output: (74659, 14)

    3. 观察数据信息

    查看dataframe信息

    train_df.info()
    

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 49352 entries, 10 to 99994
    Data columns (total 15 columns):

    bathrooms 49352 non-null float64
    bedrooms 49352 non-null int64
    building_id 49352 non-null object
    created 49352 non-null object
    description 49352 non-null object
    display_address 49352 non-null object
    features 49352 non-null object
    interest_level 49352 non-null object
    latitude 49352 non-null float64
    listing_id 49352 non-null int64
    longitude 49352 non-null float64
    manager_id 49352 non-null object
    photos 49352 non-null object
    price 49352 non-null int64
    street_address 49352 non-null object

    dtypes: float64(3), int64(3), object(9)
    memory usage: 4.3+ MB

    查看数据是否存在缺失值:

    train_df.isnull().sum()
    
    bathrooms 0
    bedrooms 0
    building_id 0
    created 0
    description 0
    display_address 0
    features 0
    interest_level 0
    latitude 0
    listing_id 0
    longitude 0
    manager_id 0
    photos 0
    price 0
    street_address 0

    dtype: int64

    4. 处理price数据
    • 观察price是否存在异常点:
    plt.figure(figsize=(8, 10))
    plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
    plt.title("Distribution of Price")
    
    • 除去偏离总体值过远的异常点(>%99的部分)
    ulimit = np.percentile(train_df.price.values,99)
    train_df['price'].ix[train_df['price']>ulimit] = ulimit
    
    plt.figure(figsize=(8, 10))
    plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
    plt.title("Distribution of Price")
    
    5. 处理数组型数据

    对于数组形式的数据:features、photos、description,可以先做基本处理,例如:统计数组中元素个数、提取特征词逐个保存等。

    train_df['num_photos'] = train_df['photos'].apply(len)
    train_df['num_features'] = train_df['features'].apply(len)
    train_df['num_description_words'] = train_df['description'].apply(lambda x: len(x.split(' ')))
    train_df['num_sescription_len'] = train_df['description'].apply(len)
    
    6. 处理Interest_level数据

    将string形式存储的Interest_level映射为One hot 编码

    label_num_map = {'high':0, 'medium':1, 'low':2}
    train_df['label'] = train_df['interest_level'].apply(lambda x : label_num_map[x])
    interest_level = pd.get_dummies(train_df['interest_level'])
    train = train_df
    train = pd.concat([train, interest_level], axis = 1)
    train[['interest_level', 'high', 'low', 'medium']].head()
    

    处理结果示例如下:

    interest_level high low medium
    10 medium 0 0 1
    10000 low 0 1 0
    100004 high 1 0 0
    100007 low 0 1 0
    100013 low 0 1 0
    6. 处理Features数据

    可以使用TF-IDF算法将Features数据映射成稀疏矩阵

    • 用"null"填充缺失的数据
    def fill_empty(row):
     if row.num_features == 0:
     return ['null']
     return row.features
    train['features_filled'] = train.apply(lambda row: fill_empty(row), axis = 1)
    train[['features_filled']].head()
    
    features_filled
    10 [null]
    10000 [Doorman, Elevator, Fitness Center, Cats Allow...
    100004 [Laundry In Building, Dishwasher, Hardwood Flo...
    100007 [Hardwood Floors, No Fee]
    100013 [Pre-War]
    • 调用sklearn中的TF-IDF算法进行处理
    from sklearn import model_selection, preprocessing, ensemble
    from sklearn.metrics import log_loss
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    train['features_word'] = train['features_filled'].apply(lambda x:" ".join([" ".join(i.split(" "))for i in x]))
    tfidf = CountVectorizer(stop_words='english', max_features=200 )
    train_sparse = tfidf.fit_transform(train['features_word'])
    train[['features_word']].head()
    
    features_word
    10 null
    10000 Doorman Elevator Fitness Center Cats Allowed D...
    100004 Laundry In Building Dishwasher Hardwood Floors...
    100007 Hardwood Floors No Fee
    100013 Pre-War
    7. 处理时间数据

    简单提取,分别保存即可

    train['created_year'] = pd.to_datetime(train['created']).dt.year
    train['created_month'] = pd.to_datetime(train['created']).dt.month
    train['created_day'] = pd.to_datetime(train['created']).dt.day
    train['created_hour'] = pd.to_datetime(train['created']).dt.hour
    train['created_date'] = train['created'].dt.date
    train[['created', 'created_year', 'created_month', 'created_day','created_hour']].head()
    
    created created_year created_month created_day created_hour
    10 2016-06-24 07:54:24 2016 6 24 7
    10000 2016-06-12 12:19:27 2016 6 12 12
    100004 2016-04-17 03:26:41 2016 4 17 3
    100007 2016-04-18 02:22:02 2016 4 18 2
    100013 2016-04-28 01:32:41 2016 4 28 1
    7. 词云处理Features,Descriptions文本数据
    from wordcloud import WordCloud
    text = ''
    text_da = ''
    for index,row in train_df.iterrows():
        for feature in row['features']:
            text = ' '.join([text,"_".join(feature.strip().split(" "))])
            text_da = " ".join([text_da,"_".join(row['display_address'].strip().split(" "))])
    text = text.strip()
    text_da = text_da.strip()
    
    text_desc =''
    for ind, row in train.iterrows():
        for feature in row['features']:
            text = " ".join([text, "_".join(feature.strip().split(" "))])
        text_desc=" ".join([text_desc, row['description']])
    text = text.strip()
    text_desc = text_desc.strip()
    

    Exploratory Data Analysis

    1. Interest_level统计分布
    level_count = train['interest_level'].value_counts()
    
    plt.figure(figsize=(8,4))
    bar_width = 0.5
    index = np.arange(len(level_count.values))
    plt.bar(index - bar_width/2, level_count.values, bar_width, alpha=0.8)
    
    plt.xticks(index, level_count.index)
    plt.ylabel('Interest level count', fontsize=12)
    plt.xlabel('Interest level', fontsize=12)
    plt.show()
    
    Distribution of interest_level
    2. 地理位置分布
    • 房屋纬度分布
    llimit = np.percentile(train.latitude.values,1)
    ulimit = np.percentile(train.latitude.values,99)
    train['latitude'].loc[train['latitude']<llimit]=llimit
    train['latitude'].loc[train['latitude']>ulimit]=ulimit
    plt.figure(figsize=(10,5))
    sns.distplot(train.latitude.values,bins=50,kde=True,color='blue')
    plt.xlabel('latitude',fontsize=12)
    plt.show()
    
    • 房屋经度分布
    llimit = np.percentile(train.longitude.values,1)
    ulimit = np.percentile(train.longitude.values,99)
    train['longitude'].loc[train['longitude']<llimit]=llimit
    train['longitude'].loc[train['longitude']>ulimit]=ulimit
    plt.figure(figsize=(12,6))
    sns.distplot(train.longitude.values,bins=50)
    plt.xlabel('longitude',fontsize=12)
    plt.show()
    

    可见训练集中的住房分布聚集在一片较小的区域内。将位置信息进一步可视化处理。

    • 将房屋位置信息分布到Google Map上
    import gpxpy as gpx
    import gpxpy.gpx
    gpx = gpxpy.gpx.GPX()
    for index, row in train.iterrows():
        if row['interest_level'] == 'high': #opting for all nominals results in poor performance of Google Earth
            gps_waypoint = gpxpy.gpx.GPXWaypoint(row['latitude'],row['longitude'],elevation=10)
            gpx.waypoints.append(gps_waypoint)        
    filename = "GoogleEarth.gpx"
    FILE = open(filename,"w")
    FILE.writelines(gpx.to_xml())
    FILE.close()
    

    观察地图可以发现出租的房屋主要分布在New York City中部地区。

    3. Description和Features文本数据词云分布
    • Description词云生成
    plt.figure(figsize=(12,6))
    wordcloud = WordCloud(background_color='white', width=600, height=300,
                          max_font_size=50, max_words=40).generate(text_desc)
    wordcloud.recolor(random_state=0)
    plt.imshow(wordcloud)
    plt.title("Wordcloud for Description", fontsize=20)
    plt.axis("off")
    plt.show()
    
    • Features词云生成
    plt.figure(figsize=(14,7))
    wordcloud = WordCloud(background_color='white',width=600,height=300,max_font_size=50,max_words=40).generate(text)
    wordcloud.recolor(random_state=0)
    plt.imshow(wordcloud)
    plt.title("Wordcloud for Features Distribution",fontsize=12)
    plt.axis("off")
    plt.show()
    

    观察词云可大致得出房屋Features中出现的高频词,与data preprocessing环节中TF-IDF算法得到的feature_words一定程度上重叠。

    4. 发布时间分布
    • 发布日期分布
    cnt_srs = train['created_date'].value_counts()
    plt.figure(figsize=(12,4))
    ax = plt.subplot(111)
    ax.bar(cnt_srs.index, cnt_srs.values, color='g')
    ax.xaxis_date()
    plt.xticks(rotation='vertical')
    plt.show()
    

    由图,总体来看房屋信息在2016年的4-6月间均匀发布。

    • 发布时间段分布
    hour_feature = train.groupby(['hour', 'interest_level'])['created_hour'].count().unstack('interest_level').fillna(0)
    hour_feature[['low','medium',"high"]].plot(kind='bar', stacked=True)
    

    由图,每天的凌晨1点至6点是信息发布的高峰时间段,其他时间段信息发布较少。同时interest_level与created_hour的关联性不强。

    5. 统计强业务能力Manager

    以high interest level房屋数目为Manager能力衡量标准,筛选出业务能力前十的Manager。

    manager = train.groupby(['manager_id', 'interest_level'])['manager_id'].count().unstack('interest_level').fillna(0)
    manager["sum"]=manager.sum(axis=1)
    x=manager[ (manager['sum'] > 80) & (manager['high'] > 30)]
    manager.head()
    x.sort_values("sum",inplace=True)
    plt.title("High Profile Managers")
    plt.xlabel("Num of High Interest level")
    plt.ylabel("Manager ID")
    x = x[x.index != 'e6472c7237327dd3903b3d6f6a94515a']
    x["sum"].plot(kind='barh',color="r");
    
    6. 统计受欢迎建筑(building)

    以high interest level房屋数目为building的interest level衡量标准,筛选出前十受欢迎的建筑。

    building = train.groupby(['building_id', 'interest_level'])['building_id'].count().unstack('interest_level').fillna(0)
    building["sum"]=building.sum(axis=1)
    x=building[ (building['sum'] > 60) & (building['high'] > 15)]
    building.head()
    

    Feature Engineering

    1. 房间卧室数目-兴趣度特征

    本特征引用自数据预处理应用课件

    获取对不同房间时数目的房屋感兴趣程度的数目以及兴趣程度占比

    bedroom_features = train.groupby(['bedrooms'])['high', 'low', 'medium'].sum().add_suffix('_count').reset_index()
    bedroom_features['all_count'] = train.groupby(['bedrooms'])['price'].count().reset_index().price
    bedroom_features['high_ratio'] = bedroom_features['high_count']/bedroom_features['all_count']
    bedroom_features['low_ratio'] = bedroom_features['low_count']/bedroom_features['all_count']
    bedroom_features['medium_ratio'] = bedroom_features['medium_count']/bedroom_features['all_count']
    bedroom_features.head()
    
    bedrooms high_count low_count medium_count all_count high_ratio low_ratio medium_ratio
    0 0 847.0 6518.0 2110.0 9475 0.089393 0.687916 0.222691
    1 1 943.0 11715.0 3094.0 15752 0.059865 0.743715 0.196420
    2 2 1306.0 9801.0 3516.0 14623 0.089311 0.670246 0.240443
    3 3 592.0 4727.0 1957.0 7276 0.081363 0.649670 0.268966
    4 4 149.0 1233.0 547.0 1929 0.077242 0.639191 0.283567

    将新特征整合到训练集中

    result = pd.merge(train, bedroom_features)
    
    result.head()
    
    bathrooms bedrooms building_id created description display_address features interest_level latitude listing_id ... created_year created_month created_day high_count low_count medium_count all_count high_ratio low_ratio medium_ratio
    0 1.5 3 53a5b119ba8f7b61d4e010512e0dfc85 2016-06-24 07:54:24 A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... Metropolitan Avenue [] medium 40.7145 7211212 ... 2016 6 24 592.0 4727.0 1957.0 7276 0.081363 0.64967 0.268966
    1 1.0 3 205f95d4a78f1f3befda48b89edc9669 2016-04-12 02:39:45 BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T... Madison Avenue [Doorman, Elevator, Dishwasher, Hardwood Floors] low 40.7454 6858062 ... 2016 4 12 592.0 4727.0 1957.0 7276 0.081363 0.64967 0.268966
    2 1.0 3 be6b7c3fdf3f63a2756306f4af7788a6 2016-04-18 04:46:30 These pictures are from a similarlisting. Thompson St [Washer/Dryer] low 40.7231 6890563 ... 2016 4 18 592.0 4727.0 1957.0 7276 0.081363 0.64967 0.268966
    3 1.0 3 6e9276972669ddcb097e71691f4c8f1b 2016-04-26 11:38:19 TRUE 3 bedroom! RENOVATED! Windows Everywhere!... 540 w 52nd street, new york, ny [Common Outdoor Space, Cats Allowed, Dogs Allo... medium 40.7660 6927880 ... 2016 4 26 592.0 4727.0 1957.0 7276 0.081363 0.64967 0.268966
    4 2.0 3 9ba2f2719b9ab170812a9ea79a3bd84c 2016-04-10 02:41:58 !!! NO FEE !!!It is a community-oriented, -fri... North Moore St. [Swimming Pool, Doorman, Elevator, Laundry in ... low 40.7196 6851714 ... 2016 4 10 592.0 4727.0 1957.0 7276 0.081363 0.64967 0.268966

    5 rows × 35 columns

    use_features = ['bedrooms','bathrooms','price','high_count','low_count','medium_count','high_ratio','low_ratio','medium_ratio']
    result_x = result[use_features]
    result_y = train[['label']]
    
    result_x.head()
    
    bedrooms bathrooms price high_count low_count medium_count high_ratio low_ratio medium_ratio
    0 3 1.5 3000.0 592.0 4727.0 1957.0 0.081363 0.64967 0.268966
    1 3 1.0 4395.0 592.0 4727.0 1957.0 0.081363 0.64967 0.268966
    2 3 1.0 3733.0 592.0 4727.0 1957.0 0.081363 0.64967 0.268966
    3 3 1.0 4500.0 592.0 4727.0 1957.0 0.081363 0.64967 0.268966
    4 3 2.0 6320.0 592.0 4727.0 1957.0 0.081363 0.64967 0.268966
    2. Features特征提取

    通过观察EDA部分获取的Features词云,我发现“Elevator”,“Dogs_allowed”,“Cats_allowed”
    是描述房屋环境的重要条件。因此我预测,三者符合数目越多,房屋条件越亲民,租客感兴趣程度就越高。
    因此用bool值描述各个房屋的description中是否包括这三个条件,做为新特征合并到训练集train中。

    def newfeat(name, df, series):
        """Create a Series for my feature building loop to fill"""
        feature = pd.Series(0, df.index, name=name)
        """Now populate the new Series with numeric values"""
        for row, word in enumerate(series):
            if name in word:
                feature.iloc[row] = 1
        df[name] = feature
        return(df)
       
    train = newfeat('Elevator', train, train.features)
    train = newfeat('Dogs Allowed', train, train.features)
    train = newfeat('Cats Allowed', train, train.features)
    train.head()
    
    3.时间特征提取

    房屋的信息发布时间分布跨度较大,考虑到租客可能对新信息的兴趣较大,因此提取发布信息距离最近一条信息的时间间隔为新特征。

    train['passed'] = train['date'].max()-train['date']
    train.head()
    

    相关文章

      网友评论

        本文标题:

        本文链接:https://www.haomeiwen.com/subject/pzgchftx.html