Data Mining Exp 1 Report: Rental Listing Inquiries
Data Preprocessing
1. Load all attributes of the training set (train.json)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_json("train.json")
train_df.head(3).T
output:
| | 10 | 10000 | 100004 |
|---|---|---|---|
bathrooms | 1.5 | 1 | 1 |
bedrooms | 3 | 2 | 1 |
building_id | 53a5b119ba8f7b61d4e010512e0dfc85 | c5c8a357cba207596b04d1afd1e4f130 | c3ba40552e2120b0acfc3cb5730bb2aa |
created | 2016-06-24 07:54:24 | 2016-06-12 12:19:27 | 2016-04-17 03:26:41 |
description | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Top Top West Village location, beautiful Pre-w... | |
display_address | Metropolitan Avenue | Columbus Avenue | W 13 Street |
features | [] | [Doorman, Elevator, Fitness Center, Cats Allow... | [Laundry In Building, Dishwasher, Hardwood Flo... |
interest_level | medium | low | high |
latitude | 40.7145 | 40.7947 | 40.7388 |
listing_id | 7211212 | 7150865 | 6887163 |
longitude | -73.9425 | -73.9667 | -74.0018 |
manager_id | 5ba989232d0489da1b5f2c45f6688adc | 7533621a882f71e25173b27e3139d83d | d9039c43983f6e564b1482b273bd7b01 |
photos | [https://photos.renthop.com/2/7211212_1ed4542e... | [https://photos.renthop.com/2/7150865_be3306c5... | [https://photos.renthop.com/2/6887163_de85c427... |
price | 3000 | 5465 | 2850 |
street_address | 792 Metropolitan Avenue | 808 Columbus Avenue | 241 W 13 Street |
Check the size of the training set:
train_df.shape
output: (49352, 15)
2. Load all attributes of the test set (test.json)
test_df = pd.read_json("test.json")
test_df.head(3).T
output:
| | 0 | 1 | 100 |
|---|---|---|---|
bathrooms | 1 | 1 | 1 |
bedrooms | 1 | 2 | 1 |
building_id | 79780be1514f645d7e6be99a3de696c5 | 0 | 3dbbb69fd52e0d25131aa1cd459c87eb |
created | 2016-06-11 05:29:41 | 2016-06-24 06:36:34 | 2016-06-03 04:29:40 |
description | Large with awesome terrace--accessible via bed... | Prime Soho - between Bleecker and Houston - Ne... | New York chic has reached a new level ... |
display_address | Suffolk Street | Thompson Street | 101 East 10th Street |
features | [Elevator, Laundry in Building, Laundry in Uni... | [Pre-War, Dogs Allowed, Cats Allowed] | [Doorman, Elevator, No Fee] |
latitude | 40.7185 | 40.7278 | 40.7306 |
listing_id | 7142618 | 7210040 | 7103890 |
longitude | -73.9865 | -74 | -73.989 |
manager_id | b1b1852c416d78d7765d746cb1b8921f | d0b5648017832b2427eeb9956d966a14 | 9ca6f3baa475c37a3b3521a394d65467 |
photos | [https://photos.renthop.com/2/7142618_1c45a2c8... | [https://photos.renthop.com/2/7210040_d824cc71... | [https://photos.renthop.com/2/7103890_85b33077... |
price | 2950 | 2850 | 3758 |
street_address | 99 Suffolk Street | 176 Thompson Street | 101 East 10th Street |
Check the size of the test set:
test_df.shape
output: (74659, 14)
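The test set has one column fewer than the training set. A quick check (a minimal sketch using the two DataFrames loaded above) confirms that the missing column is the prediction target interest_level:
# The column present in train.json but absent from test.json is the target to predict
print(set(train_df.columns) - set(test_df.columns))  # expected: {'interest_level'}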
3. Inspect the data
View the DataFrame info:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49352 entries, 10 to 99994
Data columns (total 15 columns):
bathrooms | 49352 | non-null | float64 |
bedrooms | 49352 | non-null | int64 |
building_id | 49352 | non-null | object |
created | 49352 | non-null | object |
description | 49352 | non-null | object |
display_address | 49352 | non-null | object |
features | 49352 | non-null | object |
interest_level | 49352 | non-null | object |
latitude | 49352 | non-null | float64 |
listing_id | 49352 | non-null | int64 |
longitude | 49352 | non-null | float64 |
manager_id | 49352 | non-null | object |
photos | 49352 | non-null | object |
price | 49352 | non-null | int64 |
street_address | 49352 | non-null | object |
dtypes: float64(3), int64(3), object(9)
memory usage: 4.3+ MB
Check for missing values:
train_df.isnull().sum()
bathrooms | 0 |
bedrooms | 0 |
building_id | 0 |
created | 0 |
description | 0 |
display_address | 0 |
features | 0 |
interest_level | 0 |
latitude | 0 |
listing_id | 0 |
longitude | 0 |
manager_id | 0 |
photos | 0 |
price | 0 |
street_address | 0 |
dtype: int64
4. Handling the price data
- Check whether price contains outliers:
plt.figure(figsize=(8, 10))
plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
plt.title("Distribution of Price")
- Remove outliers that deviate far from the bulk of the data (cap values above the 99th percentile):
ulimit = np.percentile(train_df.price.values, 99)
# Cap prices above the 99th percentile (.ix was removed from pandas; use .loc instead)
train_df.loc[train_df['price'] > ulimit, 'price'] = ulimit
plt.figure(figsize=(8, 10))
plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
plt.title("Distribution of Price")
5. Handling array-valued data
For the array-like data (features, photos, description), some basic processing can be done first, e.g. counting the number of elements or words, or extracting the feature terms for later use.
train_df['num_photos'] = train_df['photos'].apply(len)
train_df['num_features'] = train_df['features'].apply(len)
train_df['num_description_words'] = train_df['description'].apply(lambda x: len(x.split(' ')))
train_df['num_description_len'] = train_df['description'].apply(len)
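The same counts would normally be added to the test set as well so that train and test share the same feature space; a minimal sketch (not part of the original preprocessing, assuming test_df from step 2):
# Mirror the count features on the test set
test_df['num_photos'] = test_df['photos'].apply(len)
test_df['num_features'] = test_df['features'].apply(len)
test_df['num_description_words'] = test_df['description'].apply(lambda x: len(x.split(' ')))
test_df['num_description_len'] = test_df['description'].apply(len)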
6. Handling the interest_level data
Map the string-valued interest_level to a numeric label and to one-hot encoded columns:
label_num_map = {'high':0, 'medium':1, 'low':2}
train_df['label'] = train_df['interest_level'].apply(lambda x : label_num_map[x])
interest_level = pd.get_dummies(train_df['interest_level'])
train = train_df
train = pd.concat([train, interest_level], axis = 1)
train[['interest_level', 'high', 'low', 'medium']].head()
A sample of the processed result:
| | interest_level | high | low | medium |
|---|---|---|---|---|
10 | medium | 0 | 0 | 1 |
10000 | low | 0 | 1 | 0 |
100004 | high | 1 | 0 | 0 |
100007 | low | 0 | 1 | 0 |
100013 | low | 0 | 1 | 0 |
7. Handling the features data
A text vectorizer can map the features data to a sparse matrix. TF-IDF is one option, although the code below uses a plain CountVectorizer (raw term counts); a TF-IDF sketch is given after the output table.
- 用"null"填充缺失的数据
def fill_empty(row):
    # Replace an empty feature list with a placeholder token
    if row.num_features == 0:
        return ['null']
    return row.features

train['features_filled'] = train.apply(fill_empty, axis=1)
train[['features_filled']].head()
| | features_filled |
|---|---|
10 | [null] |
10000 | [Doorman, Elevator, Fitness Center, Cats Allow... |
100004 | [Laundry In Building, Dishwasher, Hardwood Flo... |
100007 | [Hardwood Floors, No Fee] |
100013 | [Pre-War] |
- Vectorize the feature words with scikit-learn (CountVectorizer here)
from sklearn import model_selection, preprocessing, ensemble
from sklearn.metrics import log_loss
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Join each listing's feature phrases into one whitespace-separated string
train['features_word'] = train['features_filled'].apply(lambda x: " ".join(x))
# Bag-of-words counts over the 200 most frequent terms
vectorizer = CountVectorizer(stop_words='english', max_features=200)
train_sparse = vectorizer.fit_transform(train['features_word'])
train[['features_word']].head()
| | features_word |
|---|---|
10 | null |
10000 | Doorman Elevator Fitness Center Cats Allowed D... |
100004 | Laundry In Building Dishwasher Hardwood Floors... |
100007 | Hardwood Floors No Fee |
100013 | Pre-War |
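Although this subsection mentions TF-IDF, the code above uses CountVectorizer (raw term counts). A minimal TF-IDF sketch, assuming the same features_word column, would be:
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF weighting instead of raw counts; produces a sparse matrix of the same shape
tfidf_vec = TfidfVectorizer(stop_words='english', max_features=200)
train_tfidf_sparse = tfidf_vec.fit_transform(train['features_word'])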
8. Handling the time data
Simply extract the date/time components and store them in separate columns:
# Convert 'created' to datetime once, then extract its components
train['created'] = pd.to_datetime(train['created'])
train['created_year'] = train['created'].dt.year
train['created_month'] = train['created'].dt.month
train['created_day'] = train['created'].dt.day
train['created_hour'] = train['created'].dt.hour
train['created_date'] = train['created'].dt.date
train[['created', 'created_year', 'created_month', 'created_day','created_hour']].head()
| | created | created_year | created_month | created_day | created_hour |
|---|---|---|---|---|---|
10 | 2016-06-24 07:54:24 | 2016 | 6 | 24 | 7 |
10000 | 2016-06-12 12:19:27 | 2016 | 6 | 12 | 12 |
100004 | 2016-04-17 03:26:41 | 2016 | 4 | 17 | 3 |
100007 | 2016-04-18 02:22:02 | 2016 | 4 | 18 | 2 |
100013 | 2016-04-28 01:32:41 | 2016 | 4 | 28 | 1 |
9. Word-cloud processing of the features and description text
from wordcloud import WordCloud

# Build whitespace-separated corpora for the word clouds; multi-word features and
# addresses are joined with "_" so that each stays a single token.
text = ''        # features
text_da = ''     # display addresses
text_desc = ''   # descriptions
for index, row in train_df.iterrows():
    for feature in row['features']:
        text = ' '.join([text, '_'.join(feature.strip().split(' '))])
    text_da = ' '.join([text_da, '_'.join(row['display_address'].strip().split(' '))])
    text_desc = ' '.join([text_desc, row['description']])
text = text.strip()
text_da = text_da.strip()
text_desc = text_desc.strip()
Exploratory Data Analysis
1. Distribution of interest_level
level_count = train['interest_level'].value_counts()
plt.figure(figsize=(8,4))
bar_width = 0.5
index = np.arange(len(level_count.values))
plt.bar(index - bar_width/2, level_count.values, bar_width, alpha=0.8)
plt.xticks(index, level_count.index)
plt.ylabel('Interest level count', fontsize=12)
plt.xlabel('Interest level', fontsize=12)
plt.show()
Distribution of interest_level
2. Geographic distribution
- Latitude distribution
import seaborn as sns

llimit = np.percentile(train.latitude.values, 1)
ulimit = np.percentile(train.latitude.values, 99)
# Clip coordinates outside the 1st-99th percentile range (avoid chained assignment)
train.loc[train['latitude'] < llimit, 'latitude'] = llimit
train.loc[train['latitude'] > ulimit, 'latitude'] = ulimit
plt.figure(figsize=(10, 5))
sns.distplot(train.latitude.values, bins=50, kde=True, color='blue')
plt.xlabel('latitude', fontsize=12)
plt.show()
- Longitude distribution
llimit = np.percentile(train.longitude.values, 1)
ulimit = np.percentile(train.longitude.values, 99)
train.loc[train['longitude'] < llimit, 'longitude'] = llimit
train.loc[train['longitude'] > ulimit, 'longitude'] = ulimit
plt.figure(figsize=(12, 6))
sns.distplot(train.longitude.values, bins=50)
plt.xlabel('longitude', fontsize=12)
plt.show()
The listings in the training set are clustered within a fairly small area; the location information is visualized further below.
- Plot the listing locations on Google Earth (via a GPX export)
import gpxpy.gpx

gpx = gpxpy.gpx.GPX()
for index, row in train.iterrows():
    # Only export high-interest listings; exporting every listing makes Google Earth sluggish
    if row['interest_level'] == 'high':
        gps_waypoint = gpxpy.gpx.GPXWaypoint(row['latitude'], row['longitude'], elevation=10)
        gpx.waypoints.append(gps_waypoint)

with open("GoogleEarth.gpx", "w") as f:
    f.write(gpx.to_xml())
The map shows that the listed rentals are mainly concentrated in the central part of New York City.
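If Google Earth is not available, a plain matplotlib scatter of the clipped coordinates gives a rough view of the same clustering (a sketch reusing the latitude/longitude columns after outlier clipping):
plt.figure(figsize=(8, 8))
plt.scatter(train['longitude'], train['latitude'], s=1, alpha=0.3)
plt.xlabel('longitude', fontsize=12)
plt.ylabel('latitude', fontsize=12)
plt.title('Listing locations (clipped coordinates)')
plt.show()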
3. Word clouds for the description and features text
- Description word cloud
plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300,
max_font_size=50, max_words=40).generate(text_desc)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Description", fontsize=20)
plt.axis("off")
plt.show()
- Features word cloud
plt.figure(figsize=(14,7))
wordcloud = WordCloud(background_color='white',width=600,height=300,max_font_size=50,max_words=40).generate(text)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Features Distribution",fontsize=12)
plt.axis("off")
plt.show()
The word clouds reveal the high-frequency words among the listings' features, which overlap to a large extent with the feature words obtained by the vectorizer in the data preprocessing step.
4. Distribution of posting time
- Posting date distribution
cnt_srs = train['created_date'].value_counts()
plt.figure(figsize=(12,4))
ax = plt.subplot(111)
ax.bar(cnt_srs.index, cnt_srs.values, color='g')
ax.xaxis_date()
plt.xticks(rotation='vertical')
plt.show()
The chart shows that, overall, listings were posted fairly evenly from April to June 2016.
- Posting hour distribution
# Listings per posting hour, split by interest level (the grouping key is created_hour)
hour_feature = train.groupby(['created_hour', 'interest_level'])['listing_id'].count().unstack('interest_level').fillna(0)
hour_feature[['low', 'medium', 'high']].plot(kind='bar', stacked=True)
plt.show()
The chart shows that the early-morning hours (1 a.m. to 6 a.m.) are the peak posting period, while far fewer listings are posted at other times. interest_level shows little correlation with created_hour.
5. Top-performing managers
Using the number of high-interest listings as the measure of a manager's performance, filter out the ten managers with the strongest performance.
manager = train.groupby(['manager_id', 'interest_level'])['manager_id'].count().unstack('interest_level').fillna(0)
manager["sum"] = manager.sum(axis=1)
x = manager[(manager['sum'] > 80) & (manager['high'] > 30)]
x = x[x.index != 'e6472c7237327dd3903b3d6f6a94515a']  # drop one specific manager id from the chart
x = x.sort_values("sum")  # sort a copy instead of sorting the slice in place
x["sum"].plot(kind='barh', color="r")
plt.title("High Profile Managers")
plt.xlabel("Number of listings")
plt.ylabel("Manager ID")
plt.show()
6. Popular buildings
Using the number of high-interest listings as the measure of a building's popularity, filter out the ten most popular buildings.
building = train.groupby(['building_id', 'interest_level'])['building_id'].count().unstack('interest_level').fillna(0)
building["sum"]=building.sum(axis=1)
x=building[ (building['sum'] > 60) & (building['high'] > 15)]
building.head()
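For parity with the manager chart, the filtered buildings can be plotted the same way; a sketch reusing the x selection above:
x = x.sort_values("sum")
x["sum"].plot(kind='barh', color="b")
plt.title("Popular Buildings")
plt.xlabel("Number of listings")
plt.ylabel("Building ID")
plt.show()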
Feature Engineering
1. Bedroom count vs. interest level features
These features are adapted from the data preprocessing lecture slides.
For each bedroom count, compute the number of listings at each interest level and the corresponding ratios.
# Per-bedroom-count interest-level counts and ratios
bedroom_features = train.groupby('bedrooms')[['high', 'low', 'medium']].sum().add_suffix('_count').reset_index()
bedroom_features['all_count'] = train.groupby('bedrooms')['price'].count().reset_index().price
bedroom_features['high_ratio'] = bedroom_features['high_count'] / bedroom_features['all_count']
bedroom_features['low_ratio'] = bedroom_features['low_count'] / bedroom_features['all_count']
bedroom_features['medium_ratio'] = bedroom_features['medium_count'] / bedroom_features['all_count']
bedroom_features.head()
| | bedrooms | high_count | low_count | medium_count | all_count | high_ratio | low_ratio | medium_ratio |
|---|---|---|---|---|---|---|---|---|
0 | 0 | 847.0 | 6518.0 | 2110.0 | 9475 | 0.089393 | 0.687916 | 0.222691 |
1 | 1 | 943.0 | 11715.0 | 3094.0 | 15752 | 0.059865 | 0.743715 | 0.196420 |
2 | 2 | 1306.0 | 9801.0 | 3516.0 | 14623 | 0.089311 | 0.670246 | 0.240443 |
3 | 3 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.649670 | 0.268966 |
4 | 4 | 149.0 | 1233.0 | 547.0 | 1929 | 0.077242 | 0.639191 | 0.283567 |
Merge the new features into the training set:
result = pd.merge(train, bedroom_features, on='bedrooms')
result.head()
| | bathrooms | bedrooms | building_id | created | description | display_address | features | interest_level | latitude | listing_id | ... | created_year | created_month | created_day | high_count | low_count | medium_count | all_count | high_ratio | low_ratio | medium_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | medium | 40.7145 | 7211212 | ... | 2016 | 6 | 24 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.64967 | 0.268966 |
1 | 1.0 | 3 | 205f95d4a78f1f3befda48b89edc9669 | 2016-04-12 02:39:45 | BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T... | Madison Avenue | [Doorman, Elevator, Dishwasher, Hardwood Floors] | low | 40.7454 | 6858062 | ... | 2016 | 4 | 12 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.64967 | 0.268966 |
2 | 1.0 | 3 | be6b7c3fdf3f63a2756306f4af7788a6 | 2016-04-18 04:46:30 | These pictures are from a similarlisting. | Thompson St | [Washer/Dryer] | low | 40.7231 | 6890563 | ... | 2016 | 4 | 18 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.64967 | 0.268966 |
3 | 1.0 | 3 | 6e9276972669ddcb097e71691f4c8f1b | 2016-04-26 11:38:19 | TRUE 3 bedroom! RENOVATED! Windows Everywhere!... | 540 w 52nd street, new york, ny | [Common Outdoor Space, Cats Allowed, Dogs Allo... | medium | 40.7660 | 6927880 | ... | 2016 | 4 | 26 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.64967 | 0.268966 |
4 | 2.0 | 3 | 9ba2f2719b9ab170812a9ea79a3bd84c | 2016-04-10 02:41:58 | !!! NO FEE !!!It is a community-oriented, -fri... | North Moore St. | [Swimming Pool, Doorman, Elevator, Laundry in ... | low | 40.7196 | 6851714 | ... | 2016 | 4 | 10 | 592.0 | 4727.0 | 1957.0 | 7276 | 0.081363 | 0.64967 | 0.268966 |
5 rows × 35 columns
use_features = ['bedrooms', 'bathrooms', 'price', 'high_count', 'low_count', 'medium_count', 'high_ratio', 'low_ratio', 'medium_ratio']
result_x = result[use_features]
# Take the labels from `result` rather than `train`: the merge reorders rows, so train's index no longer lines up
result_y = result[['label']]
result_x.head()
| | bedrooms | bathrooms | price | high_count | low_count | medium_count | high_ratio | low_ratio | medium_ratio |
|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 1.5 | 3000.0 | 592.0 | 4727.0 | 1957.0 | 0.081363 | 0.64967 | 0.268966 |
1 | 3 | 1.0 | 4395.0 | 592.0 | 4727.0 | 1957.0 | 0.081363 | 0.64967 | 0.268966 |
2 | 3 | 1.0 | 3733.0 | 592.0 | 4727.0 | 1957.0 | 0.081363 | 0.64967 | 0.268966 |
3 | 3 | 1.0 | 4500.0 | 592.0 | 4727.0 | 1957.0 | 0.081363 | 0.64967 | 0.268966 |
4 | 3 | 2.0 | 6320.0 | 592.0 | 4727.0 | 1957.0 | 0.081363 | 0.64967 | 0.268966 |
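The model_selection, ensemble and log_loss imports from the preprocessing section are not used above. A minimal sketch of how result_x / result_y could feed a baseline model (a random forest with a hold-out split, scored with multi-class log loss; illustrative only, not the report's final model):
X_train, X_val, y_train, y_val = model_selection.train_test_split(
    result_x, result_y['label'], test_size=0.3, random_state=0)
clf = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
# predict_proba columns follow clf.classes_, i.e. labels 0 (high), 1 (medium), 2 (low)
print(log_loss(y_val, clf.predict_proba(X_val)))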
2. Feature extraction from features
From the features word cloud obtained in the EDA section, "Elevator", "Dogs Allowed" and "Cats Allowed" stand out as important descriptors of a listing's living conditions. My hypothesis is that the more of these three conditions a listing meets, the more tenant-friendly it is and the higher the interest it attracts. Boolean flags indicating whether each listing's features list contains each of the three conditions are therefore added to the training set train as new features.
def newfeat(name, df, series):
    # Create a 0/1 indicator column: 1 if `name` appears in the listing's feature list
    feature = pd.Series(0, df.index, name=name)
    for row, word in enumerate(series):
        if name in word:
            feature.iloc[row] = 1
    df[name] = feature
    return df

train = newfeat('Elevator', train, train.features)
train = newfeat('Dogs Allowed', train, train.features)
train = newfeat('Cats Allowed', train, train.features)
train.head()
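A quick sanity check of the hypothesis (a sketch reusing the one-hot high/medium/low columns created earlier) compares interest-level shares with and without each amenity:
for col in ['Elevator', 'Dogs Allowed', 'Cats Allowed']:
    # Mean of the one-hot columns = share of listings at each interest level
    print(train.groupby(col)[['high', 'medium', 'low']].mean())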
3. Time feature extraction
The posting times span a wide range. Since tenants are likely to be more interested in recent listings, the time gap between each listing's posting time and the most recent posting is extracted as a new feature.
# Gap between each posting and the most recent posting (the 'created' column was converted to datetime earlier)
train['passed'] = train['created'].max() - train['created']
train.head()
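Since passed is a Timedelta column, converting it to a numeric scale (hours here) makes it directly usable as a model feature; a minimal sketch:
# Convert the Timedelta gap into hours so it can be fed to a model as a plain number
train['passed_hours'] = train['passed'].dt.total_seconds() / 3600.0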