作者: 四咸一声如裂帛 | 来源:发表于2018-04-02 20:30 被阅读0次

Data Mining Exp 1 Report: Rental Listing Inquiries

Data Preprocessing

1. 获取训练集(train.json)中的所有属性

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

train_df = pd.read_json("train.json")
train_df.head(3).T

output:

	10	10000	100004
bathrooms	1.5	1	1
bedrooms	3	2	1
building_id	53a5b119ba8f7b61d4e010512e0dfc85	c5c8a357cba207596b04d1afd1e4f130	c3ba40552e2120b0acfc3cb5730bb2aa
created	2016-06-24 07:54:24	2016-06-12 12:19:27	2016-04-17 03:26:41
description	A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...		Top Top West Village location, beautiful Pre-w...
display_address	Metropolitan Avenue	Columbus Avenue	W 13 Street
features	[]	[Doorman, Elevator, Fitness Center, Cats Allow...	[Laundry In Building, Dishwasher, Hardwood Flo...
interest_level	medium	low	high
latitude	40.7145	40.7947	40.7388
listing_id	7211212	7150865	6887163
longitude	-73.9425	-73.9667	-74.0018
manager_id	5ba989232d0489da1b5f2c45f6688adc	7533621a882f71e25173b27e3139d83d	d9039c43983f6e564b1482b273bd7b01
photos	[https://photos.renthop.com/2/7211212_1ed4542e...	[https://photos.renthop.com/2/7150865_be3306c5...	[https://photos.renthop.com/2/6887163_de85c427...
price	3000	5465	2850
street_address	792 Metropolitan Avenue	808 Columbus Avenue	241 W 13 Street

查看训练集大小

train_df.shape

output: (49352, 15)

2. 获取测试集(test.json)中的所有属性

test_df = pd.read_json("test.json")
test_df.head(3).T

output:

	0	1	100
bathrooms	1	1	1
bedrooms	1	2	1
building_id	79780be1514f645d7e6be99a3de696c5	0	3dbbb69fd52e0d25131aa1cd459c87eb
created	2016-06-11 05:29:41	2016-06-24 06:36:34	2016-06-03 04:29:40
description	Large with awesome terrace--accessible via bed...	Prime Soho - between Bleecker and Houston - Ne...	New York chic has reached a new level ...
display_address	Suffolk Street	Thompson Street	101 East 10th Street
features	[Elevator, Laundry in Building, Laundry in Uni...	[Pre-War, Dogs Allowed, Cats Allowed]	[Doorman, Elevator, No Fee]
latitude	40.7185	40.7278	40.7306
listing_id	7142618	7210040	7103890
longitude	-73.9865	-74	-73.989
manager_id	b1b1852c416d78d7765d746cb1b8921f	d0b5648017832b2427eeb9956d966a14	9ca6f3baa475c37a3b3521a394d65467
photos	[https://photos.renthop.com/2/7142618_1c45a2c8...	[https://photos.renthop.com/2/7210040_d824cc71...	[https://photos.renthop.com/2/7103890_85b33077...
price	2950	2850	3758
street_address	99 Suffolk Street	176 Thompson Street	101 East 10th Street

查看测试集大小

test_df.shape

output: (74659, 14)

3. 观察数据信息

查看dataframe信息

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49352 entries, 10 to 99994
Data columns (total 15 columns):


bathrooms	49352	non-null	float64
bedrooms	49352	non-null	int64
building_id	49352	non-null	object
created	49352	non-null	object
description	49352	non-null	object
display_address	49352	non-null	object
features	49352	non-null	object
interest_level	49352	non-null	object
latitude	49352	non-null	float64
listing_id	49352	non-null	int64
longitude	49352	non-null	float64
manager_id	49352	non-null	object
photos	49352	non-null	object
price	49352	non-null	int64
street_address	49352	non-null	object

dtypes: float64(3), int64(3), object(9)
memory usage: 4.3+ MB

查看数据是否存在缺失值：

train_df.isnull().sum()


bathrooms	0
bedrooms	0
building_id	0
created	0
description	0
display_address	0
features	0
interest_level	0
latitude	0
listing_id	0
longitude	0
manager_id	0
photos	0
price	0
street_address	0

dtype: int64

4. 处理price数据

观察price是否存在异常点：

plt.figure(figsize=(8, 10))
plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
plt.title("Distribution of Price")

除去偏离总体值过远的异常点(>%99的部分)

ulimit = np.percentile(train_df.price.values,99)
train_df['price'].ix[train_df['price']>ulimit] = ulimit

plt.figure(figsize=(8, 10))
plt.scatter(range(train_df.shape[0]), train_df["price"].values, color='red')
plt.title("Distribution of Price")

5. 处理数组型数据

对于数组形式的数据：features、photos、description，可以先做基本处理，例如：统计数组中元素个数、提取特征词逐个保存等。

train_df['num_photos'] = train_df['photos'].apply(len)
train_df['num_features'] = train_df['features'].apply(len)
train_df['num_description_words'] = train_df['description'].apply(lambda x: len(x.split(' ')))
train_df['num_sescription_len'] = train_df['description'].apply(len)

6. 处理Interest_level数据

将string形式存储的Interest_level映射为One hot 编码

label_num_map = {'high':0, 'medium':1, 'low':2}
train_df['label'] = train_df['interest_level'].apply(lambda x : label_num_map[x])
interest_level = pd.get_dummies(train_df['interest_level'])
train = train_df
train = pd.concat([train, interest_level], axis = 1)
train[['interest_level', 'high', 'low', 'medium']].head()

处理结果示例如下：

	interest_level	high	low	medium
10	medium	0	0	1
10000	low	0	1	0
100004	high	1	0	0
100007	low	0	1	0
100013	low	0	1	0

6. 处理Features数据

可以使用TF-IDF算法将Features数据映射成稀疏矩阵

用"null"填充缺失的数据

def fill_empty(row):
 if row.num_features == 0:
 return ['null']
 return row.features
train['features_filled'] = train.apply(lambda row: fill_empty(row), axis = 1)
train[['features_filled']].head()

	features_filled
10	[null]
10000	[Doorman, Elevator, Fitness Center, Cats Allow...
100004	[Laundry In Building, Dishwasher, Hardwood Flo...
100007	[Hardwood Floors, No Fee]
100013	[Pre-War]

调用sklearn中的TF-IDF算法进行处理

from sklearn import model_selection, preprocessing, ensemble
from sklearn.metrics import log_loss
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
train['features_word'] = train['features_filled'].apply(lambda x:" ".join([" ".join(i.split(" "))for i in x]))
tfidf = CountVectorizer(stop_words='english', max_features=200 )
train_sparse = tfidf.fit_transform(train['features_word'])
train[['features_word']].head()

	features_word
10	null
10000	Doorman Elevator Fitness Center Cats Allowed D...
100004	Laundry In Building Dishwasher Hardwood Floors...
100007	Hardwood Floors No Fee
100013	Pre-War

7. 处理时间数据

简单提取，分别保存即可

train['created_year'] = pd.to_datetime(train['created']).dt.year
train['created_month'] = pd.to_datetime(train['created']).dt.month
train['created_day'] = pd.to_datetime(train['created']).dt.day
train['created_hour'] = pd.to_datetime(train['created']).dt.hour
train['created_date'] = train['created'].dt.date
train[['created', 'created_year', 'created_month', 'created_day','created_hour']].head()

	created	created_year	created_month	created_day	created_hour
10	2016-06-24 07:54:24	2016	6	24	7
10000	2016-06-12 12:19:27	2016	6	12	12
100004	2016-04-17 03:26:41	2016	4	17	3
100007	2016-04-18 02:22:02	2016	4	18	2
100013	2016-04-28 01:32:41	2016	4	28	1

7. 词云处理Features，Descriptions文本数据

from wordcloud import WordCloud
text = ''
text_da = ''
for index,row in train_df.iterrows():
    for feature in row['features']:
        text = ' '.join([text,"_".join(feature.strip().split(" "))])
        text_da = " ".join([text_da,"_".join(row['display_address'].strip().split(" "))])
text = text.strip()
text_da = text_da.strip()

text_desc =''
for ind, row in train.iterrows():
    for feature in row['features']:
        text = " ".join([text, "_".join(feature.strip().split(" "))])
    text_desc=" ".join([text_desc, row['description']])
text = text.strip()
text_desc = text_desc.strip()

Exploratory Data Analysis

1. Interest_level统计分布

level_count = train['interest_level'].value_counts()

plt.figure(figsize=(8,4))
bar_width = 0.5
index = np.arange(len(level_count.values))
plt.bar(index - bar_width/2, level_count.values, bar_width, alpha=0.8)

plt.xticks(index, level_count.index)
plt.ylabel('Interest level count', fontsize=12)
plt.xlabel('Interest level', fontsize=12)
plt.show()

Distribution of interest_level

2. 地理位置分布

房屋纬度分布

llimit = np.percentile(train.latitude.values,1)
ulimit = np.percentile(train.latitude.values,99)
train['latitude'].loc[train['latitude']<llimit]=llimit
train['latitude'].loc[train['latitude']>ulimit]=ulimit
plt.figure(figsize=(10,5))
sns.distplot(train.latitude.values,bins=50,kde=True,color='blue')
plt.xlabel('latitude',fontsize=12)
plt.show()

房屋经度分布

llimit = np.percentile(train.longitude.values,1)
ulimit = np.percentile(train.longitude.values,99)
train['longitude'].loc[train['longitude']<llimit]=llimit
train['longitude'].loc[train['longitude']>ulimit]=ulimit
plt.figure(figsize=(12,6))
sns.distplot(train.longitude.values,bins=50)
plt.xlabel('longitude',fontsize=12)
plt.show()

可见训练集中的住房分布聚集在一片较小的区域内。将位置信息进一步可视化处理。

将房屋位置信息分布到Google Map上

import gpxpy as gpx
import gpxpy.gpx
gpx = gpxpy.gpx.GPX()
for index, row in train.iterrows():
    if row['interest_level'] == 'high': #opting for all nominals results in poor performance of Google Earth
        gps_waypoint = gpxpy.gpx.GPXWaypoint(row['latitude'],row['longitude'],elevation=10)
        gpx.waypoints.append(gps_waypoint)        
filename = "GoogleEarth.gpx"
FILE = open(filename,"w")
FILE.writelines(gpx.to_xml())
FILE.close()

观察地图可以发现出租的房屋主要分布在New York City中部地区。

3. Description和Features文本数据词云分布

Description词云生成

plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300,
                      max_font_size=50, max_words=40).generate(text_desc)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Description", fontsize=20)
plt.axis("off")
plt.show()

Features词云生成

plt.figure(figsize=(14,7))
wordcloud = WordCloud(background_color='white',width=600,height=300,max_font_size=50,max_words=40).generate(text)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Features Distribution",fontsize=12)
plt.axis("off")
plt.show()

观察词云可大致得出房屋Features中出现的高频词，与data preprocessing环节中TF-IDF算法得到的feature_words一定程度上重叠。

4. 发布时间分布

发布日期分布

cnt_srs = train['created_date'].value_counts()
plt.figure(figsize=(12,4))
ax = plt.subplot(111)
ax.bar(cnt_srs.index, cnt_srs.values, color='g')
ax.xaxis_date()
plt.xticks(rotation='vertical')
plt.show()

由图，总体来看房屋信息在2016年的4-6月间均匀发布。

发布时间段分布

hour_feature = train.groupby(['hour', 'interest_level'])['created_hour'].count().unstack('interest_level').fillna(0)
hour_feature[['low','medium',"high"]].plot(kind='bar', stacked=True)

由图，每天的凌晨1点至6点是信息发布的高峰时间段，其他时间段信息发布较少。同时interest_level与created_hour的关联性不强。

5. 统计强业务能力Manager

以high interest level房屋数目为Manager能力衡量标准，筛选出业务能力前十的Manager。

manager = train.groupby(['manager_id', 'interest_level'])['manager_id'].count().unstack('interest_level').fillna(0)
manager["sum"]=manager.sum(axis=1)
x=manager[ (manager['sum'] > 80) & (manager['high'] > 30)]
manager.head()
x.sort_values("sum",inplace=True)
plt.title("High Profile Managers")
plt.xlabel("Num of High Interest level")
plt.ylabel("Manager ID")
x = x[x.index != 'e6472c7237327dd3903b3d6f6a94515a']
x["sum"].plot(kind='barh',color="r");

6. 统计受欢迎建筑(building)

以high interest level房屋数目为building的interest level衡量标准，筛选出前十受欢迎的建筑。

building = train.groupby(['building_id', 'interest_level'])['building_id'].count().unstack('interest_level').fillna(0)
building["sum"]=building.sum(axis=1)
x=building[ (building['sum'] > 60) & (building['high'] > 15)]
building.head()

Feature Engineering

1. 房间卧室数目-兴趣度特征

本特征引用自数据预处理应用课件

获取对不同房间时数目的房屋感兴趣程度的数目以及兴趣程度占比

bedroom_features = train.groupby(['bedrooms'])['high', 'low', 'medium'].sum().add_suffix('_count').reset_index()
bedroom_features['all_count'] = train.groupby(['bedrooms'])['price'].count().reset_index().price
bedroom_features['high_ratio'] = bedroom_features['high_count']/bedroom_features['all_count']
bedroom_features['low_ratio'] = bedroom_features['low_count']/bedroom_features['all_count']
bedroom_features['medium_ratio'] = bedroom_features['medium_count']/bedroom_features['all_count']
bedroom_features.head()

	bedrooms	high_count	low_count	medium_count	all_count	high_ratio	low_ratio	medium_ratio
0	0	847.0	6518.0	2110.0	9475	0.089393	0.687916	0.222691
1	1	943.0	11715.0	3094.0	15752	0.059865	0.743715	0.196420
2	2	1306.0	9801.0	3516.0	14623	0.089311	0.670246	0.240443
3	3	592.0	4727.0	1957.0	7276	0.081363	0.649670	0.268966
4	4	149.0	1233.0	547.0	1929	0.077242	0.639191	0.283567

将新特征整合到训练集中

result = pd.merge(train, bedroom_features)

result.head()

	bathrooms	bedrooms	building_id	created	description	display_address	features	interest_level	latitude	listing_id	...	created_year	created_month	created_day	high_count	low_count	medium_count	all_count	high_ratio	low_ratio	medium_ratio
0	1.5	3	53a5b119ba8f7b61d4e010512e0dfc85	2016-06-24 07:54:24	A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...	Metropolitan Avenue	[]	medium	40.7145	7211212	...	2016	6	24	592.0	4727.0	1957.0	7276	0.081363	0.64967	0.268966
1	1.0	3	205f95d4a78f1f3befda48b89edc9669	2016-04-12 02:39:45	BEAUTIFUL 2 BEDROOM POSSIBLE CONVERSION INTO T...	Madison Avenue	[Doorman, Elevator, Dishwasher, Hardwood Floors]	low	40.7454	6858062	...	2016	4	12	592.0	4727.0	1957.0	7276	0.081363	0.64967	0.268966
2	1.0	3	be6b7c3fdf3f63a2756306f4af7788a6	2016-04-18 04:46:30	These pictures are from a similarlisting.	Thompson St	[Washer/Dryer]	low	40.7231	6890563	...	2016	4	18	592.0	4727.0	1957.0	7276	0.081363	0.64967	0.268966
3	1.0	3	6e9276972669ddcb097e71691f4c8f1b	2016-04-26 11:38:19	TRUE 3 bedroom! RENOVATED! Windows Everywhere!...	540 w 52nd street, new york, ny	[Common Outdoor Space, Cats Allowed, Dogs Allo...	medium	40.7660	6927880	...	2016	4	26	592.0	4727.0	1957.0	7276	0.081363	0.64967	0.268966
4	2.0	3	9ba2f2719b9ab170812a9ea79a3bd84c	2016-04-10 02:41:58	!!! NO FEE !!!It is a community-oriented, -fri...	North Moore St.	[Swimming Pool, Doorman, Elevator, Laundry in ...	low	40.7196	6851714	...	2016	4	10	592.0	4727.0	1957.0	7276	0.081363	0.64967	0.268966

5 rows × 35 columns

use_features = ['bedrooms','bathrooms','price','high_count','low_count','medium_count','high_ratio','low_ratio','medium_ratio']
result_x = result[use_features]
result_y = train[['label']]

result_x.head()

	bedrooms	bathrooms	price	high_count	low_count	medium_count	high_ratio	low_ratio	medium_ratio
0	3	1.5	3000.0	592.0	4727.0	1957.0	0.081363	0.64967	0.268966
1	3	1.0	4395.0	592.0	4727.0	1957.0	0.081363	0.64967	0.268966
2	3	1.0	3733.0	592.0	4727.0	1957.0	0.081363	0.64967	0.268966
3	3	1.0	4500.0	592.0	4727.0	1957.0	0.081363	0.64967	0.268966
4	3	2.0	6320.0	592.0	4727.0	1957.0	0.081363	0.64967	0.268966

2. Features特征提取

通过观察EDA部分获取的Features词云，我发现“Elevator”，“Dogs_allowed”，“Cats_allowed”
是描述房屋环境的重要条件。因此我预测，三者符合数目越多，房屋条件越亲民，租客感兴趣程度就越高。
因此用bool值描述各个房屋的description中是否包括这三个条件，做为新特征合并到训练集train中。

def newfeat(name, df, series):
    """Create a Series for my feature building loop to fill"""
    feature = pd.Series(0, df.index, name=name)
    """Now populate the new Series with numeric values"""
    for row, word in enumerate(series):
        if name in word:
            feature.iloc[row] = 1
    df[name] = feature
    return(df)
   
train = newfeat('Elevator', train, train.features)
train = newfeat('Dogs Allowed', train, train.features)
train = newfeat('Cats Allowed', train, train.features)
train.head()

3.时间特征提取

房屋的信息发布时间分布跨度较大，考虑到租客可能对新信息的兴趣较大，因此提取发布信息距离最近一条信息的时间间隔为新特征。

train['passed'] = train['date'].max()-train['date']
train.head()

网友评论

我爱编程

本文标题：

本文链接：https://www.haomeiwen.com/subject/pzgchftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！