Previous post
Machine Learning in Practice (1): Predicting House Prices with Linear Regression - Jianshu
https://www.jianshu.com/p/0b66f1c4cc2d
This post walks systematically through the preprocessing applied to the data before any machine learning is done.
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018
@author: Administrator
"""
# IPython magics; run these in the console (plain Python cannot parse them):
# %reset -f
# %clear
# In[*]
########## Step 1: import packages
# In[*]
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
from scipy.stats import skew
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
os.chdir("C:\\Users\\Administrator\\Desktop\\all")
# In[*]
########## Step 2: load the data
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test = pd.read_csv('test.csv',header = 0,index_col=0)
# Stack the feature columns of both files (SalePrice is excluded)
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
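The first two steps import the packages and load the data.
The data has roughly 80 columns and about 3,000 observations once the training and test sets are combined. The columns include both numeric and string attributes.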
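A quick way to check the column mix described above is to look at the shape and dtypes of the combined table. The sketch below uses a tiny made-up frame in place of `train.csv` and `test.csv`; the values are hypothetical and only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv / test.csv (values are made up).
train = pd.DataFrame({
    "MSSubClass": [60, 20],
    "MSZoning": ["RL", "RM"],
    "SaleCondition": ["Normal", "Abnorml"],
    "SalePrice": [208500, 181500],
}, index=pd.Index([1, 2], name="Id"))
test = pd.DataFrame({
    "MSSubClass": [20],
    "MSZoning": ["RH"],
    "SaleCondition": ["Normal"],
}, index=pd.Index([1461], name="Id"))

# Label-based column slices include both endpoints, so SalePrice is left out.
all_data = pd.concat((train.loc[:, "MSSubClass":"SaleCondition"],
                      test.loc[:, "MSSubClass":"SaleCondition"]))

print(all_data.shape)                  # rows from both files, feature columns only
print(all_data.dtypes.value_counts())  # how many numeric vs. string columns
```

On the real files the same two `print` calls show how many rows and feature columns the combined table has and how they split between numeric and string types.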
# In[*]
# Step 3: compare the target's distribution before and after log(1 + x)
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"],
                       "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
# In[*]
# Step 4: log-transform the target and the skewed numeric features
# log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])
# log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
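The goal of this step is to take the numeric features and apply a log(1 + x) transform to those that are strongly skewed (skewness above 0.75, measured on the training set), so that their distributions are closer to normal. Strictly speaking this is a log transform rather than standardization.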
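To see what the skewness filter and `np.log1p` do, here is a minimal sketch on synthetic data. The two columns are generated at random and only borrow their names from the dataset; the 0.75 threshold is the one used above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical data: one right-skewed column and one roughly symmetric column.
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=1.0, size=1000),
    "OverallQual": rng.normal(loc=6.0, scale=1.0, size=1000),
})

# Skewness per column, ignoring missing values.
skewness = df.apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness > 0.75].index  # same threshold as above

before = skewness[skewed_cols]
df[skewed_cols] = np.log1p(df[skewed_cols])    # transform only the skewed columns
after = df[skewed_cols].apply(lambda col: skew(col.dropna()))

print(pd.DataFrame({"before": before, "after": after}))
```

Only the right-skewed column passes the filter, and its skewness drops sharply after the transform, while the symmetric column is left untouched.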
# In[*]
# Step 5: one-hot encode the string columns and fill in missing values
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# Step 6: split the combined table back into training and test sets
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
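Because `y` now holds log(1 + SalePrice), anything predicted on that scale has to be mapped back before it can be read as a price. A minimal sketch of the round trip, using made-up prices, might look like this:

```python
import numpy as np

sale_price = np.array([100000.0, 208500.0, 755000.0])  # hypothetical prices
y = np.log1p(sale_price)     # the scale the target is stored on after step 4
recovered = np.expm1(y)      # inverse of log1p: back to the price scale

print(np.allclose(recovered, sale_price))
```

`np.expm1` is the exact inverse of `np.log1p`, so the same call would be applied to a model's predictions.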
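Key points of the preprocessing:
1. Transform skewed numeric features with log(x + 1), which brings their distributions closer to normal
2. Create dummy variables for the categorical features
3. Replace numeric missing values (NaN) with the mean of their respective columns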
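Points 2 and 3 can be illustrated on a toy table. The frame below is hypothetical (four rows, with the first two playing the role of the training set) and `n_train` is a name introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined feature table (2 train rows, 2 test rows).
n_train = 2
all_data = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 71.0],
    "MSZoning": ["RL", "RM", "RL", "RH"],  # "RH" occurs only in the test rows
})

# One-hot encode string columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(all_data)
# Fill remaining NaNs with each column's mean (computed over all rows).
encoded = encoded.fillna(encoded.mean())

# Split back by position, as in step 6.
X_train = encoded.iloc[:n_train]
X_test = encoded.iloc[n_train:]

print(list(encoded.columns))
print(X_train.shape, X_test.shape)
```

Encoding the training and test rows together is what guarantees that both matrices end up with the same dummy columns, even for a category that appears in only one of them. Note that the column means used for filling are computed over both sets as well.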
Full code:
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018
@author: Administrator
"""
# IPython magics; run these in the console (plain Python cannot parse them):
# %reset -f
# %clear
# In[*]
########## Step 1: import packages
# In[*]
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
from scipy.stats import skew
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
os.chdir("C:\\Users\\Administrator\\Desktop\\all")
# In[*]
########## Step 2: load the data
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test = pd.read_csv('test.csv',header = 0,index_col=0)
# Stack the feature columns of both files (SalePrice is excluded)
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
# In[*]
#Data preprocessing:
#We're not going to do anything fancy here:
#First I'll transform the skewed numeric features by taking log(feature + 1) -
#this will make the features more normal
#Create Dummy variables for the categorical features
#Replace the numeric missing values (NaN's) with the mean of their respective columns
# In[*]
# Step 3: compare the target's distribution before and after log(1 + x)
# In[*]
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"],
                       "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
# In[*]
# Step 4: log-transform the target and the skewed numeric features
# In[*]
# log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])
# log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
# In[*]
# Step 5: one-hot encode the string columns and fill in missing values
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# Step 6: split the combined table back into training and test sets
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
Next post
Machine Learning in Practice (3): Predicting House Prices with Lasso Regression - Jianshu
https://www.jianshu.com/p/ccfa1d0b792a