Machine Learning in Action (2): Predicting House Prices

Author: 柳叶刀与小鼠标 | Published 2018-10-21 19:03

Previous post: Machine Learning in Action (1): Predicting House Prices with Linear Regression - Jianshu
https://www.jianshu.com/p/0b66f1c4cc2d

This post walks systematically through preprocessing the data before any machine learning model is fitted.

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018

@author: Administrator
"""

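# IPython magics: they only work in an IPython/Spyder console, not in a plain .py script
# %reset -f
# %clear

# In[*]
########## Step 1: import packages
# In[*]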
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
# import from the public scipy.stats namespace rather than scipy.stats.stats
from scipy.stats import skew, pearsonr
os.chdir("C:\\Users\\Administrator\\Desktop\\all")

# In[*]
########## Step 2: load the data
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test  = pd.read_csv('test.csv',header = 0,index_col=0)

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

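The first two steps import the packages and load the data.

The data has roughly 80 columns and about 3,000 observations; the attributes include both numeric columns and string columns.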
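
As a quick check of that description, the sketch below stacks the feature columns of a toy train/test pair and counts numeric versus string columns. The tiny frames and their values are hypothetical stand-ins for `train.csv` and `test.csv`; only the `pd.concat` pattern comes from the script above.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values, not the real data).
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "Street": ["Pave", "Grvl"],
})

# Stack only the feature columns, mirroring the
# .loc[:, 'MSSubClass':'SaleCondition'] slice used in the script.
all_data = pd.concat((train.loc[:, "LotArea":"Street"],
                      test.loc[:, "LotArea":"Street"]))

n_rows, n_cols = all_data.shape
numeric_cols = list(all_data.select_dtypes(include="number").columns)
string_cols = list(all_data.select_dtypes(exclude="number").columns)
print(n_rows, n_cols, numeric_cols, string_cols)
```

On the real data, the same three lines at the end would report the row count, the column count and which columns need encoding.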

# In[*]
# Step 3: compare the target's distribution before and after log(1 + x)

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"],
                       "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

# In[*]
# Step 4: log-transform the target and the skewed numeric features
# log-transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 

skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

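The main purpose of this step is to take the numeric features, pick out those that are strongly skewed (skewness above 0.75) and therefore far from normally distributed, and apply a log(1 + x) transform to them.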
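
To see what the threshold does, here is a minimal sketch on synthetic data. The column names `symmetric` and `right_skewed` and the random samples are assumptions made for illustration; the selection logic (`skew`, the 0.75 cutoff, `np.log1p`) is the same as in the script.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic data (hypothetical): one roughly symmetric column, one right-skewed.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=1000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
})

# Skewness per column, ignoring missing values, then keep columns above 0.75.
skewness = df.apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness > 0.75].index

# Apply log(1 + x) only to the selected columns.
transformed = df.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])

skew_before = skew(df["right_skewed"])
skew_after = skew(transformed["right_skewed"])
print(list(skewed_cols), skew_before, skew_after)
```

Only the skewed column is selected, and its skewness drops after the transform. Note that `log1p` assumes values greater than -1, which holds for the non-negative counts and areas in this kind of data.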

# In[*]
# Step 5: one-hot encode the string columns and fill missing values
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# Step 6: split back into training and test sets
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
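
Because `y` is now on the log(1 + price) scale, predictions from any model fitted on it would also be on that scale. A minimal sketch of mapping them back with `np.expm1`, the inverse of `np.log1p`, follows; the prices are hypothetical values, not taken from the dataset.

```python
import numpy as np

prices = np.array([100000.0, 250000.0, 755000.0])  # hypothetical sale prices

y_log = np.log1p(prices)      # the scale the target is stored on after Step 4
recovered = np.expm1(y_log)   # inverse transform back to the price scale

print(np.allclose(recovered, prices))
```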

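Key preprocessing points:
1. Transform skewed numeric features with log(x + 1), which brings their distributions closer to normal
2. Create dummy variables for the categorical features
3. Replace numeric missing values (NaN) with the mean of their respective columns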
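
Points 2 and 3, together with the final split, can be sketched on a toy frame. The column names `LotFrontage` and `Street` and all values are used here only as illustrative assumptions; the calls are the same ones the script uses.

```python
import numpy as np
import pandas as pd

# Toy feature frames (hypothetical values) with one numeric and one string column.
train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Street": ["Pave", "Grvl", "Pave"],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 75.0],
    "Street": ["Pave", "Pave"],
})
all_data = pd.concat((train, test))

# Point 2: string columns become one indicator column per category.
all_data = pd.get_dummies(all_data)

# Point 3: NaN in each column is replaced by that column's mean
# (computed over train and test together, as in the script).
all_data = all_data.fillna(all_data.mean())

# Split back by position: the first len(train) rows are the training set.
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
print(X_train)
print(X_test)
```

Encoding the stacked frame gives both parts the same columns: the toy test set contains no "Grvl" row, yet `X_test` still has a `Street_Grvl` column. The trade-off is that the fill means also draw on test rows.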

Full code:

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 21 14:37:15 2018

@author: Administrator
"""

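# IPython magics: they only work in an IPython/Spyder console, not in a plain .py script
# %reset -f
# %clear

# In[*]
########## Step 1: import packages
# In[*]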
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
import numpy as np
import seaborn as sns
import os
# import from the public scipy.stats namespace rather than scipy.stats.stats
from scipy.stats import skew, pearsonr
os.chdir("C:\\Users\\Administrator\\Desktop\\all")

# In[*]
########## Step 2: load the data
# In[*]
train = pd.read_csv('train.csv',header = 0,index_col=0)
test  = pd.read_csv('test.csv',header = 0,index_col=0)

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))
# In[*]

#Data preprocessing:

#We're not going to do anything fancy here:

#First I'll transform the skewed numeric features by taking log(feature + 1) - 
#this will make the features more normal
#Create Dummy variables for the categorical features
#Replace the numeric missing values (NaN's) with the mean of their respective columns

# In[*]
# Step 3: compare the target's distribution before and after log(1 + x)
# In[*]
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"],
                       "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

# In[*]
# Step 4: log-transform the target and the skewed numeric features
# In[*]
# log-transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 

skewed_feats = skewed_feats[skewed_feats > 0.75]


skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

# In[*]
# Step 5: one-hot encode the string columns and fill missing values
# In[*]
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())
# In[*]
# Step 6: split back into training and test sets
# In[*]
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

Next post: Machine Learning in Action (3): Predicting House Prices with Lasso Regression - Jianshu
https://www.jianshu.com/p/ccfa1d0b792a
