Notebook - Comprehensive data ex

Author: 左心Chris | Published: 2019-10-28 15:54

https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#2.-First-things-first:-analysing-'SalePrice'

Intro

Understand the problem
Univariate study
Multivariate study
Basic cleaning
Test assumptions

What we expect

Variable: name
Type: categorical or numerical
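Segment: identification
Expectation: expected influence on the output
First, filter for the features we need by asking:

Does this feature affect the output?
How important is this feature?
Is this feature already described by another feature?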
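
The overview table described above could be started programmatically. This is only a sketch: the two-column `toy` frame and its values are made up for illustration, and the Segment and Expectation columns are left blank because they come from domain judgment, not from code.

```python
import pandas as pd

# Hypothetical stand-in for the training data
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262],
    "Neighborhood": ["CollgCr", "Veenker"],
})

overview = pd.DataFrame({
    "Variable": toy.columns,
    # Numeric dtypes are labelled numerical, everything else categorical
    "Type": ["numerical" if pd.api.types.is_numeric_dtype(toy[c]) else "categorical"
             for c in toy.columns],
    "Segment": "",       # to be filled in by hand
    "Expectation": "",   # to be filled in by hand
})
print(overview)
```

Note that a dtype check is a rough first guess: an integer-coded rating such as OverallQual would be labelled numerical even though it is often treated as categorical.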

Analysing

#invite people for the Kaggle party
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
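warnings.filterwarnings('ignore')
%matplotlib inline

#load the training set; the path is an assumption (Kaggle kernel layout), adjust as needed
df_train = pd.read_csv('../input/train.csv')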

Simple describe and histogram

#descriptive statistics summary
df_train['SalePrice'].describe()
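#distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same picture
sns.histplot(df_train['SalePrice'], kde=True)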

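Skewness and kurtosis
http://blog.sciencenet.cn/blog-3083238-1057463.html
Kurtosis greater than 0: the distribution is more sharply peaked than the normal distribution
Skewness greater than 0: the distribution is right-skewed, with a long tail on the right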

#skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
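
To see how these two numbers behave, here is a small sketch on synthetic data (not the housing data). The sample sizes, seed and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic samples: one symmetric, one right-skewed
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.75, size=10_000))

for name, s in [("normal", symmetric), ("lognormal", right_skewed)]:
    # Series.kurt() reports excess kurtosis, so a normal sample is near 0
    print("%s: skewness=%.2f, kurtosis=%.2f" % (name, s.skew(), s.kurt()))
```

The normal sample should give values close to 0 for both measures, while the lognormal sample should give clearly positive values, matching the right-skewed, peaked shape described above.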

Relationship with numerical variables (scatter plot)

#scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Relationship with categorical features (box plot)

#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90)

Work smart

Correlation matrix(heatmap)

#correlation matrix
#numeric_only=True skips text columns, which recent pandas versions refuse to correlate
corrmat = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

Correlation matrix(zoomed heatmap style)

#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
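
The selection step behind the zoomed heatmap can be checked without plotting. The sketch below uses a made-up frame whose column names (`a`, `b`, `c`, `target`) and coefficients are purely illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
# Hypothetical columns: 'target' depends strongly on 'a', weakly on 'b', not on 'c'
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a, "b": b, "c": c,
    "target": 3.0 * a + 0.5 * b + rng.normal(scale=0.5, size=n),
})

corr = toy.corr()
# nlargest sorts rows by the 'target' column; the target itself comes first (corr = 1)
top = corr.nlargest(3, "target")["target"].index
print(list(top))

# np.corrcoef expects one variable per row, hence the transpose
cm = np.corrcoef(toy[top].values.T)
print(np.round(cm, 2))
```

One caveat with this approach: `nlargest` ranks by the signed correlation, so a feature with a strong negative correlation would not be selected. Ranking by the absolute value is an alternative if that matters.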

Scatter plots

#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
#the 'size' argument was renamed to 'height' in seaborn 0.9
sns.pairplot(df_train[cols], height=2.5)
plt.show()

Missing Data

Drop the columns with too much missing data

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
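
The table above only reports the missing values; the actual drop is not shown. A minimal sketch of that step might look like the following, where the toy frame, its column names and the 0.5 cutoff are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; column names are made up
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

percent = toy.isnull().mean()   # same as isnull().sum() / len(toy)
threshold = 0.5                 # assumed cutoff, tune per dataset
bad_cols = percent[percent > threshold].index
cleaned = toy.drop(columns=bad_cols)
print(list(cleaned.columns))
```

Columns that fall below the cutoff but still contain gaps (like `one_missing` here) need a separate decision, for example dropping the affected rows or imputing values.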

Univariate analysis: here, data standardization means rescaling the values so that they have a mean of 0 and a standard deviation of 1.

#standardizing data
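#convert to a NumPy array first: indexing a Series with [:, np.newaxis] fails in recent pandas
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].to_numpy()[:, np.newaxis])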
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
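
As a quick check of what standardization does, the sketch below compares `StandardScaler` with the manual formula on a handful of made-up prices (the values are illustrative, not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

prices = np.array([100.0, 150.0, 200.0, 250.0, 800.0])  # made-up values

# StandardScaler expects a 2-D array: one column, one row per sample
scaled = StandardScaler().fit_transform(prices.reshape(-1, 1))

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (prices - prices.mean()) / prices.std()

print(scaled.ravel())
print(manual)
```

After scaling, the largest value stands out as a large positive number, which is how the low and high ranges printed above help spot candidate outliers.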

Code

Histogram - Kurtosis and skewness.
Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.

#histogram and normal probability plot
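#distplot(fit=norm) is deprecated; draw the histogram, then overlay the fitted normal curve
sns.histplot(df_train['SalePrice'], stat='density', kde=True)
mu, sigma = norm.fit(df_train['SalePrice'])
x = np.linspace(df_train['SalePrice'].min(), df_train['SalePrice'].max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), 'k')
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)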
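
The "follows the diagonal" criterion can also be read as a number: `stats.probplot` returns the correlation coefficient of the fitted line. The sketch below compares a normal and a right-skewed sample, both synthetic, with arbitrary seed and sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=2000)
skewed_sample = rng.lognormal(sigma=1.0, size=2000)

# Without plot=..., probplot only returns the quantile pairs and the fitted line
(_, _), (_, _, r_normal) = stats.probplot(normal_sample)
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample)

print("r (normal): %.3f, r (lognormal): %.3f" % (r_normal, r_skewed))
```

An r value very close to 1 means the points hug the diagonal; the skewed sample should score visibly lower.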

Convert categorical variables into dummy variables

#convert categorical variable into dummy
df_train = pd.get_dummies(df_train)
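
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame (the column names and values are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame({
    "area": [850, 1200, 990],
    "zone": ["RL", "RM", "RL"],
})

# Numeric columns pass through unchanged; each category becomes its own indicator column
dummies = pd.get_dummies(toy)
print(list(dummies.columns))
print(dummies)
```

Each text column is replaced by one indicator column per category, named `<column>_<category>`, so the result contains only numeric or boolean columns.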
