Notebook - Comprehensive data ex

Author: 左心Chris | Published: 2019-10-28 15:54

    https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#2.-First-things-first:-analysing-'SalePrice'

    Intro

    1. Understand the problem
    2. Univariate study
    3. Multivariate study
    4. Basic cleaning
    5. Test assumptions

    What we expect

    1. Variable: name
    2. Type: categorical or numerical
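    3. Segment: identification
    4. Expectation: expected influence on the output
      First, filter for the features we need:
    • Does this feature affect the output?
    • How important is this feature?
    • Is this feature already described by another feature?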
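The first two columns of the checklist above can be filled in mechanically. The following is a minimal sketch, assuming the data is in a pandas DataFrame. The helper name `describe_variables` and the toy values are hypothetical, and the dtype-based type is only a first guess (an integer-coded rating would be labelled numerical). Segment and Expectation are left blank because they need human judgment.

```python
import pandas as pd


def describe_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Build a starter table with one row per column (hypothetical helper)."""
    rows = []
    for name in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[name])
        rows.append({
            "Variable": name,
            "Type": "numerical" if is_numeric else "categorical",
            "Segment": "",      # to be filled in by hand
            "Expectation": "",  # to be filled in by hand
        })
    return pd.DataFrame(rows)


# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
})
overview = describe_variables(toy)
print(overview)
```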

    Analysing

    #invite people for the Kaggle party
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    from scipy.stats import norm
    from sklearn.preprocessing import StandardScaler
    from scipy import stats
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline
    

    Simple describe and histogram

    #descriptive statistics summary
    df_train['SalePrice'].describe()
    sns.distplot(df_train['SalePrice'])
    

    Skewness and kurtosis
    http://blog.sciencenet.cn/blog-3083238-1057463.html
    Kurtosis > 0: more sharply peaked than the normal distribution
    Skewness > 0: right-skewed, with a long tail on the right

    #skewness and kurtosis
    print("Skewness: %f" % df_train['SalePrice'].skew())
    print("Kurtosis: %f" % df_train['SalePrice'].kurt())
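To make the sign convention concrete, here is a small sketch on made-up numbers (not the housing data). A few small values plus one large value give a long right tail, so the skewness should come out positive, and mirroring the values flips the sign. Note that pandas `kurt()` reports excess kurtosis, so 0 corresponds to a normal distribution.

```python
import pandas as pd

# Made-up values: mostly small, one very large (long right tail).
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])
# Mirror image of the same values (long left tail).
left_tailed = -right_tailed

print("right-tailed skew:", right_tailed.skew())
print("left-tailed skew:", left_tailed.skew())
```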
    

    Relationship with numerical variables (scatter plot)

    #scatter plot grlivarea/saleprice
    var = 'GrLivArea'
    data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
    data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
    

    Relationship with categorical variables (box plot)

    #box plot overallqual/saleprice
    var = 'OverallQual'
    data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
    f, ax = plt.subplots(figsize=(8, 6))
    fig = sns.boxplot(x=var, y="SalePrice", data=data)
    fig.axis(ymin=0, ymax=800000);
    
    var = 'YearBuilt'
    data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
    f, ax = plt.subplots(figsize=(16, 8))
    fig = sns.boxplot(x=var, y="SalePrice", data=data)
    fig.axis(ymin=0, ymax=800000);
    plt.xticks(rotation=90)
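The centre line of each box in a box plot is the group median, so the same comparison can be read off numerically. This is a rough sketch on made-up prices. The column names mirror the ones used above, but the values are invented for illustration.

```python
import pandas as pd

# Toy data: two quality levels with clearly different price ranges.
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 8, 8, 8],
    "SalePrice": [120000, 130000, 140000, 250000, 300000, 350000],
})

# Median SalePrice per category: the value a box plot draws as the centre line.
medians = toy.groupby("OverallQual")["SalePrice"].median()
print(medians)
```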
    

    Work smart

    Correlation matrix(heatmap)

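    #correlation matrix (numeric columns only; recent pandas versions raise on non-numeric columns otherwise)
    corrmat = df_train.corr(numeric_only=True)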
    f, ax = plt.subplots(figsize=(12, 9))
    sns.heatmap(corrmat, vmax=.8, square=True);
    

    Correlation matrix(zoomed heatmap style)

    #saleprice correlation matrix
    k = 10 #number of variables for heatmap
    cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
    cm = np.corrcoef(df_train[cols].values.T)
    sns.set(font_scale=1.25)
    hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
    plt.show()
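The column selection in the zoomed heatmap can be checked without plotting. The following sketch uses synthetic data (the column `Noise` and all values are made up) to show what `nlargest` returns. One caveat worth knowing: `nlargest` ranks by signed correlation, so a strongly negative correlation would not be selected this way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
toy = pd.DataFrame({
    "SalePrice": 100 * area + rng.normal(0, 5000, n),  # driven mostly by area
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),                      # unrelated to price
})

corrmat = toy.corr()
k = 2  # number of variables to keep, including SalePrice itself
cols = corrmat.nlargest(k, "SalePrice")["SalePrice"].index
print(list(cols))
```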
    

    Scatter plots

    #scatterplot
    sns.set()
    cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
    sns.pairplot(df_train[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn
    plt.show()
    

    Missing Data

    Drop the columns with too much missing data

    #missing data
    total = df_train.isnull().sum().sort_values(ascending=False)
    percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.head(20)
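Once the missing-data table is built, dropping the worst columns is a short step. Here is a minimal sketch on toy data. The 0.5 threshold is an assumption chosen for illustration, not a value taken from these notes, and the rows are made up. `isnull().mean()` gives the same fraction as the sum/count ratio above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],           # 75% missing
    "Electrical": ["SBrkr", "SBrkr", np.nan, "SBrkr"],  # 25% missing
    "SalePrice": [208500, 181500, 223500, 140000],      # complete
})

missing_fraction = toy.isnull().mean()  # fraction of missing values per column
threshold = 0.5                         # hypothetical cut-off
to_drop = missing_fraction[missing_fraction > threshold].index
cleaned = toy.drop(columns=to_drop)
print(cleaned.columns.tolist())
```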
    

    Univariate analysis: In this context, data standardization means converting data values to have a mean of 0 and a standard deviation of 1

    #standardizing data
    saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].to_numpy()[:, np.newaxis])  # scaler expects a 2-D array
    low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
    high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
    print('outer range (low) of the distribution:')
    print(low_range)
    print('\nouter range (high) of the distribution:')
    print(high_range)
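The standardization itself is just z = (x - mean) / std. This small sketch does it by hand with NumPy on made-up prices. `StandardScaler` uses the population standard deviation (equivalent to `ddof=0`), which is also the `np.std` default, so the two should agree.

```python
import numpy as np

# Made-up prices with one value far above the rest.
prices = np.array([100000.0, 150000.0, 200000.0, 250000.0, 800000.0])

# Subtract the mean, divide by the population standard deviation (ddof=0).
z = (prices - prices.mean()) / prices.std()

print("standardized values:", np.sort(z))
print("mean:", z.mean(), "std:", z.std())
```

Values far from 0 in the sorted output are the candidates to inspect as outliers.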
    

    Code

    Histogram - Kurtosis and skewness.
    Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.

    #histogram and normal probability plot
    sns.distplot(df_train['SalePrice'], fit=norm);
    fig = plt.figure()
    res = stats.probplot(df_train['SalePrice'], plot=plt)
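"Closely follows the diagonal" can also be read as a number. Besides the plot coordinates, `stats.probplot` returns the correlation coefficient `r` of the least-squares fit, which approaches 1 when the sample is close to normal. This is a rough sketch on synthetic samples, not on the housing data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # long right tail

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for normal sample:", r_normal)
print("r for skewed sample:", r_skewed)
```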
    

    convert categorical variable into dummy

    #convert categorical variable into dummy
    df_train = pd.get_dummies(df_train)
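To see what `pd.get_dummies` does to a whole DataFrame, here is a tiny sketch with made-up rows. Numeric columns pass through unchanged, while each text column is replaced by one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```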
    
