Feature Data Preprocessing: Standardization and Normalization

Author: ForgetThatNight | Published 2018-07-05 23:05
    import pandas as pd
    import numpy as np
    
    df = pd.read_csv(
        '../data/wine_data.csv',  # the Wine dataset
         header=None,     # the file has no header row; we name the columns ourselves
         usecols=[0,1,2]  # take a subset of the columns; a couple of features are enough for this example
        )
    
    df.columns=['Class label', 'Alcohol', 'Malic acid']
    
    df.head()
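
    If the local file '../data/wine_data.csv' is not available, the same data can presumably be loaded straight from the UCI Machine Learning Repository; the URL below is an assumption based on the standard location of the Wine dataset:

    import pandas as pd

    # Assumed alternative source: the UCI copy of the Wine dataset.
    UCI_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'

    df = pd.read_csv(
        UCI_URL,
        header=None,       # no header row in the raw file
        usecols=[0, 1, 2]  # class label plus the first two features
        )
    df.columns = ['Class label', 'Alcohol', 'Malic acid']
    df.head()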
    

    In this data, Alcohol and Malic acid are measured on different scales, so the numeric ranges of the two features differ considerably.
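
    A quick look at the summary statistics (a minimal check on the DataFrame loaded above) makes the scale difference concrete:

    # Alcohol spans roughly 11-15 while Malic acid spans roughly 0.7-5.8,
    # so the two features live on quite different numeric ranges.
    df[['Alcohol', 'Malic acid']].describe()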

    Standardization and Min-Max scaling

    from sklearn import preprocessing
    
    std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
    df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])
    
    minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
    df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])
    
    
    print('Mean after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
          .format(df_std[:,0].mean(), df_std[:,1].mean()))
    print('\nStandard deviation after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
          .format(df_std[:,0].std(), df_std[:,1].std()))
    

    Output:
    Mean after standardization:
    Alcohol=-0.00, Malic acid=-0.00

    Standard deviation after standardization:
    Alcohol=1.00, Malic acid=1.00
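
    As a sanity check, the same numbers can be reproduced by hand; this is a minimal sketch of what StandardScaler does, i.e. z = (x - mean) / std applied column by column:

    import numpy as np

    # Manual z-score standardization of the two feature columns.
    X = df[['Alcohol', 'Malic acid']].values
    X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)

    # Should agree with the StandardScaler output up to floating-point error.
    print(np.allclose(X_std_manual, df_std))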

    print('Min-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
          .format(df_minmax[:,0].min(), df_minmax[:,1].min()))
    print('\nMax-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
          .format(df_minmax[:,0].max(), df_minmax[:,1].max()))
    

    Output:
    Min-value after min-max scaling:
    Alcohol=0.00, Malic acid=0.00

    Max-value after min-max scaling:
    Alcohol=1.00, Malic acid=1.00
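
    Min-max scaling can be checked the same way; MinMaxScaler with the default range maps each column to [0, 1] via (x - min) / (max - min):

    # Manual min-max scaling of the two feature columns.
    X = df[['Alcohol', 'Malic acid']].values
    X_minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Should agree with the MinMaxScaler output up to floating-point error.
    print(np.allclose(X_minmax_manual, df_minmax))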

    Plotting

    %matplotlib inline
    
    from matplotlib import pyplot as plt
    
    def plot():
        plt.figure(figsize=(8,6))
    
        plt.scatter(df['Alcohol'], df['Malic acid'], 
                color='green', label='input scale', alpha=0.5)
    
        plt.scatter(df_std[:,0], df_std[:,1], color='red', 
                label=r'Standardized [$N  (\mu=0, \; \sigma=1)$]', alpha=0.3)
    
        plt.scatter(df_minmax[:,0], df_minmax[:,1], 
                color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)
    
        plt.title('Alcohol and Malic Acid content of the wine dataset')
        plt.xlabel('Alcohol')
        plt.ylabel('Malic Acid')
        plt.legend(loc='upper left')
        plt.grid()
        
        plt.tight_layout()
    
    plot()
    plt.show()
    

    We have plotted the original data and both transformed versions on the same figure; take a look at the result. Next, let's check whether the transformations have scrambled the data, i.e. whether the class structure is preserved.

    fig, ax = plt.subplots(3, figsize=(6,14))
    
    for a,d,l in zip(range(len(ax)), 
                   (df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
                   ('Input scale', 
                    r'Standardized [$N  (\mu=0, \; \sigma=1)$]', 
                    'min-max scaled [min=0, max=1]')
                    ):
        for i,c in zip(range(1,4), ('red', 'blue', 'green')):
            ax[a].scatter(d[df['Class label'].values == i, 0], 
                      d[df['Class label'].values == i, 1],
                      alpha=0.5,
                      color=c,
                      label='Class %s' %i
                      )
        ax[a].set_title(l)
        ax[a].set_xlabel('Alcohol')
        ax[a].set_ylabel('Malic Acid')
        ax[a].legend(loc='upper left')
        ax[a].grid()
        
    plt.tight_layout()
    
    plt.show()
    

    In machine learning, if we apply this kind of preprocessing to the training set, we must apply exactly the same (already fitted) transformation to the test set:

    std_scale = preprocessing.StandardScaler().fit(X_train)
    X_train = std_scale.transform(X_train)
    X_test = std_scale.transform(X_test)
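
    A convenient way to guarantee that the test set only ever sees the transformation fitted on the training set is to wrap the scaler and the model in a pipeline. This is a minimal sketch, assuming X_train, y_train, X_test, y_test already exist; the LogisticRegression classifier is just an illustrative placeholder, not part of the original walkthrough:

    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # The pipeline fits StandardScaler on the training data only and reuses
    # those fitted parameters when scoring on the test data.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))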

    The effect of standardization on PCA

    Principal component analysis (PCA) is a very useful technique. Next, let's compare how PCA performs on data with and without standardization.

    Reading in the dataset

    import pandas as pd
    
    df = pd.read_csv(
        '../data/wine_data.csv', 
        header=None,
        )
    df.head()
    

    Dividing the dataset into a separate training and test dataset

    In this step, we will randomly divide the wine dataset into a training dataset and a test dataset where the training dataset will contain 70% of the samples and the test dataset will contain 30%, respectively.

    from sklearn.model_selection import train_test_split
    
    X_wine = df.values[:,1:]
    y_wine = df.values[:,0]
    
    X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine,
        test_size=0.30, random_state=12345)
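
    A quick check of the resulting shapes (the Wine dataset has 178 samples in total, so roughly 124 should land in the training set and 54 in the test set):

    # Confirm the 70/30 split.
    print('Training samples:', X_train.shape[0])
    print('Test samples:    ', X_test.shape[0])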
    

    Feature Scaling - Standardization

    from sklearn import preprocessing
    
    std_scale = preprocessing.StandardScaler().fit(X_train)
    X_train_std = std_scale.transform(X_train)
    X_test_std = std_scale.transform(X_test)
    

    Dimensionality reduction with PCA

    Now we run PCA on both the standardized and the non-standardized dataset, projecting the data onto a two-dimensional feature subspace.

    from sklearn.decomposition import PCA
    
    # on non-standardized data
    pca = PCA(n_components=2).fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    
    
    # on standardized data
    pca_std = PCA(n_components=2).fit(X_train_std)
    X_train_std = pca_std.transform(X_train_std)
    X_test_std = pca_std.transform(X_test_std)
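
    One way to see why standardization matters here is to compare how much of the variance each principal component captures; a minimal check using the two fitted PCA objects from above:

    # Without standardization, the features with the largest numeric ranges
    # dominate the total variance, so the leading component mostly reflects
    # their scale rather than the overall structure of the data.
    print('Explained variance ratio (non-standardized):', pca.explained_variance_ratio_)
    print('Explained variance ratio (standardized):    ', pca_std.explained_variance_ratio_)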
    

    Let's see how the results look:

    %matplotlib inline
    
    from matplotlib import pyplot as plt
    
    fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,4))
    
    
    for l,c,m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
        ax1.scatter(X_train[y_train==l, 0], X_train[y_train==l, 1],
            color=c, 
            label='class %s' %l, 
            alpha=0.5,
            marker=m
            )
    
    for l,c,m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
        ax2.scatter(X_train_std[y_train==l, 0], X_train_std[y_train==l, 1],
            color=c, 
            label='class %s' %l, 
            alpha=0.5,
            marker=m
            )
    
    ax1.set_title('Transformed NON-standardized training dataset after PCA')    
    ax2.set_title('Transformed standardized training dataset after PCA')    
        
    for ax in (ax1, ax2):
    
        ax.set_xlabel('1st principal component')
        ax.set_ylabel('2nd principal component')
        ax.legend(loc='upper right')
        ax.grid()
    plt.tight_layout()
    
    plt.show() 
    

    Intuitively, you can clearly see that the standardized data is more easily separable after PCA.
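
    To put a number on that visual impression, one could train a simple classifier on both two-dimensional representations. This is a minimal sketch, not part of the original walkthrough; Gaussian naive Bayes is just a convenient choice of model:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # Fit the same simple model on the non-standardized and on the standardized
    # PCA-transformed training data, then compare test accuracy.
    clf = GaussianNB().fit(X_train, y_train)
    clf_std = GaussianNB().fit(X_train_std, y_train)

    print('Test accuracy without standardization: {:.2%}'.format(
        accuracy_score(y_test, clf.predict(X_test))))
    print('Test accuracy with standardization:    {:.2%}'.format(
        accuracy_score(y_test, clf_std.predict(X_test_std))))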
