sklearn (1): Data Preprocessing


Author: 海街diary | Published 2018-02-01 23:26

    1 Processing Numeric Features

    1.1 Mean Removal

    The same feature is shifted and scaled across all samples so that it ends up with mean 0 and standard deviation 1.
    import numpy as np
    from sklearn import preprocessing

    # each column is one sample; its features are stacked vertically,
    # i.e., there are 4 samples here, each with 3 features
    data = np.array([[3, -1.5, 2, -5.4],
                    [0, 4, -0.3, 2.1],
                    [1, 3.3, -1.9, -4.3]])
    
    # mean removal
    data_standardized = preprocessing.scale(data, axis=1)
    print(data_standardized)
    print("\nMean = ", data_standardized.mean(axis=1))
    print("Std deviation", data_standardized.std(axis=1))
    

    The result:

    [[ 1.05366545 -0.31079341  0.75045237 -1.49332442]
     [-0.8340361   1.46675314 -1.00659529  0.37387825]
     [ 0.51284962  1.31254733 -0.49546489 -1.32993207]]
    
    Mean =  [ -5.55111512e-17  -1.11022302e-16   0.00000000e+00]
    Std deviation [ 1.  1.  1.]
    
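As a quick sanity check (a sketch added here, not part of the original article), `preprocessing.scale(data, axis=1)` is equivalent to subtracting each row's mean and dividing by each row's standard deviation:

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# scale(axis=1) standardizes along each row: subtract the row mean and
# divide by the row standard deviation (population std, ddof=0)
manual = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
auto = preprocessing.scale(data, axis=1)

print(np.allclose(manual, auto))
```

For reusable pipelines, `StandardScaler` is usually preferred over the `scale` function, since it stores the fitted mean and standard deviation and can apply the same transformation to later data.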

    1.2 Range Scaling (min-max scaling)

    For each feature, subtract its minimum and divide by (maximum - minimum); the original maximum then maps to 1 and the original minimum to 0. Note that MinMaxScaler scales each column independently.

    # scaling
    data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
    data_scaled = data_scaler.fit_transform(data)
    print(data_scaled)  
    

    The result:

    [[ 1.          0.          1.          0.        ]
     [ 0.          1.          0.41025641  1.        ]
     [ 0.33333333  0.87272727  0.          0.14666667]]
    
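The same numbers can be reproduced by hand (a sketch added for illustration): MinMaxScaler works column by column, subtracting the column minimum and dividing by the column range:

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# per-column min-max scaling: (x - min) / (max - min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

auto = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(data)
print(np.allclose(manual, auto))
```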

    1.3 Normalization

    Normalization preserves the signs and relative proportions of the data while rescaling each sample vector to have norm 1.

    data_normalized_l1 = preprocessing.normalize(data, norm='l1', axis=1)
    data_normalized_l2 = preprocessing.normalize(data, norm='l2', axis=1)
    print("L1 norm")
    print(data_normalized_l1)
    print("\n L2 norm")
    print(data_normalized_l2)
    

    The result:

    L1 norm
    [[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
     [ 0.          0.625      -0.046875    0.328125  ]
     [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]
    
     L2 norm
    [[ 0.45017448 -0.22508724  0.30011632 -0.81031406]
     [ 0.          0.88345221 -0.06625892  0.46381241]
     [ 0.17152381  0.56602858 -0.32589524 -0.73755239]]
    
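To verify the claim that each sample ends up with norm 1, the row norms can be checked directly (a small check added here):

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

l1 = preprocessing.normalize(data, norm='l1', axis=1)
l2 = preprocessing.normalize(data, norm='l2', axis=1)

# L1: absolute values of each row sum to 1; L2: each row has unit length
print(np.abs(l1).sum(axis=1))
print(np.linalg.norm(l2, axis=1))
```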

    1.4 Binarization

    Binarization converts a numeric feature vector into boolean values: entries above the threshold become 1 and the rest become 0.

    data_binarized = preprocessing.Binarizer(threshold=0.4).transform(data)
    print("\nBinarized data:")
    print(data_binarized)
    

    The result:

    Binarized data:
    [[ 1.  0.  1.  0.]
     [ 0.  1.  0.  1.]
     [ 1.  1.  0.  0.]]
    

    Of course, NumPy itself supports "condition indexing" (a term I made up), so the same result can be obtained directly. Note that the comparison must be greater-than to match Binarizer, which maps values above the threshold to 1:

    (data > 0.4).astype(np.int32)
    
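As a self-check (added here, not in the original), `Binarizer(threshold=0.4)` agrees with a plain NumPy comparison — greater-than, since Binarizer maps values above the threshold to 1:

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

binarized = preprocessing.Binarizer(threshold=0.4).transform(data)
numpy_version = (data > 0.4).astype(np.int32)

print(np.array_equal(binarized.astype(np.int32), numpy_version))
```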

    2 Encoding Non-numeric Data

    2.1 Label Encoding

    Strings are encoded as integers in the range 0 to n-1.

    label_encoder = preprocessing.LabelEncoder()
    input_classes = ['audi', 'ford', 'toyota', 'ford', 'bwm']
    label_encoder.fit(input_classes)
    print("\nClass mapping:")
    for i, item in enumerate(label_encoder.classes_):
        print(item, "-->", i)
    

    The learned mapping:

    Class mapping:
    audi --> 0
    bwm --> 1
    ford --> 2
    toyota --> 3
    

    Encoding new data:

    labels = ['toyota', 'ford', 'audi']
    # use transform, not fit_transform: fit_transform would refit the
    # encoder on these three labels and discard the mapping learned above
    encoded_labels = label_encoder.transform(labels)
    print("Labels: ", labels)
    print("Encoded Labels: ", encoded_labels)
    

    The encoded result:

    Labels:  ['toyota', 'ford', 'audi']
    Encoded Labels:  [3 2 0]
    

    Decoding uses inverse_transform:

    encoded_labels = [2, 1, 0, 3, 1]
    decoded_labels = label_encoder.inverse_transform(encoded_labels)
    print("Encoded Labels: ", encoded_labels)
    print("Decoded Labels: ", decoded_labels)
    

    The decoded result:

    Encoded Labels:  [2, 1, 0, 3, 1]
    Decoded Labels:  ['ford' 'bwm' 'audi' 'toyota' 'bwm']
    
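One caveat worth adding (not covered in the original): `transform` raises a `ValueError` for any label that was not seen during `fit`, so unseen categories need explicit handling. A minimal sketch, using a hypothetical unseen label 'tesla':

```python
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(['audi', 'ford', 'toyota', 'ford', 'bwm'])

try:
    # 'tesla' was never seen during fit, so this raises ValueError
    label_encoder.transform(['tesla'])
except ValueError:
    print("unseen label rejected")
```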

    2.2 One-Hot Encoding

    One-hot encoding is used to encode categorical data so that different category values are equidistant in ordinary Euclidean space. Note that one-hot encoding is applied column by column.

    data = np.array([[0, 2, 1, 12],
                     [1, 3, 5, 3],
                     [2, 3, 2, 12],
                     [1, 2, 4, 3]])
    encoder = preprocessing.OneHotEncoder()
    encoder.fit(data)
    encoder_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
    print(encoder_vector)
    

    The result:

    [[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
    
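To see where the 11 output dimensions come from, the categories learned for each column can be inspected (a sketch added here; the `categories_` attribute assumes scikit-learn >= 0.20):

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 3],
                 [2, 3, 2, 12],
                 [1, 2, 4, 3]])

encoder = preprocessing.OneHotEncoder()
encoder.fit(data)

# each input column becomes one one-hot group whose width is the
# number of distinct values in that column: 3 + 2 + 4 + 2 = 11
sizes = [len(c) for c in encoder.categories_]
print(sizes)
print(sum(sizes))
```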


        Original link: https://www.haomeiwen.com/subject/phajzxtx.html