The Difference Between Scaling and Normalization

Author: 生信编程日常 | Published: 2020-10-15 14:08

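    Scaling and normalization are two of the most common steps in data preprocessing, yet they are often confused with each other, and many online discussions of them are muddled.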

    import pandas as pd
    import numpy as np
    
    # for Box-Cox Transformation
    from scipy import stats
    
    # for min_max scaling
    from mlxtend.preprocessing import minmax_scaling
    from sklearn import preprocessing
    
    # plotting modules
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # set seed for reproducibility
    np.random.seed(0)
    
    1. Scaling

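    Feature scaling changes the range of the data without changing the shape of its distribution. Typical examples are min-max scaling and Z-score standardization (there are four main methods; see Feature_scaling).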

    Min-Max scale:

    original_data = np.random.beta(5, 1, 1000) * 60
    
    # min-max scale the data between 0 and 1
    scaled_data = minmax_scaling(original_data, columns=[0])
    # or, with scikit-learn (same values, returned as a 1-D array)
    scaled_data = preprocessing.minmax_scale(original_data)
    
    # plot both together to compare
    fig, ax = plt.subplots(1, 2)
    sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
    ax[0].set_title("Original Data")
    sns.histplot(scaled_data, kde=True, stat="density", ax=ax[1])
    ax[1].set_title("Scaled data")
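
The min-max formula itself is short enough to write by hand. The sketch below is an illustration rather than the article's own code: it assumes a freshly generated sample (the `rng` and `data` names are hypothetical) and checks that x' = (x - min) / (max - min) agrees with `preprocessing.minmax_scale`.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample drawn from the same kind of distribution as the article's data
rng = np.random.default_rng(0)
data = rng.beta(5, 1, 1000) * 60

# Min-max scaling by hand: x' = (x - min) / (max - min)
manual = (data - data.min()) / (data.max() - data.min())

# The library call should produce the same values
library = preprocessing.minmax_scale(data)

print("range after scaling:", manual.min(), manual.max())
print("matches scikit-learn:", np.allclose(manual, library))
```

Because the map is linear, every value keeps its relative position between the minimum and the maximum; only the axis changes.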
    

    Z-score:

    # StandardScaler expects a 2-D array, hence the reshape to a single column
    s_scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
    df_s = s_scaler.fit_transform(original_data.reshape(-1, 1))
    
    # plot both together to compare
    fig, ax = plt.subplots(1, 2)
    sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
    ax[0].set_title("Original Data")
    sns.histplot(df_s.ravel(), kde=True, stat="density", ax=ax[1])
    ax[1].set_title("Scaled data")
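
The Z-score is likewise a single linear formula, z = (x - mean) / std. The following is a minimal sketch on an assumed sample (not the article's exact array) comparing a hand-written version with `StandardScaler`, which uses the population standard deviation (`ddof=0`), the same default as `numpy.std`.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical sample drawn from the same kind of distribution as the article's data
rng = np.random.default_rng(0)
data = rng.beta(5, 1, 1000) * 60

# Z-score by hand: subtract the mean, divide by the population standard deviation
manual = (data - data.mean()) / data.std()

# StandardScaler works on 2-D input, so reshape to one column and flatten the result
library = preprocessing.StandardScaler().fit_transform(data.reshape(-1, 1)).ravel()

print("matches scikit-learn:", np.allclose(manual, library))
print("mean and std after scaling:", manual.mean(), manual.std())
```

After the transform the mean is (numerically) zero and the standard deviation is one, but the histogram keeps the same skewed shape.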
    
    2. Normalization

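    Normalization, by contrast, changes the shape of the distribution. A Box-Cox transformation, for example, can bring the data close to a normal distribution.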

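    # normalize the beta-distributed data with Box-Cox (requires strictly positive values);
    # with lmbda omitted, boxcox returns the transformed data and the fitted lambda
    normalized_data, fitted_lambda = stats.boxcox(original_data)
    
    # plot both together to compare
    fig, ax = plt.subplots(1, 2)
    sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
    ax[0].set_title("Original Data")
    sns.histplot(normalized_data, kde=True, stat="density", ax=ax[1])
    ax[1].set_title("Normalized data")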
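
To make the transformation less of a black box, the sketch below applies the Box-Cox formula by hand, using the lambda that `stats.boxcox` fits by maximum likelihood. The sample and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical strictly positive sample (Box-Cox is undefined for values <= 0)
rng = np.random.default_rng(0)
data = rng.beta(5, 1, 1000) * 60

# With lmbda omitted, boxcox returns (transformed values, fitted lambda)
transformed, fitted_lambda = stats.boxcox(data)

# Box-Cox formula: log(x) when lambda == 0, otherwise (x**lambda - 1) / lambda
if fitted_lambda == 0:
    manual = np.log(data)
else:
    manual = (data**fitted_lambda - 1) / fitted_lambda

print("fitted lambda:", fitted_lambda)
print("matches scipy:", np.allclose(manual, transformed))
```

Unlike min-max or Z-score scaling, this map is nonlinear in x, which is why it can change the shape of the histogram.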
    

    Now try a different distribution:

    original_data = np.random.exponential(size=1000)
    # normalize the exponential data with Box-Cox;
    # boxcox returns the transformed data and the fitted lambda
    normalized_data, fitted_lambda = stats.boxcox(original_data)
    
    # plot both together to compare
    fig, ax = plt.subplots(1, 2)
    sns.histplot(original_data, kde=True, stat="density", ax=ax[0])
    ax[0].set_title("Original Data")
    sns.histplot(normalized_data, kde=True, stat="density", ax=ax[1])
    ax[1].set_title("Normalized data")
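
The contrast between the two operations can also be checked numerically instead of by eye. This sketch uses sample skewness as a rough, assumed measure of "shape" (the article itself only compares plots) on a hypothetical exponential sample.

```python
import numpy as np
from scipy import stats

# Hypothetical exponential sample, strongly right-skewed
rng = np.random.default_rng(0)
data = rng.exponential(size=1000)

# Scaling: a linear map, so the skewness should be unchanged
scaled = (data - data.min()) / (data.max() - data.min())

# Normalization: Box-Cox reshapes the distribution toward symmetry
normalized, _ = stats.boxcox(data)

print("skew original  :", stats.skew(data))
print("skew scaled    :", stats.skew(scaled))
print("skew normalized:", stats.skew(normalized))
```

The first two numbers should agree, while the third should be much closer to zero, which is the distinction the article draws between scaling and normalization.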
    

    References:

    1. https://www.kaggle.com/alexisbcook/scaling-and-normalization
    2. https://en.wikipedia.org/wiki/Feature_scaling
