5.data_preprocessing_and_feature


Author: 许志辉Albert | Published 2021-01-28 09:55

    1. Data preprocessing and feature engineering

    1.1 Handling missing values

    import pandas as pd
    from io import StringIO
    
    csv_data = '''A,B,C,D
    1.0,2.0,3.0,4.0
    5.0,6.0,,8.0
    10.0,11.0,12.0,'''
    
    df = pd.read_csv(StringIO(csv_data))
    df
    
          A     B     C    D
    0   1.0   2.0   3.0  4.0
    1   5.0   6.0   NaN  8.0
    2  10.0  11.0  12.0  NaN
    # count the missing values in each column
    df.isnull().sum()
    
    A    0
    B    0
    C    1
    D    1
    dtype: int64

    1.1.1 Dropping samples or features with missing values

    df.dropna()
    
    df.dropna(axis = 1)
    
    # only drop rows where all columns are NaN
    df.dropna(how = 'all')
    
    # drop rows that have fewer than 4 non-NaN values
    df.dropna(thresh = 4)
    
    # only drop rows where NaN appear in specific columns (here: 'C')
    df.dropna(subset=['C'])
    

    1.1.2 Imputing missing values

    from sklearn.preprocessing import Imputer
    
    # mean-impute each column (axis=0); Imputer was removed in newer
    # scikit-learn releases, see the SimpleImputer sketch below
    imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imr = imr.fit(df)
    imputed_data = imr.transform(df.values)
    imputed_data
    

    array([[ 1. ,  2. ,  3. ,  4. ],
           [ 5. ,  6. ,  7.5,  8. ],
           [10. , 11. , 12. ,  6. ]])

    df.values
    

    array([[ 1.,  2.,  3.,  4.],
           [ 5.,  6., nan,  8.],
           [10., 11., 12., nan]])
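
    Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22. A minimal sketch of the modern equivalent, SimpleImputer (imputation is now always column-wise, so there is no axis argument):

    from sklearn.impute import SimpleImputer
    import numpy as np
    
    # same mean imputation as above, expressed with the current API
    imr = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputed_data = imr.fit_transform(df.values)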

    1.1.3 The scikit-learn estimator API

    scikit-learn transformers such as Imputer share a common API: fit learns parameters (here, the column means) from the training data, and transform applies them to any array with the same number of features. Estimators such as classifiers additionally expose predict. The key discipline is to fit only on the training set and then reuse the fitted object on the test set.
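
    A minimal sketch of the fit/transform pattern (the toy arrays are illustrative only):

    import numpy as np
    from sklearn.impute import SimpleImputer
    
    X_train = np.array([[1.0, np.nan], [3.0, 4.0]])
    X_test = np.array([[np.nan, 6.0]])
    
    imr = SimpleImputer(strategy='mean')
    imr.fit(X_train)                      # learn column means from training data only
    X_train_imp = imr.transform(X_train)  # apply the learned means to the training set
    X_test_imp = imr.transform(X_test)    # ...and to the test set, without refitting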

    1.2 Handling categorical data

    import pandas as pd
    
    df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                       ['red', 'L', 13.5, 'class2'],
                       ['blue', 'XL', 15.3, 'class1']])
    
    df.columns = ['color', 'size', 'price', 'classlabel']
    df
    
    

    1.3 Mapping ordinal features

    size_mapping = {'XL': 3,
                    'L': 2,
                    'M': 1}
    
    df.loc[:,'size'] = df['size'].map(size_mapping)
    df
    
    inv_size_mapping = {v: k for k, v in size_mapping.items()}
    df['size'].map(inv_size_mapping)
    
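    An alternative sketch using pandas' ordered Categorical dtype, which keeps the ordering metadata on the column (shown on a fresh example series, since df['size'] now holds the mapped integers; note that .codes start at 0, not 1):

    # M < L < XL as an ordered categorical; codes are 0, 1, 2
    s = pd.Series(['M', 'L', 'XL'])
    s_cat = pd.Categorical(s, categories=['M', 'L', 'XL'], ordered=True)
    s_cat.codes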

    1.3.1 LabelEncoder transformation

    • Models this encoding may suit: tree models
    • Models it may not suit: LR, NN (the integer codes impose an artificial order and magnitude that linear models and neural networks take literally)

    import numpy as np
    
    class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
    class_mapping
    
    {'class1': 0, 'class2': 1}
    df['classlabel'] = df['classlabel'].map(class_mapping)
    df
    
    inv_class_mapping = {v: k for k, v in class_mapping.items()}
    df['classlabel'] = df['classlabel'].map(inv_class_mapping)
    df
    
    from sklearn.preprocessing import LabelEncoder
    
    class_le = LabelEncoder()
    y = class_le.fit_transform(df['classlabel'].values)
    y
    
    array([0, 1, 0])
    class_le.inverse_transform(y)
    
    array(['class1', 'class2', 'class1'], dtype=object)

    1.3.2 One-hot encoding

    Color is nominal rather than ordinal, so integer codes would impose a fake ordering; instead, each category gets its own binary column.

    X = df[['color', 'size', 'price']].values
    
    color_le = LabelEncoder()
    X[:, 0] = color_le.fit_transform(X[:, 0])
    X
    
    array([[1, 1, 10.1],
           [2, 2, 13.5],
           [0, 3, 15.3]], dtype=object)
    from sklearn.preprocessing import OneHotEncoder
    
    # categorical_features selects column 0 ('color') for encoding; the
    # argument was removed in newer scikit-learn releases (see below)
    ohe = OneHotEncoder(categorical_features=[0])
    ohe.fit_transform(X).toarray()
    
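    categorical_features was deprecated in scikit-learn 0.20 and later removed; a minimal sketch of the current approach with ColumnTransformer, encoding column 0 and passing the rest through:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    
    ct = ColumnTransformer(
        [('onehot', OneHotEncoder(), [0])],  # one-hot encode column 0 ('color')
        remainder='passthrough')             # keep 'size' and 'price' unchanged
    ct.fit_transform(X)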
    pd.get_dummies(df[['price', 'color', 'size']])
    
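    For linear models it can also help to drop one dummy column per encoded feature, since the full set of dummies is perfectly collinear; pandas supports this directly:

    # one redundant level per encoded feature is dropped
    pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)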

    1.4 Splitting the data into training and test sets

    df_wine = pd.read_csv('./data/wine.data',
                          header=None)
    
    df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                       'Alcalinity of ash', 'Magnesium', 'Total phenols',
                       'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                       'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                       'Proline']
    
    print('Class labels', np.unique(df_wine['Class label']))
    df_wine.head()
    
    # Version and sklearn_version, used in the version checks throughout
    # this notebook:
    from distutils.version import LooseVersion as Version
    from sklearn import __version__ as sklearn_version
    
    if Version(sklearn_version) < '0.18':
        from sklearn.cross_validation import train_test_split
    else:
        from sklearn.model_selection import train_test_split
    
    X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
    
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0)
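
    With class labels like these, it is usually worth stratifying the split so that class proportions are preserved in both sets; a minimal variant using the stratify parameter (the rest of this notebook uses the unstratified split above):

    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0,
                         stratify=y)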
    

    1.5 Scaling continuous features

    Numeric features:

    • Scaling
      • Usually needed by models trained via numerical optimization; it speeds up convergence and can improve accuracy
    • Discretization
      • Its most important effect is adding nonlinear expressive power (see the sketch at the end of this section)

    from sklearn.preprocessing import MinMaxScaler
    mms = MinMaxScaler()
    X_train_norm = mms.fit_transform(X_train)
    X_test_norm = mms.transform(X_test)
    
    from sklearn.preprocessing import StandardScaler
    
    stdsc = StandardScaler()
    X_train_std = stdsc.fit_transform(X_train)
    X_test_std = stdsc.transform(X_test)
    
    ex = pd.DataFrame([0, 1, 2, 3, 4, 5])
    
    # standardize
    ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)
    
    # Please note that pandas uses ddof=1 (sample standard deviation) 
    # by default, whereas NumPy's std method and the StandardScaler
    # uses ddof=0 (population standard deviation)
    
    # normalize
    ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
    ex.columns = ['input', 'standardized', 'normalized']
    ex
    
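    And the discretization mentioned at the top of this section: binning a continuous feature and one-hot encoding the bins lets even a linear model fit a step-wise, nonlinear response. A minimal sketch with pandas (the example values and bin count are illustrative):

    age = pd.Series([5, 17, 23, 40, 61, 78])
    
    # 3 equal-width bins; pd.qcut would give equal-frequency bins instead
    bins = pd.cut(age, bins=3, labels=['low', 'mid', 'high'])
    
    # one binary column per bin, so a model can weight each range separately
    pd.get_dummies(bins)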

    1.6 Feature selection

    1.6.1 Selection via the truncation effect of L1 regularization

    The L1 constraint region is a diamond whose corners lie on the coordinate axes, so the optimal weight vector tends to land where some weights are exactly zero; this is the truncation (sparsity) effect.
    from sklearn.linear_model import LogisticRegression
    
    # note: newer scikit-learn releases require solver='liblinear' (or 'saga')
    # for an L1 penalty
    lr = LogisticRegression(penalty='l1', C=0.1)
    lr.fit(X_train_std, y_train)
    print('Training accuracy:', lr.score(X_train_std, y_train))
    print('Test accuracy:', lr.score(X_test_std, y_test))
    

    Training accuracy: 0.9838709677419355
    Test accuracy: 0.9814814814814815

    lr.intercept_
    

    array([-0.38379501, -0.15808103, -0.70040229])

    lr.coef_
    

    array([[ 0.2800559 ,  0.        ,  0.        , -0.02782253,  0.        ,
             0.        ,  0.70995316,  0.        ,  0.        ,  0.        ,
             0.        ,  0.        ,  1.23687975],
           [-0.64398547, -0.06878116, -0.05721049,  0.        ,  0.        ,
             0.        ,  0.        ,  0.        ,  0.        , -0.9267665 ,
             0.06014006,  0.        , -0.37103485],
           [ 0.        ,  0.06128029,  0.        ,  0.        ,  0.        ,
             0.        , -0.63658702,  0.        ,  0.        ,  0.49830327,
            -0.35836566, -0.57070807,  0.        ]])
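
    Most coefficients are exactly zero, which is the truncation effect at work; a quick check of how many features survive per class:

    # non-zero weights in each one-vs-rest coefficient vector
    np.count_nonzero(lr.coef_, axis=1)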

    import warnings
    warnings.filterwarnings("ignore")
    import matplotlib.pyplot as plt
    
    fig = plt.figure()
    ax = plt.subplot(111)
        
    colors = ['blue', 'green', 'red', 'cyan', 
              'magenta', 'yellow', 'black', 
              'pink', 'lightgreen', 'lightblue', 
              'gray', 'indigo', 'orange']
    
    weights, params = [], []
    for c in np.arange(-4, 6, dtype=float):
        lr = LogisticRegression(penalty='l1', C=10**c, random_state=0)
        lr.fit(X_train_std, y_train)
        weights.append(lr.coef_[1])
        params.append(10**c)
    
    weights = np.array(weights)
    
    for column, color in zip(range(weights.shape[1]), colors):
        plt.plot(params, weights[:, column],
                 label=df_wine.columns[column + 1],
                 color=color)
    plt.axhline(0, color='black', linestyle='--', linewidth=3)
    plt.xlim([10**(-5), 10**5])
    plt.ylabel('weight coefficient')
    plt.xlabel('C')
    plt.xscale('log')
    plt.legend(loc='upper left')
    ax.legend(loc='upper center', 
              bbox_to_anchor=(1.38, 1.03),
              ncol=1, fancybox=True)
    # plt.savefig('./figures/l1_path.png', dpi=300)
    plt.show()
    
    [Figure: L1 regularization path. As C decreases (stronger regularization), the weight coefficients of all 13 features shrink and are eventually driven to exactly zero.]

    1.6.2 Feature subset selection

    from sklearn.base import clone
    from itertools import combinations
    import numpy as np
    from sklearn.metrics import accuracy_score
    if Version(sklearn_version) < '0.18':
        from sklearn.cross_validation import train_test_split
    else:
        from sklearn.model_selection import train_test_split
    
    
    class SBS():
        """Sequential Backward Selection: starting from all features,
        repeatedly drop the feature whose removal hurts the held-out
        score least, until only k_features remain."""
        def __init__(self, estimator, k_features, scoring=accuracy_score,
                     test_size=0.25, random_state=1):
            self.scoring = scoring
            self.estimator = clone(estimator)
            self.k_features = k_features
            self.test_size = test_size
            self.random_state = random_state
    
        def fit(self, X, y):
            
            X_train, X_test, y_train, y_test = \
                train_test_split(X, y, test_size=self.test_size,
                                 random_state=self.random_state)
    
            dim = X_train.shape[1]
            self.indices_ = tuple(range(dim))
            self.subsets_ = [self.indices_]
            score = self._calc_score(X_train, y_train, 
                                     X_test, y_test, self.indices_)
            self.scores_ = [score]
    
            while dim > self.k_features:
                scores = []
                subsets = []
    
                for p in combinations(self.indices_, r=dim - 1):
                    score = self._calc_score(X_train, y_train, 
                                             X_test, y_test, p)
                    scores.append(score)
                    subsets.append(p)
    
                best = np.argmax(scores)
                self.indices_ = subsets[best]
                self.subsets_.append(self.indices_)
                dim -= 1
    
                self.scores_.append(scores[best])
            self.k_score_ = self.scores_[-1]
    
            return self
    
        def transform(self, X):
            return X[:, self.indices_]
    
        def _calc_score(self, X_train, y_train, X_test, y_test, indices):
            self.estimator.fit(X_train[:, indices], y_train)
            y_pred = self.estimator.predict(X_test[:, indices])
            score = self.scoring(y_test, y_pred)
            return score
    
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier
    
    knn = KNeighborsClassifier(n_neighbors=2)
    
    # selecting features
    sbs = SBS(knn, k_features=1)
    sbs.fit(X_train_std, y_train)
    
    # plotting performance of feature subsets
    k_feat = [len(k) for k in sbs.subsets_]
    
    plt.plot(k_feat, sbs.scores_, marker='o')
    plt.ylim([0.7, 1.1])
    plt.ylabel('Accuracy')
    plt.xlabel('Number of features')
    plt.grid()
    plt.tight_layout()
    # plt.savefig('./sbs.png', dpi=300)
    plt.show()
    
    [Figure: held-out accuracy of each SBS feature subset vs. number of features.]
    # subsets_[8] is the 5-feature subset (13 original features, 8 removals)
    k5 = list(sbs.subsets_[8])
    print(df_wine.columns[1:][k5])
    

    Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')

    knn.fit(X_train_std, y_train)
    print('Training accuracy:', knn.score(X_train_std, y_train))
    print('Test accuracy:', knn.score(X_test_std, y_test))
    

    Training accuracy: 0.9838709677419355
    Test accuracy: 0.9444444444444444

    knn.fit(X_train_std[:, k5], y_train)
    print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
    print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))
    

    Training accuracy: 0.9596774193548387
    Test accuracy: 0.9629629629629629

    1.7 Ranking feature importance with tree models

    from sklearn.ensemble import RandomForestClassifier
    
    feat_labels = df_wine.columns[1:]
    
    forest = RandomForestClassifier(n_estimators=10000,
                                    random_state=0,
                                    n_jobs=-1)
    
    forest.fit(X_train, y_train)
    importances = forest.feature_importances_
    
    indices = np.argsort(importances)[::-1]
    
    for f in range(X_train.shape[1]):
        print("%2d) %-*s %f" % (f + 1, 30, 
                                feat_labels[indices[f]], 
                                importances[indices[f]]))
    
    plt.title('Feature Importances')
    plt.bar(range(X_train.shape[1]), 
            importances[indices],
            color='lightblue', 
            align='center')
    
    plt.xticks(range(X_train.shape[1]), 
               feat_labels[indices], rotation=90)
    plt.xlim([-1, X_train.shape[1]])
    plt.tight_layout()
    #plt.savefig('./random_forest.png', dpi=300)
    plt.show()
    
    [Figure: bar chart of the random-forest feature importances, sorted in descending order.]
    if Version(sklearn_version) < '0.18':
        X_selected = forest.transform(X_train, threshold=0.15)
    else:
        from sklearn.feature_selection import SelectFromModel
        sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
        X_selected = sfm.transform(X_train)
    
    X_selected.shape
    

    (124, 3)
    Now, let's print the 3 features that met the threshold criterion we set above (note that this snippet does not appear in the book; it was added to the notebook for illustration):

    for f in range(X_selected.shape[1]):
        print("%2d) %-*s %f" % (f + 1, 30, 
                                feat_labels[indices[f]], 
                                importances[indices[f]]))
    
     1) Color intensity                0.182483
     2) Proline                        0.158610
     3) Flavanoids                     0.150948
