Chapter 3: Cleaning Data

Author: 了不起的一一 | Published: 2019-09-30 21:33

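    For quantitative (numeric) data

    1. Identifying missing values in the data

    View summary information about the data (`pima` is assumed to be a pandas DataFrame that has already been loaded)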

    pima.info()
    

    Check the size of the data (rows, columns)

    pima.shape
    

    Count the missing values in each column

    pima.isnull().sum()
    

    Compute basic descriptive statistics for the quantitative columns (mean, standard deviation, selected percentiles, minimum, maximum)

    pima.describe()
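The four inspection calls above can be tried on a small frame. This is a minimal sketch on made-up values (the `toy` frame is hypothetical and only borrows column names from the article). It shows that `isnull().sum()` can report no missing values while the `min` row of `describe()` still exposes suspicious zeros.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "serum_insulin": [0, 94, 0, 168],
    "onset_diabetes": [1, 0, 1, 0],
})

missing_counts = toy.isnull().sum()   # counts only real NaN/None entries
minimums = toy.describe().loc["min"]  # per-column minimum taken from describe()

print(toy.shape)
print(missing_counts)
print(minimums)
```

Here every column reports zero missing entries, yet the minimum of `bmi` is 0, which is the kind of value worth questioning.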
    

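    Tip: check whether the statistics are plausible. For example, a minimum BMI of 0 contradicts medical common sense, so the BMI column has a problem: its zeros most likely stand in for missing values.
    Replace those zeros with missing values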

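    import numpy as np
    # Columns where a value of 0 is implausible and stands for "missing"
    cols = ['plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness', 'serum_insulin', 'bmi']
    for col in cols:
        # Assign back: chained inplace=True may not update pima in newer pandas
        pima[col] = pima[col].replace(0, np.nan)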
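To see the effect of the replacement, here is a small sketch on made-up data. The `toy` frame and the `zero_means_missing` list are hypothetical, and `times_pregnant` is included only as an example of a column where 0 is a legitimate value and should be left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only some columns use 0 as a stand-in for "missing".
toy = pd.DataFrame({
    "times_pregnant": [0, 2, 1, 0],   # 0 is a valid count here
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "serum_insulin": [0, 94, 0, 168],
})
zero_means_missing = ["bmi", "serum_insulin"]

cleaned = toy.copy()
for col in zero_means_missing:
    # Replace 0 with NaN so pandas treats the entry as missing
    cleaned[col] = cleaned[col].replace(0, np.nan)

missing_after = cleaned.isnull().sum()
print(missing_after)
print(toy["bmi"].mean(), cleaned["bmi"].mean())
```

After the replacement the zeros show up in the missing-value counts, and the column mean is no longer dragged down by them.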
    

    2. Handling missing values

    2.1 Dropping rows with missing values

    pima_dropped = pima.dropna()
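Dropping rows is simple, but it can discard a large share of the data. The sketch below uses a hypothetical five-row frame with made-up values to measure how many rows survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries spread over two columns.
toy = pd.DataFrame({
    "bmi": [33.6, np.nan, 23.3, 28.1, 30.5],
    "serum_insulin": [np.nan, 94.0, np.nan, 168.0, 88.0],
    "onset_diabetes": [1, 0, 1, 0, 1],
})

dropped = toy.dropna()               # keeps only rows with no missing value
retained = len(dropped) / len(toy)   # fraction of rows that survive

print(f"kept {len(dropped)} of {len(toy)} rows ({retained:.0%})")
```

Checking this ratio on the real data before committing to `dropna()` is a cheap way to judge whether the loss is acceptable.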
    

    2.2 Filling missing values (with the mean)

    pima_no = pima.copy()
    for col_no in cols:
        # Fill each column's missing values with that column's mean
        pima_no[col_no] = pima_no[col_no].fillna(pima_no[col_no].mean())
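Mean filling keeps the column mean unchanged but shrinks its spread, because every filled entry sits exactly at the mean. A minimal sketch on a hypothetical series of made-up BMI values:

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with two missing entries.
bmi = pd.Series([33.6, np.nan, 23.3, 28.1, np.nan], name="bmi")

filled = bmi.fillna(bmi.mean())  # mean() skips NaN by default

print(bmi.mean(), filled.mean())  # the mean is preserved
print(bmi.std(), filled.std())    # the standard deviation gets smaller
```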
    

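    To keep the evaluation honest (generalization), split the data first, then fill both sets with means computed on the training set only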

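    from sklearn.model_selection import train_test_split

    pima_no = pima.copy()
    x_no = pima_no.drop('onset_diabetes', axis=1)
    y_no = pima_no['onset_diabetes']
    x_train, x_test, y_train, y_test = train_test_split(x_no, y_no, random_state=99)
    x_train, x_test = x_train.copy(), x_test.copy()  # avoid writing into slices of x_no
    for col_no in cols:
        train_mean = x_train[col_no].mean()  # computed from training rows only
        x_train[col_no] = x_train[col_no].fillna(train_mean)
        x_test[col_no] = x_test[col_no].fillna(train_mean)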
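The same split-then-fill idea can be expressed with scikit-learn's `SimpleImputer`, which learns the means in `fit` and reuses them in `transform`. This is a sketch on synthetic data; the frame `X`, the labels `y`, and the chosen missing-value pattern are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "bmi": rng.normal(32, 7, 40),
    "age": rng.integers(21, 70, 40).astype(float),
})
X.loc[::5, "bmi"] = np.nan            # knock out every fifth BMI value
y = pd.Series(rng.integers(0, 2, 40))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)  # means learned from training rows
X_test_filled = imputer.transform(X_test)        # same training means reused

print(imputer.statistics_)  # the per-column means that were learned
```

Because `transform` never recomputes the statistics, nothing from the test rows leaks into the values used to fill them.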
    

    2.3 Imputing inside a machine learning pipeline

    Build the pipeline

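    from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
    from sklearn.pipeline import Pipeline
    mean_imputer = Pipeline([('imputer', SimpleImputer(strategy='mean'))])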
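A pipeline whose only step is an imputer behaves like a transformer. The sketch below runs one on a tiny made-up array, using `SimpleImputer` (the current scikit-learn class for this job) and reading back the means the step learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Tiny made-up array with one missing value per column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

pipe = Pipeline([("imputer", SimpleImputer(strategy="mean"))])
X_filled = pipe.fit_transform(X)  # NaN replaced by the column means

learned_means = pipe.named_steps["imputer"].statistics_
print(X_filled)
print(learned_means)
```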
    

    2.4 Comparing the methods

    A typical machine learning workflow

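    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # KNN cannot handle NaN, so this baseline uses the rows left after dropna() in section 2.1
    x_dropped = pima_dropped.drop('onset_diabetes', axis=1)
    y_dropped = pima_dropped['onset_diabetes']
    knn_par = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8]}
    knn = KNeighborsClassifier()
    grid = GridSearchCV(knn, knn_par)  # cross-validated search over n_neighbors
    grid.fit(x_dropped, y_dropped)
    print(grid.best_score_, grid.best_params_)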
    

    Dropping rows with missing values

    # 1. Drop rows that contain missing values
    pima_dropped = pima.dropna()
    # Fit the model on the remaining rows
    x_dropped = pima_dropped.drop('onset_diabetes', axis=1)
    y_dropped = pima_dropped['onset_diabetes']
    knn_par = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8]}
    knn = KNeighborsClassifier()
    grid = GridSearchCV(knn, knn_par)
    grid.fit(x_dropped, y_dropped)
    print(grid.best_score_, grid.best_params_)
    

    Filling missing values

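    # Fill with mean() after the split, so the test set stays unseen (generalization)
    pima_no = pima.copy()
    x_no = pima_no.drop('onset_diabetes', axis=1)
    y_no = pima_no['onset_diabetes']
    x_train, x_test, y_train, y_test = train_test_split(
        x_no, y_no, random_state=99)
    x_train, x_test = x_train.copy(), x_test.copy()  # avoid writing into slices of x_no
    for col_no in cols:
        train_mean = x_train[col_no].mean()  # computed from training rows only
        x_train[col_no] = x_train[col_no].fillna(train_mean)
        x_test[col_no] = x_test[col_no].fillna(train_mean)
    knn_no = KNeighborsClassifier()
    knn_no.fit(x_train, y_train)
    print(knn_no.score(x_test, y_test))  # accuracy on the held-out test set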
    

    Filling inside a pipeline

    # Pipeline: imputation followed by KNN
    from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
    knn_params = {'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
    knn_imuter = KNeighborsClassifier()
    # Median alternative:
    # mean_imputer = Pipeline([('imputer', SimpleImputer(strategy='median')), ('classify', knn_imuter)])
    mean_imputer = Pipeline(
        [('imputer', SimpleImputer(strategy='mean')), ('classify', knn_imuter)])
    x_iputer_mean = pima.drop('onset_diabetes', axis=1)
    y_iputer_mean = pima['onset_diabetes']
    # GridSearchCV refits the whole pipeline, imputer included, on each training fold
    grid_iputer_mean = GridSearchCV(mean_imputer, knn_params)
    grid_iputer_mean.fit(x_iputer_mean, y_iputer_mean)
    print(grid_iputer_mean.best_score_, grid_iputer_mean.best_params_)
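Since the imputer is a named pipeline step, its strategy can be searched together with `n_neighbors` instead of being swapped by hand. The sketch below is one possible extension, not the article's own procedure. It runs on synthetic data from `make_classification` with about 10% of the entries blanked out, and the grid values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with roughly 10% of the entries set to NaN.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("classify", KNeighborsClassifier()),
])
params = {
    "imputer__strategy": ["mean", "median"],  # step name + "__" + parameter name
    "classify__n_neighbors": [1, 3, 5, 7],
}

grid = GridSearchCV(pipe, params, cv=5)
grid.fit(X, y)  # the imputer is refit on the training part of every fold
print(grid.best_score_, grid.best_params_)
```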
    

    3. Standardization and normalization

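    # Standardization and normalization
    from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
    from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
    knn_params_z = {'classify__n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
    knn_imuter_z = KNeighborsClassifier()
    # Min-max scaling: each column rescaled to [0, 1]
    mean_imputer_minmax = Pipeline([('imputer', SimpleImputer(
        strategy='median')), ('standardize', MinMaxScaler()), ('classify', knn_imuter_z)])
    # Row normalization: each row rescaled to unit norm
    mean_imputer_n = Pipeline([('imputer', SimpleImputer(
        strategy='median')), ('standardize', Normalizer()), ('classify', knn_imuter_z)])
    # z-score standardization: each column to mean 0 and standard deviation 1
    mean_imputer_z = Pipeline([('imputer', SimpleImputer(
        strategy='median')), ('standardize', StandardScaler()), ('classify', knn_imuter_z)])
    x_iputer_mean_z = pima.drop('onset_diabetes', axis=1)
    y_iputer_mean_z = pima['onset_diabetes']
    # Only the z-score pipeline is searched here; pass mean_imputer_minmax or mean_imputer_n to compare
    grid_iputer_mean_z = GridSearchCV(mean_imputer_z, knn_params_z)
    grid_iputer_mean_z.fit(x_iputer_mean_z, y_iputer_mean_z)
    print(grid_iputer_mean_z.best_score_, grid_iputer_mean_z.best_params_)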
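To make the difference between the three scalers concrete, here is a small sketch on a made-up 3x2 array whose columns sit on very different scales. It checks the property each scaler is meant to guarantee.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Made-up data: the second column is two orders of magnitude larger.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

minmax = MinMaxScaler().fit_transform(X)    # column-wise: (x - min) / (max - min)
zscore = StandardScaler().fit_transform(X)  # column-wise: (x - mean) / std
rows = Normalizer().fit_transform(X)        # row-wise: x / ||x||_2

print(minmax.min(axis=0), minmax.max(axis=0))  # each column spans 0 to 1
print(zscore.mean(axis=0), zscore.std(axis=0)) # each column has mean 0, std 1
print(np.linalg.norm(rows, axis=1))            # each row has unit length
```

The first two scalers work column by column, which matters for distance-based models such as KNN where a large-scale column would otherwise dominate. `Normalizer` works row by row and does not equalize column scales.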
    
    

    Article link: https://www.haomeiwen.com/subject/iaevyctx.html