Categorical Variables

Author: 1nvad3r | Published: 2020-09-24 22:07

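    A categorical variable is a name that describes the category something belongs to; its values are categorical data. For example, "gender" is a categorical variable whose values are "male" or "female"; "industry" is also a categorical variable, whose values can be "retail", "tourism", "automobile manufacturing", and so on.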
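
    As a small illustration of the definition above, the sketch below builds a toy table and picks out its categorical columns the same way the later snippets do, by checking for the "object" dtype. The column names and values are hypothetical. The dtype is set explicitly because, depending on the pandas version, string columns may be inferred as a dedicated string dtype rather than "object".

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numeric column
df = pd.DataFrame({
    "gender": pd.Series(["male", "female", "female"], dtype="object"),
    "industry": pd.Series(["retail", "tourism", "automobile manufacturing"], dtype="object"),
    "revenue": [12.5, 8.0, 30.1],
})

# Columns stored with the "object" dtype are treated as categorical
categorical_cols = [col for col in df.columns if df[col].dtype == "object"]
print(categorical_cols)

# Distinct values (categories) of one categorical column
print(sorted(df["gender"].unique()))
```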

    Approach 1:
    Drop the categorical variables entirely.

    drop_X_train = X_train.select_dtypes(exclude=['object'])
    drop_X_valid = X_valid.select_dtypes(exclude=['object'])
    
    print("MAE from Approach 1 (Drop categorical variables):")
    print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
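
    The snippets in this post call score_dataset and use X_train, X_valid, y_train and y_valid without defining them. The following is a minimal sketch of what such a helper might look like, assuming it fits a model on the training split and returns the mean absolute error (MAE) on the validation split. The choice of RandomForestRegressor, its hyperparameters, and the toy data are assumptions, not taken from the post.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest on the training split and return the validation MAE."""
    # Assumed model and settings; the original helper may differ
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Hypothetical numeric-only data, standing in for drop_X_train / drop_X_valid
toy_X_train = pd.DataFrame({"rooms": [2, 3, 4, 5], "area": [50, 70, 90, 120]})
toy_X_valid = pd.DataFrame({"rooms": [3, 4], "area": [65, 100]})
toy_y_train = pd.Series([100.0, 150.0, 200.0, 260.0])
toy_y_valid = pd.Series([140.0, 220.0])

toy_mae = score_dataset(toy_X_train, toy_X_valid, toy_y_train, toy_y_valid)
print(toy_mae)
```

    A lower MAE means better predictions, so the three approaches can be compared by the numbers they print.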
    

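    Approach 2:
    Label-encode the categorical variables, mapping each category to an integer. Columns whose sets of categories differ between the training and validation data are dropped first.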

    from sklearn.preprocessing import LabelEncoder
    
    # All categorical columns
    object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
    
    # Columns that can be safely label encoded
    good_label_cols = [col for col in object_cols if set(X_train[col]) == set(X_valid[col])]
            
    # Problematic columns that will be dropped from the dataset
    bad_label_cols = list(set(object_cols)-set(good_label_cols))
    
    # Drop categorical columns that will not be encoded
    label_X_train = X_train.drop(bad_label_cols, axis=1)
    label_X_valid = X_valid.drop(bad_label_cols, axis=1)
    
    # Apply label encoder to each column with categorical data
    label_encoder = LabelEncoder()
    
    for col in set(good_label_cols):
        label_X_train[col] = label_encoder.fit_transform(X_train[col])
        label_X_valid[col] = label_encoder.transform(X_valid[col])
    
    print("MAE from Approach 2 (Label Encoding):") 
    print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
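
    To make the idea behind label encoding concrete, here is a plain-Python sketch with hypothetical values. It mirrors the general behaviour of LabelEncoder (categories are sorted and each one gets its position as an integer code) but is not the library's implementation. It also shows why the code above keeps only columns whose categories match between the two splits: a category never seen during fitting has no code.

```python
# Hypothetical toy column values, not taken from the original dataset
train_values = ["retail", "tourism", "retail", "automobile"]
valid_values = ["tourism", "retail"]

# Assign each distinct training category an integer by sorted position
classes = sorted(set(train_values))
mapping = {category: code for code, category in enumerate(classes)}

encoded_train = [mapping[v] for v in train_values]
encoded_valid = [mapping[v] for v in valid_values]

# A category that never appeared in training has no code
new_values = ["tourism", "finance"]
unseen = set(new_values) - set(mapping)

print(mapping)
print(encoded_train, encoded_valid)
print(unseen)
```

    With scikit-learn, calling transform on a column that contains such an unseen label raises a ValueError, which is what the good_label_cols filter avoids.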
    

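    Approach 3:
    One-hot encoding. Only columns with fewer than 10 distinct categories are encoded; the remaining categorical columns are dropped.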

    # Get number of unique entries in each column with categorical data
    object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
    d = dict(zip(object_cols, object_nunique))
    
    # Print number of unique entries by column, in ascending order
    print(sorted(d.items(), key=lambda x: x[1]))
    
    # Columns that will be one-hot encoded
    low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
    
    # Columns that will be dropped from the dataset
    high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
    
    print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
    print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
    
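    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    # Apply one-hot encoder to each column with categorical data
    # (scikit-learn >= 1.2 names this parameter sparse_output; older releases use sparse=False)
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)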
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
    OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
    
    
    # One-hot encoding removed index; put it back
    OH_cols_train.index = X_train.index
    OH_cols_valid.index = X_valid.index
    
    # Remove categorical columns (will replace with one-hot encoding)
    num_X_train = X_train.drop(object_cols, axis=1)
    num_X_valid = X_valid.drop(object_cols, axis=1)
    
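    # Add one-hot encoded columns to numerical features
    OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
    OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
    
    # Recent scikit-learn releases reject column names that mix strings and integers
    OH_X_train.columns = OH_X_train.columns.astype(str)
    OH_X_valid.columns = OH_X_valid.columns.astype(str)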
    
    print("MAE from Approach 3 (One-Hot Encoding):") 
    print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
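
    The effect of one-hot encoding can be sketched in plain Python with hypothetical values. The helper one_hot below is illustrative only and is not part of scikit-learn. Each known category becomes its own 0/1 column, and a category that was not seen in training produces a row of all zeros, which is the behaviour requested by handle_unknown='ignore' above.

```python
def one_hot(values, categories):
    """Return one 0/1 row per value, with one column per known category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# Hypothetical toy column values, not taken from the original dataset
train_values = ["retail", "tourism", "retail", "automobile"]
categories = sorted(set(train_values))

train_rows = one_hot(train_values, categories)
# "finance" is unknown, so its row contains only zeros
valid_rows = one_hot(["tourism", "finance"], categories)

print(categories)
print(train_rows)
print(valid_rows)
```

    Each encoded column adds as many new columns as it has categories, which is one reason the code above encodes only columns with fewer than 10 distinct values.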
    
