About OrdinalEncoder and OneHotEncoder

Author: SeekerLinJunYu | Published 2019-04-29 11:10

    OrdinalEncoder, OneHotEncoder, and get_dummies can all convert discrete categorical features into numeric representations, but they differ in how they do it.

    Encoders that do not add features

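    OrdinalEncoder (LabelEncoder produces the same kind of integer codes, so it is not covered separately; note that LabelEncoder is intended for 1D target labels, while OrdinalEncoder is intended for 2D feature matrices)
    • A method provided by scikit-learn; it maps each category of a feature to an integer between 0 and n-1, where n is the number of categories, so values are not limited to 0 or 1
      • The input to OrdinalEncoder must be a 2D data structure (LabelEncoder takes a 1D array instead)
      • It does not add feature dimensions; it only maps the category values of the feature, which is the key difference from OneHotEncoder
      • This encoding is not suitable for every scikit-learn estimator, however, because the integer codes imply an order that many models (linear models, for example) will treat as meaningful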
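    In: from sklearn import preprocessing
    In: MSSubClass_data = train_df.MSSubClass.astype(str)    # MSSubClass holds numeric codes; cast to str so they are treated as category labels
    In: label_encoder = preprocessing.LabelEncoder()
    In: MSSubClass_data_encoded = label_encoder.fit_transform(MSSubClass_data)    # LabelEncoder takes 1D input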
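The difference in expected input shape between the two encoders can be sketched as follows. This is a minimal illustration, not the author's code: the `values` array is a made-up stand-in for the MSSubClass column, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy column standing in for MSSubClass (values are made up).
values = np.array(["20", "60", "20", "120", "60"])

# OrdinalEncoder expects 2D input of shape (n_samples, n_features).
ordinal = OrdinalEncoder()
ordinal_codes = ordinal.fit_transform(values.reshape(-1, 1))

# LabelEncoder expects 1D input and is meant for target labels.
label = LabelEncoder()
label_codes = label.fit_transform(values)

print(ordinal.categories_)     # learned categories, one array per feature
print(ordinal_codes.ravel())   # float codes, one per sample
print(label_codes)             # integer codes, one per sample
```

Because the values are strings, the categories are sorted lexicographically ("120" before "20" before "60"), so the integer codes carry no numeric meaning.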
    

    Encoders that add features

    OneHotEncoder
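    • Scikit-Learn OneHotEncoder
      • OneHotEncoder is a categorical feature transformer whose output can be consumed by scikit-learn estimators
      • It converts a feature with n categories into n binary features, each taking the value 0 or 1
      • As a result, OneHotEncoder changes the number of features according to the number of categories each feature takes
      • Because it expands the features into binary indicators, the result is mostly zeros. The sparse parameter (renamed sparse_output in scikit-learn 1.2) controls whether a sparse matrix is returned
      • The input must be a 2D data structure
      • Note that if the data used for fit does not contain every possible category, pass handle_unknown='ignore' so that categories unseen during fit are ignored at transform time; otherwise an error is raised: enc = preprocessing.OneHotEncoder(handle_unknown='ignore'). If all categories must be recognized, specify them explicitly: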
    >>> genders = ['female', 'male']
    >>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
    >>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
    >>> enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
    >>> # Note that there are missing categorical values for the 2nd and 3rd
    >>> # features
    >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
    >>> enc.fit(X) 
    OneHotEncoder(categorical_features=None,
           categories=[...],
           dtype=<... 'numpy.float64'>, handle_unknown='error',
           n_values=None, sparse=True)
    >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
    array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])
    
    • How to read array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]]):
      • The first two values, 1, 0, encode genders. Because gender has only two categories, it expands into two features: gender_female is 1 and gender_male is 0
      • The middle four values, 0, 1, 0, 0, encode locations: locations_from_Africa = 0, locations_from_Asia = 1,
        locations_from_Europe = 0, locations_from_US = 0
      • The last four values, 1, 0, 0, 0, encode browsers in the same way.
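The reading above can be checked programmatically. The sketch below assumes only the fitted encoder's `categories_` attribute and rebuilds the column labels by hand; `column_labels` and `active` are illustrative names, not part of the scikit-learn API.

```python
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

enc = OneHotEncoder(categories=[genders, locations, browsers])
enc.fit([['male', 'from US', 'uses Safari'],
         ['female', 'from Europe', 'uses Firefox']])

row = enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()[0]

# Output columns follow the order of enc.categories_: all gender columns,
# then all location columns, then all browser columns.
column_labels = [value for cats in enc.categories_ for value in cats]
active = [label for label, flag in zip(column_labels, row) if flag == 1.0]
print(active)
```

Each input feature contributes exactly one active column, so `active` holds one label per original feature.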
    Basic usage:
    In: enc = preprocessing.OneHotEncoder()
    In: result = enc.fit_transform(MSSubClass_data.values.reshape(-1,1))
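The effect of handle_unknown can be seen on a small example. This is a sketch under stated assumptions: the training values and the unseen category 'X' are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

train = [['C'], ['S'], ['Q']]
unseen = [['X']]   # 'X' is a made-up category that was not present during fit

# With the default handle_unknown='error', transform rejects unseen categories.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True

# With handle_unknown='ignore', the unseen category becomes an all-zero row.
lenient = OneHotEncoder(handle_unknown='ignore')
lenient.fit(train)
encoded = lenient.transform(unseen).toarray()
print(raised, encoded)
```

An all-zero row means the sample matched none of the categories learned during fit, which downstream models see as "none of the known values".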
    
    get_dummies
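    • get_dummies is a function provided by pandas
      • It works on essentially the same principle as OneHotEncoder
    Basic usage:
    In: all_df = pd.get_dummies(all_df, columns=['MSSubClass'], prefix='MSSubClass')    # replaces MSSubClass with one indicator column per category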
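A small sketch of how get_dummies reshapes a frame, assuming a made-up miniature `all_df` (the LotArea column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame; MSSubClass mirrors the column used in the article.
all_df = pd.DataFrame({
    'MSSubClass': ['20', '60', '20', '120'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# columns=[...] replaces the listed columns with their indicator columns in one call;
# the untouched columns are kept and placed first.
encoded_df = pd.get_dummies(all_df, columns=['MSSubClass'], prefix='MSSubClass')
print(encoded_df.columns.tolist())
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so applying it separately to training and test data can produce different columns.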
    
    [Images: before conversion, after conversion]

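    How can an Encoder be used to modify the original dataset?

    While writing this article, the question that troubled me most was how to use the Encoder API to change specific columns of the original dataset. I later found a piece of code on the scikit-learn website that answers it, reproduced below: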

    from __future__ import print_function
    
    import pandas as pd
    import numpy as np
    
    
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    
    
    np.random.seed(0)
    
    # Read data from Titanic dataset.
    titanic_url = ('https://raw.githubusercontent.com/amueller/'
                   'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
    data = pd.read_csv(titanic_url)
    
    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.
    
    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])
    
    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(                  # apply separate transformations to the numeric and categorical columns
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression(solver='lbfgs'))])
    
    X = data.drop('survived', axis=1)
    y = data['survived']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
    
    

    Source code URL: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
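To see what the preprocessing step produces without downloading the Titanic data, here is a minimal sketch on a made-up in-memory frame. The `toy` rows are invented, and the numeric pipeline (median imputation followed by scaling) is an assumption about a reasonable setup rather than a claim about the only correct one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that imitate a few Titanic columns, including missing values.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, 26.0],
    'fare': [7.25, 71.28, 8.05, np.nan],
    'embarked': ['S', 'C', np.nan, 'S'],
    'sex': ['male', 'female', 'female', 'male'],
})

numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each transformer is applied only to the columns listed for it.
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric, ['age', 'fare']),
    ('cat', categorical, ['embarked', 'sex'])])

features = preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense.
features = features.toarray() if hasattr(features, 'toarray') else features
print(features.shape)
```

The output has 2 scaled numeric columns, 3 indicator columns for embarked ('C', 'S', and the filled-in 'missing'), and 2 for sex, so the original 4 columns become 7.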
