30 Using Pandas get_dummies for Feature Processing in Machine Learning

Author: Viterbi | Published: 2022-11-16 14:26

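    There are two kinds of categorical features:

    • Nominal (unordered) categories: gender, color
    • Ordinal (ordered) categories: rating, level

    A rating can be converted directly into the numbers 1, 2, 3, 4, 5, because its values have a natural order and magnitude.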
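As a minimal sketch of this kind of ordinal encoding, the snippet below maps rating labels to integers. The labels `poor`, `fair`, `good` and their order are assumptions made up for illustration; they are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical rating labels; the order poor < fair < good is an assumption.
ratings = pd.Series(["poor", "good", "fair", "poor"])
rating_order = {"poor": 1, "fair": 2, "good": 3}

# Series.map replaces each label with its rank, preserving the ordering.
encoded = ratings.map(rating_order)
print(encoded.tolist())
```

Because the integers keep the order of the labels, a model can meaningfully compare them.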

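    For a category such as color, however, encoding the values as 1/2/3/4/5/6/7 is inappropriate, because a machine learning model would wrongly assume that these numbers have an order and magnitude.

    get_dummies is meant for features like color and gender; the technique is also called one-hot encoding.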

    For example:

    • male: 1 0
    • female: 0 1

    This is one-hot encoding, a standard way to prepare categorical features for machine learning.
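The male/female example can be reproduced on a small, made-up Series. This is only a sketch: the four values are invented, and `dtype=int` is passed so the result shows 0/1 regardless of pandas version (newer versions default to booleans). Note that pandas orders the generated columns alphabetically, so `female` comes before `male`.

```python
import pandas as pd

# Toy data for illustration only.
sex = pd.Series(["male", "female", "female", "male"])

# One column per category; exactly one column is 1 in every row.
one_hot = pd.get_dummies(sex, dtype=int)
print(one_hot)
```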

    1. Load the Titanic dataset

    import pandas as pd
    
    df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
    df_train.head()
    
       PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
    0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
    1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
    2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
    3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
    4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

    df_train.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
    df_train.head()
    
       PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
    0  1            0         3       male    22.0  1      0      7.2500   S
    1  2            1         1       female  38.0  1      0      71.2833  C
    2  3            1         3       female  26.0  0      0      7.9250   S
    3  4            1         1       female  35.0  1      0      53.1000  S
    4  5            0         3       male    35.0  0      0      8.0500   S

    df_train.info()
    
    
        <class 'pandas.core.frame.DataFrame'>
        RangeIndex: 891 entries, 0 to 890
        Data columns (total 9 columns):
        PassengerId    891 non-null int64
        Survived       891 non-null int64
        Pclass         891 non-null int64
        Sex            891 non-null object
        Age            714 non-null float64
        SibSp          891 non-null int64
        Parch          891 non-null int64
        Fare           891 non-null float64
        Embarked       889 non-null object
        dtypes: float64(2), int64(5), object(2)
        memory usage: 62.8+ KB
    

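    Feature description:

    • Numeric feature: Fare
    • Ordered feature: Age (a number with a natural order, so it can be used as-is)
    • Unordered categorical features, as treated in this article: Pclass, Sex, SibSp, Parch, Embarked
    • PassengerId is only a row identifier; it is left unchanged and is not encoded below

    Survived is the label to predict.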

    2. Ordered features can be handled as numbers

    # Fill missing Age values with the mean age
    df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())
    
    df_train.info()
    
        <class 'pandas.core.frame.DataFrame'>
        RangeIndex: 891 entries, 0 to 890
        Data columns (total 9 columns):
        PassengerId    891 non-null int64
        Survived       891 non-null int64
        Pclass         891 non-null int64
        Sex            891 non-null object
        Age            891 non-null float64
        SibSp          891 non-null int64
        Parch          891 non-null int64
        Fare           891 non-null float64
        Embarked       889 non-null object
        dtypes: float64(2), int64(5), object(2)
        memory usage: 62.8+ KB
        
    

    3. Unordered categorical features can be encoded with get_dummies

    This is simply one-hot encoding.

    # series
    pd.get_dummies(df_train["Sex"]).head()
    
       female  male
    0  0       1
    1  1       0
    2  1       0
    3  1       0
    4  0       1

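    Note: with one-hot encoding you generally drop one of the generated columns, otherwise you run into the dummy variable trap. In this dataset every passenger is either male or female, so each of the two columns can be derived from the other. See https://www.geeksforgeeks.org/ml-dummy-variable-trap-in-regression-models/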
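The redundancy behind the dummy variable trap can be checked directly. The sketch below uses a made-up Series (not the Titanic file) and `dtype=int` for version-independent 0/1 output; it verifies that `male` is always `1 - female` and then shows that `drop_first=True` keeps only one of the two columns.

```python
import pandas as pd

# Toy data for illustration only.
sex = pd.Series(["male", "female", "female", "male"])

full = pd.get_dummies(sex, dtype=int)
# The two columns are perfectly collinear: one determines the other.
is_redundant = (full["male"] == 1 - full["female"]).all()

# drop_first=True removes the first (alphabetically smallest) category column.
reduced = pd.get_dummies(sex, dtype=int, drop_first=True)
print(is_redundant, list(reduced.columns))
```

With the `female` column dropped, a row of 0 in `male` still unambiguously means female, so no information is lost.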

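    # Shortcut: encode several columns of the DataFrame in one call
    needcode_cat_columns = ["Pclass","Sex","SibSp","Parch","Embarked"]
    df_coded = pd.get_dummies(
        df_train,
        # Columns to encode
        columns=needcode_cat_columns,
        # Prefixes for the generated column names
        prefix=needcode_cat_columns,
        # Also create a dummy column for missing values
        dummy_na=True,
        # Drop one of the k dummy columns per feature (avoids the dummy variable trap)
        drop_first=True
    )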
    
    df_coded.head()
    
       PassengerId  Survived  Age   Fare     Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
    0  1            0         22.0  7.2500   0           1           0           1         0        1          ...  0          0          0          0          0          0          0          0           1           0
    1  2            1         38.0  71.2833  0           0           0           0         0        1          ...  0          0          0          0          0          0          0          0           0           0
    2  3            1         26.0  7.9250   0           1           0           0         0        0          ...  0          0          0          0          0          0          0          0           1           0
    3  4            1         35.0  53.1000  0           0           0           0         0        1          ...  0          0          0          0          0          0          0          0           1           0
    4  5            0         35.0  8.0500   0           1           0           1         0        0          ...  0          0          0          0          0          0          0          0           1           0

    5 rows × 26 columns
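To see what `dummy_na=True` and `drop_first=True` do together, here is a small sketch on a made-up Series with one missing value. The values only imitate the Embarked column; `dtype=int` is an added assumption to keep the output as 0/1 on newer pandas versions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column, with one missing value.
embarked = pd.Series(["S", "C", np.nan, "Q"])

coded = pd.get_dummies(
    embarked,
    prefix="Embarked",
    dummy_na=True,    # adds an Embarked_nan indicator column
    drop_first=True,  # drops the first category, here "C"
    dtype=int,
)
print(coded)
```

The row for `C` is all zeros, since `C` became the dropped baseline category, while the missing value is marked by a 1 in `Embarked_nan`.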

    4. Train a machine learning model

    y = df_coded.pop("Survived")
    y.head()
    
    
    
        0    0
        1    1
        2    1
        3    1
        4    0
        Name: Survived, dtype: int64
    
    
    X = df_coded
    X.head()
    
       PassengerId  Age   Fare     Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  SibSp_2.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
    0  1            22.0  7.2500   0           1           0           1         0        1          0          ...  0          0          0          0          0          0          0          0           1           0
    1  2            38.0  71.2833  0           0           0           0         0        1          0          ...  0          0          0          0          0          0          0          0           0           0
    2  3            26.0  7.9250   0           1           0           0         0        0          0          ...  0          0          0          0          0          0          0          0           1           0
    3  4            35.0  53.1000  0           0           0           0         0        1          0          ...  0          0          0          0          0          0          0          0           1           0
    4  5            35.0  8.0500   0           1           0           1         0        0          0          ...  0          0          0          0          0          0          0          0           1           0

    5 rows × 25 columns

    from sklearn.linear_model import LogisticRegression
    # Create the model object
    logreg = LogisticRegression(solver='liblinear')
    
    # Train the model
    logreg.fit(X, y)
    
    
    
        LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                           intercept_scaling=1, l1_ratio=None, max_iter=100,
                           multi_class='warn', n_jobs=None, penalty='l2',
                           random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                           warm_start=False)
    
    
    # Mean accuracy on the training data itself (no held-out test set is used here)
    logreg.score(X, y)
    
    
        0.8148148148148148
    
    
    
    

