30 Using Pandas get_dummies for Feature Processing in Machine Learning

Author: Viterbi | Published: 2022-11-16 14:26

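Categorical features come in two kinds:

  • Nominal (unordered) categories: gender, color
  • Ordinal (ordered) categories: ratings, levels

A rating can be converted directly to the numbers 1, 2, 3, 4, 5, because its values have a natural order and magnitude.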
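
As a minimal sketch of this mapping, assume a hypothetical rating column with the levels "low", "medium" and "high" (these names are illustrative and do not come from the Titanic data). An explicit dictionary keeps the order of the integers under your control:

```python
import pandas as pd

# Hypothetical ordinal feature; the level names are made up for illustration
ratings = pd.Series(["low", "high", "medium", "low"], name="rating")

# Explicit mapping so the integers follow the real order of the levels
rating_order = {"low": 1, "medium": 2, "high": 3}
rating_codes = ratings.map(rating_order)

print(rating_codes.tolist())
```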

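For a category such as color, however, encoding the values as 1/2/3/4/5/6/7 is inappropriate, because a machine learning model would wrongly assume that these numbers have an order and a magnitude.

get_dummies is meant for features such as color and gender. This treatment is also called one-hot encoding.

For example:

  • Male: 1 0
  • Female: 0 1

This is one-hot encoding, the way categorical features are prepared for machine learning.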
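
The idea can be tried on a toy Series before touching real data. This is only a sketch: the four values are made up, and dtype=int is passed because newer pandas versions may return True/False columns by default. Note that get_dummies sorts the generated columns, so female comes before male here.

```python
import pandas as pd

# Toy nominal feature, used only for illustration
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# One column per category; dtype=int forces 0/1 values
sex_onehot = pd.get_dummies(sex, dtype=int)

print(sex_onehot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.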

1. Load the Titanic dataset

import pandas as pd

df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
df_train.head()

       PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
    0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
    1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
    2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
    3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
    4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

df_train.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
df_train.head()

       PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
    0  1            0         3       male    22.0  1      0      7.2500   S
    1  2            1         1       female  38.0  1      0      71.2833  C
    2  3            1         3       female  26.0  0      0      7.9250   S
    3  4            1         1       female  35.0  1      0      53.1000  S
    4  5            0         3       male    35.0  0      0      8.0500   S

df_train.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB

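Feature overview:

  • Numerical feature: Fare
  • Ordered feature: Age
  • Unordered categorical features: PassengerId, Pclass, Sex, SibSp, Parch, Embarked (PassengerId is only an identifier and is left unencoded in the code below)

Survived is the label to predict.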
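
One rough way to arrive at such a grouping is to look at each column's dtype and number of distinct values. The sketch below uses a small made-up frame that mimics a few Titanic columns; the frame and the "few distinct values" rule of thumb are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Small made-up frame that mimics a few Titanic columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Few distinct values hints at a categorical feature, many at a numerical one
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})

print(summary)
```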

2. Ordered features can be treated as numbers

# Fill missing values with the mean age
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())

df_train.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            891 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB
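
To see what the fill step does, here is a minimal sketch on a made-up Series (the four ages are illustrative). Series.mean() skips missing values by default, so the mean is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([22.0, np.nan, 26.0, 36.0], name="Age")

# mean() ignores NaN, so this is (22 + 26 + 36) / 3
ages_filled = ages.fillna(ages.mean())

print(ages_filled.tolist())
```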
    

3. Unordered categorical features can be encoded with get_dummies

This is simply one-hot encoding.

# Encode a single Series
pd.get_dummies(df_train["Sex"]).head()

       female  male
    0  0       1
    1  1       0
    2  1       0
    3  1       0
    4  0       1

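Note: with one-hot encoding you generally drop one of the generated columns, otherwise you run into the dummy variable trap. In this data a passenger who is not male is female, so each of the two columns can be derived from the other. See https://www.geeksforgeeks.org/ml-dummy-variable-trap-in-regression-models/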
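
The redundancy is easy to check on a toy Series. This sketch uses made-up values and passes dtype=int so the columns hold 0/1 regardless of the pandas version:

```python
import pandas as pd

# Toy feature, used only for illustration
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Full encoding: one column per category (female, male)
full = pd.get_dummies(sex, dtype=int)

# Reduced encoding: the first category (female) is dropped
reduced = pd.get_dummies(sex, dtype=int, drop_first=True)

# The dropped column can be rebuilt from the remaining one
rebuilt_female = 1 - reduced["male"]

print(full.sum(axis=1).tolist())
print(rebuilt_female.tolist())
```

Because the two full columns always add up to 1, keeping both gives a linear model no extra information.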

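# Convenience approach: encode the chosen columns directly on the DataFrame
needcode_cat_columns = ["Pclass","Sex","SibSp","Parch","Embarked"]
df_coded = pd.get_dummies(
    df_train,
    # Columns to encode
    columns=needcode_cat_columns,
    # Prefixes for the generated column names
    prefix=needcode_cat_columns,
    # Also encode missing values as their own column
    dummy_na=True,
    # Drop the first of the k levels (avoids the dummy variable trap)
    drop_first=True
)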

df_coded.head()

       PassengerId  Survived  Age   Fare     Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
    0  1            0         22.0  7.2500   0           1           0           1         0        1          ...  0          0          0          0          0          0          0          0           1           0
    1  2            1         38.0  71.2833  0           0           0           0         0        1          ...  0          0          0          0          0          0          0          0           0           0
    2  3            1         26.0  7.9250   0           1           0           0         0        0          ...  0          0          0          0          0          0          0          0           1           0
    3  4            1         35.0  53.1000  0           0           0           0         0        1          ...  0          0          0          0          0          0          0          0           1           0
    4  5            0         35.0  8.0500   0           1           0           1         0        0          ...  0          0          0          0          0          0          0          0           1           0

5 rows × 26 columns
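
The _nan columns in this output come from dummy_na=True, and the missing first level of each feature comes from drop_first=True. A minimal sketch on a made-up Embarked column (the four values are illustrative, and dtype=int is passed only to get 0/1 values on any pandas version):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "Q"]})

coded = pd.get_dummies(
    toy,
    columns=["Embarked"],
    dummy_na=True,    # adds an Embarked_nan indicator column
    drop_first=True,  # drops the first level, here Embarked_C
    dtype=int,
)

print(coded.columns.tolist())
print(coded)
```

The row that held "C" becomes all zeros, and the row that held the missing value gets a 1 in Embarked_nan.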

4. Train a machine learning model

y = df_coded.pop("Survived")
y.head()



    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64


X = df_coded
X.head()

       PassengerId  Age   Fare     Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  SibSp_2.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
    0  1            22.0  7.2500   0           1           0           1         0        1          0          ...  0          0          0          0          0          0          0          0           1           0
    1  2            38.0  71.2833  0           0           0           0         0        1          0          ...  0          0          0          0          0          0          0          0           0           0
    2  3            26.0  7.9250   0           1           0           0         0        0          0          ...  0          0          0          0          0          0          0          0           1           0
    3  4            35.0  53.1000  0           0           0           0         0        1          0          ...  0          0          0          0          0          0          0          0           1           0
    4  5            35.0  8.0500   0           1           0           1         0        0          0          ...  0          0          0          0          0          0          0          0           1           0

5 rows × 25 columns

from sklearn.linear_model import LogisticRegression
# Create the model object
logreg = LogisticRegression(solver='liblinear')

# Train the model
logreg.fit(X, y)



    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='warn', n_jobs=None, penalty='l2',
                       random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                       warm_start=False)


logreg.score(X, y)


    0.8148148148148148
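
Note that logreg.score(X, y) is computed on the same rows the model was trained on, so the value above is training accuracy. A minimal sketch of a held-out estimate follows, assuming scikit-learn's train_test_split. The data here is synthetic stand-in data, and the 25% split and random_state are illustrative choices, not the article's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y; replace with the encoded Titanic features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 25% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = LogisticRegression(solver="liblinear")
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows
print(train_acc, test_acc)
```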


