30 Using Pandas get_dummies for Feature Processing in Machine Learning

Author: Viterbi | Published: 2022-11-16 14:26


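Categorical features come in two kinds:

Nominal (unordered) categories: sex, color
Ordinal (ordered) categories: rating, level

A rating can be converted directly to 1, 2, 3, 4, 5, because its values have a natural order and magnitude.

A category such as color, however, should not be encoded as 1/2/3/4/5/6/7, because a machine learning model would wrongly assume that those numbers have an order and magnitude.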
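The difference between the two kinds of category can be sketched on a small, made-up DataFrame. The column names `level` and `color` and the mapping `level_order` are assumptions for illustration only and do not come from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: "level" is ordered, "color" is not.
df = pd.DataFrame({
    "level": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "green", "red"],
})

# Ordered category: an explicit mapping keeps low < medium < high.
level_order = {"low": 1, "medium": 2, "high": 3}
df["level_code"] = df["level"].map(level_order)

# Unordered category: integer codes impose an arbitrary (alphabetical) order,
# blue=0 < green=1 < red=2, which means nothing for colors.
df["color_code"] = df["color"].astype("category").cat.codes

print(df)
```

The integer codes for `level` are meaningful, while those for `color` are not, which is why the color column should be one-hot encoded instead.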

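get_dummies handles features such as color and sex. The technique is also called one-hot encoding.

For example:

Male: 1 0
Female: 0 1

Each category gets its own 0/1 column, and exactly one column is 1 in every row. This is the standard way to encode nominal categorical features for machine learning.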
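To make the mechanics concrete, here is a minimal hand-rolled sketch with NumPy. It is not how pandas implements get_dummies; the arrays `values` and `categories` are made up, and the column order (male first) is chosen only to match the example above.

```python
import numpy as np

values = np.array(["male", "female", "female", "male"])
categories = np.array(["male", "female"])  # column order chosen to match the text

# Compare every value against every category: one 0/1 column per category.
one_hot = (values[:, None] == categories[None, :]).astype(int)

print(one_hot)
# Every row contains exactly one 1.
print(one_hot.sum(axis=1))
```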

1. Load the Titanic dataset

import pandas as pd

df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
df_train.head()


	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

df_train.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
df_train.head()


	PassengerId	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked
0	1	0	3	male	22.0	1	7.2500	S
1	2	1	1	female	38.0	1	71.2833	C
2	3	1	3	female	26.0	0	7.9250	S
3	4	1	1	female	35.0	1	53.1000	S
4	5	0	3	male	35.0	0	8.0500	S

df_train.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB

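Feature description:

Numeric feature: Fare
Ordered categorical feature: Age
Unordered categorical features: PassengerId, Pclass, Sex, SibSp, Parch, Embarked

PassengerId is a row identifier and is left unencoded in step 3.

Survived is the label to predict.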

2. Ordered categorical features can be handled as numbers

# Fill missing values with the mean age
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())

df_train.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            891 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB
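The effect of the fill step can be checked on a tiny example. The Series below is made up for illustration; only the `fillna(mean)` pattern comes from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

# Series.mean() skips NaN by default, so the mean uses only the known ages.
mean_age = ages.mean()
filled = ages.fillna(mean_age)

print("missing before:", ages.isna().sum())
print("missing after: ", filled.isna().sum())
print(filled)
```

Mean imputation is the choice made in this article. Other strategies, such as the median, are equally easy to swap in.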

3. Unordered categorical features can be encoded with get_dummies

This is simply one-hot encoding.

# Encoding a single Series
pd.get_dummies(df_train["Sex"]).head()


	female	male
0	0	1
1	1	0
2	1	0
3	1	0
4	0	1

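Note: with one-hot encoding you usually drop one of the generated columns. Otherwise you run into the dummy variable trap: a passenger who is not male must be female, so either column can be derived from the other. See https://www.geeksforgeeks.org/ml-dummy-variable-trap-in-regression-models/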
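The redundancy can be seen numerically. The sketch below uses a made-up `Sex` column and a hypothetical helper `with_intercept`; it builds a design matrix with an intercept column and compares its rank with and without the extra dummy column.

```python
import numpy as np
import pandas as pd

sex = pd.Series(["male", "female", "female", "male", "female"], name="Sex")

full = pd.get_dummies(sex, dtype=int)                      # columns: female, male
reduced = pd.get_dummies(sex, dtype=int, drop_first=True)  # column: male

def with_intercept(dummies):
    """Prepend a column of ones, as a linear model with an intercept would."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])

# female + male equals the intercept column, so 3 columns only have rank 2.
rank_full = np.linalg.matrix_rank(with_intercept(full))
# After dropping one dummy, the 2 remaining columns are independent.
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(rank_full, rank_reduced)
```

The rank deficiency matters most for unregularized linear models with an intercept. Regularized or tree-based models are generally less sensitive to it.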

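# Convenient form: encode several columns of the DataFrame in one call
needcode_cat_columns = ["Pclass","Sex","SibSp","Parch","Embarked"]
df_coded = pd.get_dummies(
    df_train,
    # columns to encode
    columns=needcode_cat_columns,
    # prefixes for the generated column names
    prefix=needcode_cat_columns,
    # also create an indicator column for missing values
    dummy_na=True,
    # drop the first of the k levels (avoids the dummy variable trap)
    drop_first=True
)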

df_coded.head()


	PassengerId	Survived	Age	Fare	Pclass_3.0	Sex_male	SibSp_1.0	...	Embarked_S
0	1	0	22.0	7.2500	1	1	1	...	1
1	2	1	38.0	71.2833	0	0	1	...	0
2	3	1	26.0	7.9250	1	0	0	...	1
3	4	1	35.0	53.1000	0	0	1	...	1
4	5	0	35.0	8.0500	1	1	0	...	1

5 rows × 26 columns
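How `dummy_na` and `drop_first` interact is easier to see on a small frame. The data below is made up, and `dtype=int` is added only so the output shows 0/1 regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Embarked value.
toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q", "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46, 53.10],
})

coded = pd.get_dummies(
    toy,
    columns=["Embarked"],
    prefix=["Embarked"],
    dummy_na=True,    # adds an indicator column for missing values
    drop_first=True,  # drops the first level (here "C")
    dtype=int,
)

print(coded.columns.tolist())
print(coded)
```

The dropped level "C" becomes the baseline: a row with "C" has 0 in every Embarked column, while the row with the missing value gets a 1 in the extra indicator column.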

4. Training a machine learning model

y = df_coded.pop("Survived")
y.head()



    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64


X = df_coded
X.head()


	PassengerId	Age	Fare	Pclass_3.0	Sex_male	SibSp_1.0	...	Embarked_S
0	1	22.0	7.2500	1	1	1	...	1
1	2	38.0	71.2833	0	0	1	...	0
2	3	26.0	7.9250	1	0	0	...	1
3	4	35.0	53.1000	0	0	1	...	1
4	5	35.0	8.0500	1	1	0	...	1

5 rows × 25 columns

from sklearn.linear_model import LogisticRegression
# Create the model object
logreg = LogisticRegression(solver='liblinear')

# Train the model
logreg.fit(X, y)



    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='warn', n_jobs=None, penalty='l2',
                       random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                       warm_start=False)


logreg.score(X, y)


    0.8148148148148148
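Note that the score above is computed on the same rows the model was trained on, so it is a training accuracy. A held-out estimate could be obtained roughly as follows. This is only a sketch: `X_demo` and `y_demo` are synthetic stand-ins for the encoded X and y, and the 25% split and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(solver="liblinear")
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = model.score(X_te, y_te)   # accuracy on held-out rows
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```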
