30 Pandas get_dummies for feature processing in machine learning
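Categorical features come in two kinds:
- Nominal (unordered) categories: sex, color
- Ordinal (ordered) categories: rating, level
A rating can be converted directly to 1, 2, 3, 4, 5, because its values have an order and a relative size.
A category such as color should not be encoded as 1/2/3/4/5/6/7, because a machine-learning model would wrongly assume those numbers have a size relationship.
get_dummies handles features such as color and sex; this processing is also called one-hot encoding.
For example:
- male: 1 0
- female: 0 1
This is one-hot encoding, the usual way to prepare unordered categorical features for machine learning.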
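To make the contrast concrete, here is a minimal sketch on a made-up table. The column names `rating` and `color`, their values, and the `rating_order` mapping are assumptions for illustration only; they are not part of the Titanic data used below.

```python
import pandas as pd

# Hypothetical toy data: "rating" is ordinal, "color" is nominal.
toy = pd.DataFrame({
    "rating": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "green", "red"],
})

# Ordinal feature: map each level to an integer that preserves the order.
rating_order = {"low": 1, "medium": 2, "high": 3}
toy["rating_code"] = toy["rating"].map(rating_order)

# Nominal feature: one indicator column per category, no implied order.
color_dummies = pd.get_dummies(toy["color"], prefix="color")

print(toy)
print(color_dummies)
```

Each row of `color_dummies` has exactly one active column, which is where the name one-hot comes from.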
1. Load the Titanic dataset
import pandas as pd
df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
df_train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df_train.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
df_train.head()
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
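Feature overview:
- Numeric feature: Fare
- Ordered categorical feature: Age (strictly a continuous number; it is treated as ordered here, so it stays numeric)
- Unordered categorical features: PassengerId, Pclass, Sex, SibSp, Parch, Embarked (PassengerId is only a row identifier and is not encoded below)
Survived is the label to predict.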
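One way to support a grouping like this is to look at each column's dtype, number of distinct values, and number of missing values. The sketch below uses a small hand-made frame (the rows are assumptions that only mimic the Titanic columns), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical rows that mimic three of the Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# One summary row per column.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),       # missing values are not counted by default
    "n_missing": toy.isna().sum(),
})
print(summary)
```

Columns with few distinct values are candidates for get_dummies, while columns with many distinct numeric values are usually left as numbers.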
2. Ordered categorical features can be handled as numbers
# Fill missing values with the mean age
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
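The mean-fill step can be checked on a tiny Series. The values below are made up for illustration; the point is only that `mean()` skips missing values by default, so the fill value is computed from the known ages.

```python
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, None, 26.0, 36.0], name="Age")

# mean() ignores NaN, so the fill value is (22 + 26 + 36) / 3 = 28.0
filled = ages.fillna(ages.mean())
print(filled.tolist())   # [22.0, 28.0, 26.0, 36.0]
```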
3. Unordered categorical features can be encoded with get_dummies
This is simply one-hot encoding.
# Encode a single Series
pd.get_dummies(df_train["Sex"]).head()
| | female | male |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 1 | 0 |
3 | 1 | 0 |
4 | 0 | 1 |
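Note: one-hot encoding usually drops one of the generated columns; otherwise you run into the dummy variable trap. In this dataset a passenger who is not male is female, so either column can be derived from the other. See https://www.geeksforgeeks.org/ml-dummy-variable-trap-in-regression-models/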
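The redundancy is easy to see on a small example. This is a sketch on a made-up Series, not the Titanic data: with both columns kept, every row sums to 1, and `drop_first=True` keeps only the column that is still needed.

```python
import pandas as pd

# Hypothetical values for illustration.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Cast to int so the result looks the same across pandas versions.
full = pd.get_dummies(sex).astype(int)
reduced = pd.get_dummies(sex, drop_first=True).astype(int)

# Each row sums to 1, so female is always 1 - male.
print((full["female"] + full["male"]).tolist())   # [1, 1, 1, 1]
print(list(reduced.columns))                       # ['male']
```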
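# Shortcut: encode several columns of the DataFrame in one call
needcode_cat_columns = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]
df_coded = pd.get_dummies(
    df_train,
    # columns to encode
    columns=needcode_cat_columns,
    # prefixes for the generated column names
    prefix=needcode_cat_columns,
    # also encode missing values as their own column
    dummy_na=True,
    # drop the first level of each feature (avoids the dummy variable trap)
    drop_first=True,
)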
df_coded.head()
| | PassengerId | Survived | Age | Fare | Pclass_2.0 | Pclass_3.0 | Pclass_nan | Sex_male | Sex_nan | SibSp_1.0 | ... | Parch_1.0 | Parch_2.0 | Parch_3.0 | Parch_4.0 | Parch_5.0 | Parch_6.0 | Parch_nan | Embarked_Q | Embarked_S | Embarked_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 22.0 | 7.2500 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 2 | 1 | 38.0 | 71.2833 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | 1 | 26.0 | 7.9250 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 4 | 1 | 35.0 | 53.1000 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 5 | 0 | 35.0 | 8.0500 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 26 columns
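The effect of `dummy_na` and `drop_first` together is easier to follow on a frame with one encoded column. The sketch below uses made-up rows with a single missing `Embarked` value: the first level (`C`) is dropped, and the missing value gets its own indicator column at the end.

```python
import pandas as pd

# Hypothetical rows; one Embarked value is missing.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Embarked": ["S", "C", None, "Q"],
})

coded = pd.get_dummies(
    toy,
    columns=["Embarked"],
    prefix=["Embarked"],
    dummy_na=True,     # add an indicator column for missing values
    drop_first=True,   # drop the first level, here "C"
)
print(coded.columns.tolist())
print(coded)
```

A row whose indicators are all zero then stands for the dropped level, which is how the second row of the Titanic output above (Embarked `C`) is represented.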
4. Train a machine-learning model
y = df_coded.pop("Survived")
y.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
X = df_coded
X.head()
| | PassengerId | Age | Fare | Pclass_2.0 | Pclass_3.0 | Pclass_nan | Sex_male | Sex_nan | SibSp_1.0 | SibSp_2.0 | ... | Parch_1.0 | Parch_2.0 | Parch_3.0 | Parch_4.0 | Parch_5.0 | Parch_6.0 | Parch_nan | Embarked_Q | Embarked_S | Embarked_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 22.0 | 7.2500 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 2 | 38.0 | 71.2833 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | 26.0 | 7.9250 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 4 | 35.0 | 53.1000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 5 | 35.0 | 8.0500 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 25 columns
from sklearn.linear_model import LogisticRegression
# Create the model object
logreg = LogisticRegression(solver='liblinear')
# Train the model
logreg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
logreg.score(X, y)
0.8148148148148148
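The score above is computed on the same rows the model was trained on. As a rough sketch of how accuracy on unseen rows could be measured, the example below holds out part of the data. It uses synthetic data from `make_classification` rather than the Titanic frame, and the names `X_demo`, `y_demo`, and the 25% split are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an encoded feature matrix and a binary label.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Keep 25% of the rows aside for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = LogisticRegression(solver="liblinear")
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)   # accuracy on rows seen during training
test_acc = model.score(X_te, y_te)    # accuracy on held-out rows
print(train_acc, test_acc)
```

With the Titanic data, the same split could be applied to `X` and `y` before calling `fit`.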