27 How to find the features that most affect the outcome with Pandas
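Use cases:
- Feature selection in machine learning: removing uninformative features can improve model performance, reduce training time, and so on
- Data analysis: finding the factors that contribute most to a change, for example fluctuations in revenue
Worked example: which factors most affected survival in the sinking of the Titanic?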
1. Import the required packages
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
2. Load the Titanic data
df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
3. Data cleaning and transformation
3.1 Check which columns have missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
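`df.info()` reports non-null counts, so the gaps have to be read off by subtraction (891 - 714 = 177 missing values for Age, 2 for Embarked). A minimal sketch of a more direct check follows; it uses a small made-up frame named `toy` (an assumption for illustration) rather than the Titanic file:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, then keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data the same two lines can be applied to `df` directly.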
3.2 Fill missing Age values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())
df.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
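The code above fills with `median()`, not `mean()`. The two can differ noticeably when a column has outliers, which is one common reason to prefer the median. A small sketch with made-up ages (the series `ages` is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier (80) and one missing value.
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan])

# The mean is pulled up by the outlier; the median is not.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```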
3.3 Convert the Sex column to numbers
df.Sex.unique()
array(['male', 'female'], dtype=object)
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1
df.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | S |
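Assigning through `df.loc` works here, but it leaves the column with its original `object` dtype. One alternative, sketched below on a made-up frame (`toy` is hypothetical), is `Series.map` with a dictionary, which does the conversion in one step and yields an integer column:

```python
import pandas as pd

# Made-up column with the same two labels as the Titanic data.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each label to its numeric code in a single pass.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy["Sex"].tolist(), toy["Sex"].dtype)
```

Any label missing from the dictionary would become NaN, so it is worth checking `unique()` first, as the article does.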
3.4 Fill missing values in the Embarked column and convert the strings to numbers
df.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)
df["Embarked"] = df["Embarked"].fillna(0)
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3
df.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 1 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 1 |
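The codes 0 to 3 are arbitrary labels: the ports have no natural order, yet a numeric column suggests that Q (3) is "more" than S (1), and the chi-squared score computed later depends on those magnitudes. A possible alternative, shown here only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`, which gives one 0/1 column per port:

```python
import numpy as np
import pandas as pd

# Made-up rows covering the three ports plus one missing value.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", np.nan, "S"]})

# One indicator column per port; a missing port becomes a row of zeros.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

print(dummies)
```

Each indicator column would then get its own score, at the cost of a wider feature table.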
4. Split the feature columns from the target column
y = df.pop("Survived")
X = df
X.head()
|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 |
| 2 | 3 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 |
| 4 | 5 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 1 |
y.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int64
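Note that `df.pop("Survived")` removes the column from `df` in place, and `X = df` is the same object rather than a copy. A non-mutating variant is sketched below on a made-up frame. It also leaves out `PassengerId`, on the assumption that a row identifier carries no real signal; that is a modelling choice, and the article itself keeps the column.

```python
import pandas as pd

# Made-up frame with the same kind of columns as the article's df.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
})

# Select the target and build the feature table without modifying `toy`.
y = toy["Survived"]
X = toy.drop(columns=["PassengerId", "Survived"])

print(list(X.columns), list(toy.columns))
```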
5. Use the chi-squared test to select the top-K features
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
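To see what the score measures, it can help to compute it by hand. scikit-learn's `chi2` expects non-negative features and treats each one like a count: it sums the feature within each class (observed), compares that with the sum expected if the feature were spread across classes in proportion to class size, and adds up (observed - expected)² / expected. The sketch below reproduces this on made-up data; the array values and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up data: 6 samples, 2 non-negative features, binary target.
X = np.array([[1.0, 10.0],
              [0.0, 12.0],
              [1.0, 30.0],
              [1.0, 28.0],
              [0.0, 11.0],
              [1.0, 31.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# One-hot class membership, shape (n_samples, n_classes).
Y = np.stack([y == 0, y == 1], axis=1).astype(float)

observed = Y.T @ X                      # per-class sum of each feature
feature_total = X.sum(axis=0)           # overall sum of each feature
class_prob = Y.mean(axis=0)             # share of samples in each class
expected = np.outer(class_prob, feature_total)

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X, y)

print(manual_scores, sklearn_scores)
```

Because `k=len(X.columns)` keeps every column, the `SelectKBest` call in this step is used only to obtain the scores, not to discard features.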
6. Print the features in order of importance
df_scores = pd.DataFrame(fit.scores_)
df_scores
|   | 0 |
|---|---|
| 0 | 3.312934 |
| 1 | 30.873699 |
| 2 | 170.348127 |
| 3 | 21.649163 |
| 4 | 2.581865 |
| 5 | 10.097499 |
| 6 | 4518.319091 |
| 7 | 2.771019 |
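One caveat when reading these raw scores: the statistic grows linearly with the unit of the feature, so a column with large values such as Fare can score far higher than a 0/1 column for reasons of scale alone. The sketch below illustrates this using the five Fare and Survived values shown in `df.head()`; expressing the same fares in a unit 100 times smaller multiplies the score by 100.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Fare and Survived for the five rows shown in df.head().
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])
y = np.array([0, 1, 1, 1, 0])

# Same information in two units: the original fare and the fare times 100.
X = np.column_stack([fare, fare * 100])

scores, p_values = chi2(X, y)
ratio = scores[1] / scores[0]

print(scores, ratio)
```

Putting features on a comparable scale, or binning continuous columns into categories, are common ways to make the ranking less dependent on units; neither is done in this article.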
df_columns = pd.DataFrame(X.columns)
df_columns
|   | 0 |
|---|---|
| 0 | PassengerId |
| 1 | Pclass |
| 2 | Sex |
| 3 | Age |
| 4 | SibSp |
| 5 | Parch |
| 6 | Fare |
| 7 | Embarked |
df_feature_scores = pd.concat([df_columns,df_scores],axis=1)
df_feature_scores.columns = ['feature_name','Score']
df_feature_scores
|   | feature_name | Score |
|---|---|---|
| 0 | PassengerId | 3.312934 |
| 1 | Pclass | 30.873699 |
| 2 | Sex | 170.348127 |
| 3 | Age | 21.649163 |
| 4 | SibSp | 2.581865 |
| 5 | Parch | 10.097499 |
| 6 | Fare | 4518.319091 |
| 7 | Embarked | 2.771019 |
df_feature_scores.sort_values(by="Score", ascending=False)
|   | feature_name | Score |
|---|---|---|
| 6 | Fare | 4518.319091 |
| 2 | Sex | 170.348127 |
| 1 | Pclass | 30.873699 |
| 3 | Age | 21.649163 |
| 5 | Parch | 10.097499 |
| 0 | PassengerId | 3.312934 |
| 7 | Embarked | 2.771019 |
| 4 | SibSp | 2.581865 |
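The ranking table can also be built in one step, with the feature names as the index, and the fitted selector exposes p-values (`pvalues_`) and the chosen columns (`get_support()`) as well. A compact sketch on made-up data follows; the values and `k=2` are assumptions for illustration, not the article's settings.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up features and target; Sex is constructed to match y exactly.
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex": [0, 1, 1, 1, 0, 0, 1, 0],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 1],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0], name="Survived")

# Keep the two highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Scores and p-values indexed by feature name, highest score first.
ranking = pd.DataFrame(
    {"Score": selector.scores_, "p_value": selector.pvalues_},
    index=X.columns,
).sort_values("Score", ascending=False)

# Names of the columns the selector would keep, in original column order.
selected = X.columns[selector.get_support()].tolist()

print(ranking)
print(selected)
```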