27 Pandas: How to Find the Features That Most Influence the Outcome

Author: Viterbi | Published: 2022-11-16 13:05

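Use cases:

  • Feature selection in machine learning: removing uninformative features can improve model performance, shorten training time, and so on
  • Data analysis: finding the factors that contribute most to fluctuations in revenue

Worked example: which factors most affected survival in the sinking of the Titanic?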

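1. Import the required packages

import pandas as pd
import numpy as np

# Select the K features that most influence the outcome
from sklearn.feature_selection import SelectKBest

# Chi-squared test, passed to SelectKBest as the scoring function
from sklearn.feature_selection import chi2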

2. Load the Titanic data

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()

       PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
    0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
    1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
    2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
    3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
    4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S

df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()

       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S

3. Clean and transform the data

3.1 Check for columns with missing values

df.info()


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Fare           891 non-null float64
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(2)
    memory usage: 62.8+ KB
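
The non-null counts above show gaps in Age (714 of 891) and Embarked (889 of 891). As a minimal sketch, the same check can be made by counting nulls per column directly. The small `demo` frame is made-up stand-in data, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data (hypothetical values, not the Titanic CSV).
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column and keep only the columns that have any.
missing = demo.isna().sum()
missing = missing[missing > 0]
print(missing)
```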

3.2 Fill missing Age values with the median

df["Age"] = df["Age"].fillna(df["Age"].median())

df.head()

       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S
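
The fill above uses `median()` rather than `mean()`. One plausible reason is that the median is less sensitive to a few unusually high or low ages. A small sketch of the difference, using a made-up series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier and one missing value (illustrative only).
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 80.0])

filled_median = ages.fillna(ages.median())  # median of the five known values
filled_mean = ages.fillna(ages.mean())      # mean is pulled up by the outlier

print(filled_median.iloc[4], filled_mean.iloc[4])
```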

3.3 Convert the Sex column to numbers

# Sex
df.Sex.unique()


    array(['male', 'female'], dtype=object)



df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

df.head()

       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         S
    1            2         1       1    1  38.0      1      0  71.2833         C
    2            3         1       3    1  26.0      0      0   7.9250         S
    3            4         1       1    1  35.0      1      0  53.1000         S
    4            5         0       3    0  35.0      0      0   8.0500         S

3.4 Fill missing values in the Embarked column and convert its strings to numbers

# Embarked
df.Embarked.unique()


    array(['S', 'C', 'Q', nan], dtype=object)


# Fill missing values with 0
df["Embarked"] = df["Embarked"].fillna(0)

# Convert the strings to numbers
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3

df.head()

       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         1
    1            2         1       1    1  38.0      1      0  71.2833         2
    2            3         1       3    1  26.0      0      0   7.9250         1
    3            4         1       1    1  35.0      1      0  53.1000         1
    4            5         0       3    0  35.0      0      0   8.0500         1
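
The integer codes used above for Embarked (0 for missing, then 1, 2, 3) are arbitrary and imply an ordering among the ports that the data does not have. As a hedged alternative sketch, a two-valued column can be encoded with `Series.map` and a multi-valued one with `pd.get_dummies`. The `demo` frame is made-up stand-in data, and this is not the encoding the rest of this article uses.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows (hypothetical values, not the Titanic CSV).
demo = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Two-valued column: map each label to 0/1 in one step.
demo["Sex"] = demo["Sex"].map({"male": 0, "female": 1})

# Multi-valued column: one 0/1 indicator column per port, so no
# artificial ordering between ports is implied. A row with a missing
# port gets 0 in every indicator column.
ports = pd.get_dummies(demo["Embarked"], prefix="Embarked").astype(int)
encoded = pd.concat([demo.drop(columns="Embarked"), ports], axis=1)
print(encoded)
```

With this encoding each port gets its own score in a later chi-squared ranking, instead of one score that depends on which integer was assigned to which port.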

4. Split the feature columns from the target column

y = df.pop("Survived")
X = df

X.head()

       PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1       3    0  22.0      1      0   7.2500         1
    1            2       1    1  38.0      1      0  71.2833         2
    2            3       3    1  26.0      0      0   7.9250         1
    3            4       1    1  35.0      1      0  53.1000         1
    4            5       3    0  35.0      0      0   8.0500         1

y.head()




    0    0
    1    1
    2    1
    3    1
    4    0
    Name: Survived, dtype: int64
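
Note that `df.pop` removes the column from `df` in place, which is why `X = df` no longer contains Survived. If the original frame should stay intact, a non-mutating split with `DataFrame.drop` is one option. The sketch below uses made-up data, and the names `X_demo` and `y_demo` are illustrative only.

```python
import pandas as pd

# Made-up stand-in data (hypothetical values).
demo = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
})

# Non-mutating split: `demo` keeps all of its columns.
y_demo = demo["Survived"]
X_demo = demo.drop(columns="Survived")

print(list(X_demo.columns), list(demo.columns))
```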

5. Use the chi-squared test to select the top-K features

# Select all features so that the full importance ranking is visible
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
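
To make the score easier to interpret, here is a rough sketch of what scikit-learn's `chi2` computes. It treats each feature as non-negative counts, sums the feature within each class (observed), and compares those sums with what would be expected if the feature were independent of the class. The helper `manual_chi2` and the data are made up for illustration. One consequence shown at the end is that expressing a feature in larger units scales its score by the same factor, so raw scores of features on very different scales are not directly comparable.

```python
import numpy as np
from sklearn.feature_selection import chi2

def manual_chi2(X, y):
    """Recompute the chi-squared statistic per feature (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    # One indicator column per class.
    Y = np.stack([(y == c).astype(float) for c in classes], axis=1)
    observed = Y.T @ X                       # per-class sum of each feature
    class_prob = Y.mean(axis=0)              # share of samples in each class
    feature_total = X.sum(axis=0)            # overall sum of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Made-up, non-negative data (hypothetical values).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = np.column_stack([
    y_demo + rng.integers(0, 2, size=200),   # related to the class
    rng.integers(0, 5, size=200),            # unrelated noise
])

sk_scores, p_values = chi2(X_demo, y_demo)
my_scores = manual_chi2(X_demo, y_demo)
scaled_scores, _ = chi2(X_demo * 10, y_demo)  # same data in 10x larger units
print(sk_scores, my_scores, scaled_scores)
```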

6. Print the feature list ranked by importance

df_scores = pd.DataFrame(fit.scores_)
df_scores

                 0
    0     3.312934
    1    30.873699
    2   170.348127
    3    21.649163
    4     2.581865
    5    10.097499
    6  4518.319091
    7     2.771019

df_columns = pd.DataFrame(X.columns)
df_columns

                 0
    0  PassengerId
    1       Pclass
    2          Sex
    3          Age
    4        SibSp
    5        Parch
    6         Fare
    7     Embarked

# Combine the two DataFrames side by side
df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name', 'Score']

# Inspect the result
df_feature_scores

       feature_name        Score
    0   PassengerId     3.312934
    1        Pclass    30.873699
    2           Sex   170.348127
    3           Age    21.649163
    4         SibSp     2.581865
    5         Parch    10.097499
    6          Fare  4518.319091
    7      Embarked     2.771019

df_feature_scores.sort_values(by="Score", ascending=False)

       feature_name        Score
    6          Fare  4518.319091
    2           Sex   170.348127
    1        Pclass    30.873699
    3           Age    21.649163
    5         Parch    10.097499
    0   PassengerId     3.312934
    7      Embarked     2.771019
    4         SibSp     2.581865
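
The ranking above can also be built in fewer steps. The sketch below indexes the scores by feature name, adds the p-values that a fitted `SelectKBest` exposes as `pvalues_`, and uses `get_support()` to list the features that would be kept. The data and column names (`informative`, `noise_a`, `noise_b`) are made up, so the numbers it prints say nothing about the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up, non-negative data (hypothetical column names and values).
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.integers(0, 2, size=300), name="target")
X_demo = pd.DataFrame({
    "informative": y_demo * 3 + rng.integers(0, 2, size=300),
    "noise_a": rng.integers(0, 4, size=300),
    "noise_b": rng.integers(0, 4, size=300),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Scores and p-values indexed by feature name, highest score first.
ranking = pd.DataFrame(
    {"Score": selector.scores_, "p_value": selector.pvalues_},
    index=X_demo.columns,
).sort_values(by="Score", ascending=False)

# Names of the k features the selector would keep.
selected = list(X_demo.columns[selector.get_support()])
print(ranking)
print(selected)
```

Reading the p-value next to the score gives a second view of each feature: a very small p-value suggests the association with the target is unlikely to be chance alone.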
