27 Pandas: How to Find the Features That Most Influence the Outcome


Author: Viterbi | Source: published 2022-11-16 13:05


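    Use cases:

    • Feature selection in machine learning: removing uninformative features can improve model quality and reduce training time
    • Data analysis: identifying the factors that contribute most to, for example, fluctuations in revenue

    Worked example: in the sinking of the Titanic, which factors most affected whether a passenger survived?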

    1. Import the required packages

    import pandas as pd
    import numpy as np
    
    # Select the K features that most influence the outcome
    from sklearn.feature_selection import SelectKBest
    
    # Chi-squared test, passed to SelectKBest as the scoring function
    from sklearn.feature_selection import chi2
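
Before applying these two imports to the Titanic data, it may help to see what they do on a tiny array. The sketch below uses made-up numbers (`X_toy` and `y_toy` are names introduced here, not part of the original notebook): feature 0 follows the label closely, feature 1 is constant, and feature 2 is only weakly related.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data invented for illustration: 6 samples, 3 non-negative features.
X_toy = np.array([
    [5, 1, 2],
    [6, 1, 3],
    [5, 1, 2],
    [0, 1, 3],
    [1, 1, 2],
    [0, 1, 3],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# Keep only the single highest-scoring feature.
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

print(selector.scores_)         # one chi-squared score per feature
print(selector.get_support())   # boolean mask of the selected features
```

The constant feature scores zero, and the mask picks out feature 0 only.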
    

    2. Load the Titanic data

    df = pd.read_csv("./datas/titanic/titanic_train.csv")
    df.head()
    
       PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
    0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
    1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
    2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
    3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
    4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
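
For readers who do not have the CSV at hand, the five rows printed above can be typed in as a small DataFrame to try the later steps on. This is only a convenience sketch: `df_sample` is a name introduced here, and five rows are far too few for meaningful scores.

```python
import numpy as np
import pandas as pd

# The five rows printed above, entered by hand.
# `df_sample` is a hypothetical name, not part of the original notebook.
df_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",  # truncated in the printout
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

print(df_sample.shape)
```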
    
    df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
    df.head()
    
    
       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S

    3. Data cleaning and transformation

    3.1 Check for columns with missing values

    df.info()
    
    
        <class 'pandas.core.frame.DataFrame'>
        RangeIndex: 891 entries, 0 to 890
        Data columns (total 9 columns):
        PassengerId    891 non-null int64
        Survived       891 non-null int64
        Pclass         891 non-null int64
        Sex            891 non-null object
        Age            714 non-null float64
        SibSp          891 non-null int64
        Parch          891 non-null int64
        Fare           891 non-null float64
        Embarked       889 non-null object
        dtypes: float64(2), int64(5), object(2)
        memory usage: 62.8+ KB
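
In the output above, Age has 714 non-null values out of 891 (177 missing) and Embarked has 889 (2 missing). A more direct way to get those counts is `isna().sum()`. The sketch below shows the idea on a small made-up frame (`demo` is a hypothetical name, and the values are not the Titanic data).

```python
import numpy as np
import pandas as pd

# Small invented frame with gaps in two of its three columns.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column and keep only columns that have any.
missing = demo.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```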
    

    3.2 Fill missing Age values with the median

    df["Age"] = df["Age"].fillna(df["Age"].median())
    
    df.head()
    
       PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    male  22.0      1      0   7.2500         S
    1            2         1       1  female  38.0      1      0  71.2833         C
    2            3         1       3  female  26.0      0      0   7.9250         S
    3            4         1       1  female  35.0      1      0  53.1000         S
    4            5         0       3    male  35.0      0      0   8.0500         S
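
The code in this step fills with the median rather than the mean. The difference matters when a column has outliers, as the following sketch on made-up ages suggests (`ages` is a hypothetical Series, not taken from the dataset).

```python
import pandas as pd

# Invented ages with one missing value and one unusually large value.
ages = pd.Series([22.0, 38.0, 26.0, None, 80.0])

median_filled = ages.fillna(ages.median())  # median of 22, 26, 38, 80
mean_filled = ages.fillna(ages.mean())      # mean is pulled up by the 80

print(median_filled[3], mean_filled[3])
```

Here the median fill gives 32.0 while the mean fill gives 41.5, because the single large value drags the mean upward.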

    3.3 Convert the Sex column to numbers

    # Distinct values in the Sex column
    df.Sex.unique()
    
    
        array(['male', 'female'], dtype=object)
    
    
    
    # Map the two categories to integers: male -> 0, female -> 1
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    
    df.head()
    
       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         S
    1            2         1       1    1  38.0      1      0  71.2833         C
    2            3         1       3    1  26.0      0      0   7.9250         S
    3            4         1       1    1  35.0      1      0  53.1000         S
    4            5         0       3    0  35.0      0      0   8.0500         S
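
When converting a text column to numbers with a lookup dictionary, it is worth checking that no value slipped through unmapped. A minimal sketch, assuming a made-up Series `sex` that contains one stray spelling:

```python
import pandas as pd

# Invented values; note the stray capitalised "Male".
sex = pd.Series(["male", "female", "female", "Male"])

# Values missing from the dictionary become NaN after map().
codes = sex.map({"male": 0, "female": 1})

# List the original values that were not covered by the mapping.
unmapped = sex[codes.isna()].unique()
print(list(unmapped))
```

On the Titanic data the Sex column only contains 'male' and 'female', as the `unique()` output shows, so this check would come back empty there.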

    3.4 Fill missing values in the Embarked column and convert the strings to numbers

    # Distinct values in the Embarked column
    df.Embarked.unique()
    
    
        array(['S', 'C', 'Q', nan], dtype=object)
    
    
    # Convert the strings to numbers (S -> 1, C -> 2, Q -> 3),
    # then fill the missing values with 0
    df["Embarked"] = df["Embarked"].map({"S": 1, "C": 2, "Q": 3}).fillna(0).astype(int)
    
    df.head()
    
       PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1         0       3    0  22.0      1      0   7.2500         1
    1            2         1       1    1  38.0      1      0  71.2833         2
    2            3         1       3    1  26.0      0      0   7.9250         1
    3            4         1       1    1  35.0      1      0  53.1000         1
    4            5         0       3    0  35.0      0      0   8.0500         1
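
The codes 0 to 3 assigned to Embarked are arbitrary, since the ports have no natural order, and a score computed on them will depend on which port gets which number. One common alternative, shown here as a sketch rather than as the article's method, is one-hot encoding with `pd.get_dummies` (the Series `embarked` is made up for the example).

```python
import pandas as pd

# Invented sample with one missing port.
embarked = pd.Series(["S", "C", "S", None, "Q"], name="Embarked")

# One 0/1 column per port, plus one column flagging missing values.
dummies = pd.get_dummies(embarked, prefix="Embarked", dummy_na=True, dtype=int)
print(dummies)
```

Each row has exactly one 1, and the resulting 0/1 columns are non-negative, so they can be passed to chi2 directly.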

    4. Split the feature columns from the target column

    y = df.pop("Survived")
    X = df
    
    X.head()
    
       PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
    0            1       3    0  22.0      1      0   7.2500         1
    1            2       1    1  38.0      1      0  71.2833         2
    2            3       3    1  26.0      0      0   7.9250         1
    3            4       1    1  35.0      1      0  53.1000         1
    4            5       3    0  35.0      0      0   8.0500         1

    y.head()
    
    
    
    
        0    0
        1    1
        2    1
        3    1
        4    0
        Name: Survived, dtype: int64
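
Note that `df.pop` removes the column in place, so after the split `df` and `X` refer to the same frame without Survived. If the original frame should stay intact, a non-mutating split is possible. A minimal sketch on a made-up three-row frame (`df_demo`, `X_demo` and `y_demo` are names introduced here):

```python
import pandas as pd

# Invented three-row frame standing in for the cleaned Titanic data.
df_demo = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925],
})

# drop() returns a new frame, so df_demo keeps its Survived column.
X_demo = df_demo.drop(columns=["Survived"])
y_demo = df_demo["Survived"]

print(list(X_demo.columns), list(df_demo.columns))
```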
    
    

    5. Use the chi-squared test to select the top-K features

    # Select all features so that the full importance ranking is visible
    bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
    fit = bestfeatures.fit(X, y)
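
Besides `scores_`, a fitted SelectKBest also exposes `pvalues_`, and `k="all"` is a shorthand for keeping every feature. The sketch below illustrates both on synthetic data; the column names `informative` and `noise` and the random data are assumptions made for the example, not part of the Titanic workflow.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)

# Synthetic non-negative features: one tied to the label, one pure noise.
X_demo = pd.DataFrame({
    "informative": y_demo * 3 + rng.integers(0, 2, size=200),
    "noise": rng.integers(0, 5, size=200),
})

# k="all" keeps every feature, which is convenient for ranking.
selector = SelectKBest(score_func=chi2, k="all").fit(X_demo, y_demo)

# Scores and p-values side by side, highest score first.
report = pd.DataFrame({
    "feature_name": X_demo.columns,
    "Score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values(by="Score", ascending=False)
print(report)
```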
    

    6. Print the feature list in order of importance

    df_scores = pd.DataFrame(fit.scores_)
    df_scores
    
                 0
    0     3.312934
    1    30.873699
    2   170.348127
    3    21.649163
    4     2.581865
    5    10.097499
    6  4518.319091
    7     2.771019

    df_columns = pd.DataFrame(X.columns)
    df_columns
    
                 0
    0  PassengerId
    1       Pclass
    2          Sex
    3          Age
    4        SibSp
    5        Parch
    6         Fare
    7     Embarked

    # Concatenate the two DataFrames side by side
    df_feature_scores = pd.concat([df_columns,df_scores],axis=1)
    # Name the columns
    df_feature_scores.columns = ['feature_name','Score']
    
    # Inspect the result
    df_feature_scores
    
      feature_name        Score
    0  PassengerId     3.312934
    1       Pclass    30.873699
    2          Sex   170.348127
    3          Age    21.649163
    4        SibSp     2.581865
    5        Parch    10.097499
    6         Fare  4518.319091
    7     Embarked     2.771019

    df_feature_scores.sort_values(by="Score", ascending=False)
    
      feature_name        Score
    6         Fare  4518.319091
    2          Sex   170.348127
    1       Pclass    30.873699
    3          Age    21.649163
    5        Parch    10.097499
    0  PassengerId     3.312934
    7     Embarked     2.771019
    4        SibSp     2.581865
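
One caveat when reading this ranking: scikit-learn's chi2 is intended for non-negative features such as booleans or counts, and the statistic grows in proportion to the magnitude of the feature. Fare has by far the largest values among these columns, which probably contributes to its very large score. The sketch below uses synthetic data (an assumption, not the Titanic columns) to show that multiplying a feature by 100 multiplies its chi2 score by 100, even though the relationship to the label is unchanged.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=500)

# Synthetic positive feature that is weakly related to the label.
x = rng.uniform(1.0, 2.0, size=500) + 0.2 * y_demo

# Same feature, scored once as-is and once multiplied by 100.
score_small, _ = chi2(x.reshape(-1, 1), y_demo)
score_large, _ = chi2((x * 100).reshape(-1, 1), y_demo)

ratio = score_large[0] / score_small[0]
print(ratio)
```

For that reason it can be safer to compare chi2 scores only among features on a similar scale, or to bin continuous columns such as Fare and Age first. PassengerId is an identifier rather than a real attribute, so its score is best ignored.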
