
Python Data Analysis and Machine Learning 25 - Random Forest Project Walkthrough

Author: 只是甲 | Published 2022-07-23 09:42

    1. Dataset Introduction

    We will use the Titanic passenger survival dataset.



    Dataset fields:

    1. PassengerId
      Passenger ID

    2. Survived
      Whether the passenger survived: 0 = no, 1 = yes

    3. Pclass
      Ticket class: 1st, 2nd or 3rd

    4. Name
      Passenger name

    5. Sex
      Passenger sex

    6. Age
      Passenger age

    7. SibSp
      Number of siblings/spouses aboard

    8. Parch
      Number of parents/children aboard

    9. Ticket
      Ticket number

    10. Fare
      Ticket fare

    11. Cabin
      Cabin number

    12. Embarked
      Port of embarkation

    2. Data Preprocessing

    2.1 A quick look at the data

    Code:

    import pandas as pd
    
    # Do not limit the number of columns displayed
    pd.set_option('display.max_columns', None)
    
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    # Print the number of rows and columns
    print("###############################")
    print(titanic.shape)
    
    # Print the first 5 rows
    print("###############################")
    print(titanic.head(5))
    
    # Print the column names
    print("###############################")
    print(titanic.columns)
    
    # Print summary statistics for the numeric columns
    print("###############################")
    print(titanic.describe())
    

    Output:

    ###############################
    (891, 12)
    ###############################
       PassengerId  Survived  Pclass  \
    0            1         0       3   
    1            2         1       1   
    2            3         1       3   
    3            4         1       1   
    4            5         0       3   
    
                                                      Name     Sex   Age  SibSp  \
    0                              Braund, Mr. Owen Harris    male  22.0      1   
    1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1   
    2                               Heikkinen, Miss. Laina  female  26.0      0   
    3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
    4                             Allen, Mr. William Henry    male  35.0      0   
    
       Parch            Ticket     Fare Cabin Embarked  
    0      0         A/5 21171   7.2500   NaN        S  
    1      0          PC 17599  71.2833   C85        C  
    2      0  STON/O2. 3101282   7.9250   NaN        S  
    3      0            113803  53.1000  C123        S  
    4      0            373450   8.0500   NaN        S  
    ###############################
    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
           'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
          dtype='object')
    ###############################
           PassengerId    Survived      Pclass         Age       SibSp  \
    count   891.000000  891.000000  891.000000  714.000000  891.000000   
    mean    446.000000    0.383838    2.308642   29.699118    0.523008   
    std     257.353842    0.486592    0.836071   14.526497    1.102743   
    min       1.000000    0.000000    1.000000    0.420000    0.000000   
    25%     223.500000    0.000000    2.000000   20.125000    0.000000   
    50%     446.000000    0.000000    3.000000   28.000000    0.000000   
    75%     668.500000    1.000000    3.000000   38.000000    1.000000   
    max     891.000000    1.000000    3.000000   80.000000    8.000000   
    
                Parch        Fare  
    count  891.000000  891.000000  
    mean     0.381594   32.204208  
    std      0.806057   49.693429  
    min      0.000000    0.000000  
    25%      0.000000    7.910400  
    50%      0.000000   14.454200  
    75%      0.000000   31.000000  
    max      6.000000  512.329200  
    

    From this quick look we can see:

    1. The dataset has 891 rows and 12 columns.
    2. The Age column has missing values (714 non-null values out of 891).
    3. Several columns contain strings, which cannot be fed to the models directly (a quick missing-value check is sketched below).
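
    As an optional check (not part of the original walkthrough), the missing values in every column can be counted directly; the file path is assumed to be the same one used above.

    import pandas as pd

    titanic = pd.read_csv("E:/file/titanic_train.csv")
    # Number of missing values per column; Age, Cabin and Embarked are the affected ones
    print(titanic.isnull().sum())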

    2.2 Filling missing values and encoding string columns

    Code:

    import pandas as pd
    
    # Do not limit the number of columns displayed
    pd.set_option('display.max_columns', None)
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # Inspect the summary statistics again
    print(titanic.describe())
    

    Output:

           PassengerId    Survived      Pclass         Sex         Age  \
    count   891.000000  891.000000  891.000000  891.000000  891.000000   
    mean    446.000000    0.383838    2.308642    0.352413   29.361582   
    std     257.353842    0.486592    0.836071    0.477990   13.019697   
    min       1.000000    0.000000    1.000000    0.000000    0.420000   
    25%     223.500000    0.000000    2.000000    0.000000   22.000000   
    50%     446.000000    0.000000    3.000000    0.000000   28.000000   
    75%     668.500000    1.000000    3.000000    1.000000   35.000000   
    max     891.000000    1.000000    3.000000    1.000000   80.000000   
    
                SibSp       Parch        Fare    Embarked  
    count  891.000000  891.000000  891.000000  891.000000  
    mean     0.523008    0.381594   32.204208    0.361392  
    std      1.102743    0.806057   49.693429    0.635673  
    min      0.000000    0.000000    0.000000    0.000000  
    25%      0.000000    0.000000    7.910400    0.000000  
    50%      0.000000    0.000000   14.454200    0.000000  
    75%      1.000000    0.000000   31.000000    1.000000  
    max      8.000000    6.000000  512.329200    2.000000  
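
    An equivalent, slightly more idiomatic way to do the same filling and encoding (an optional variant, not used in the rest of the article) relies on Series.map:

    import pandas as pd

    titanic = pd.read_csv("E:/file/titanic_train.csv")
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

    # Same encoding as above, expressed with Series.map instead of .loc assignments
    titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
    titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

    print(titanic[["Sex", "Embarked"]].head())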
    

    3. Modeling with Linear Regression

    3.1 A simple linear regression

    Code:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    # Read the dataset
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # Feature columns
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    
    # Initialize the estimator
    alg = LinearRegression()
    predictions = []
    
    X = titanic[predictors]
    y = titanic["Survived"]
    
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    
    # Fit the model
    alg.fit(X_train, y_train)
    
    # For a regressor, score() returns R^2, not classification accuracy
    print('R^2 of the linear regression on the held-out split:', alg.score(X_test, y_test))
    

    Output:
    R^2 of the linear regression on the held-out split: 0.5114157737150755

    Analysis:
    Note that LinearRegression.score() returns R², not classification accuracy, and an R² of about 0.51 is poor. To actually classify passengers we have to threshold the regression output at 0.5, as sketched below.
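
    A minimal sketch of that thresholding step, reusing the fitted model and the X_test/y_test split from the script above (so it assumes that script has already been run):

    import numpy as np

    # Turn the continuous regression outputs into 0/1 survival labels
    pred = alg.predict(X_test)
    pred = np.where(pred > 0.5, 1, 0)

    # Fraction of correctly classified passengers on the held-out 10%
    print("Accuracy after thresholding at 0.5:", (pred == y_test.values).mean())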

    3.2 Cross-validation with KFold

    K-fold cross-validation
    KFold provides train/test indices to split the data into training and test sets. It splits the dataset into k consecutive folds (without shuffling by default).
    Each fold is then used once as the validation set while the remaining k - 1 folds form the training set. A tiny illustration of the indices it produces follows.
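
    For intuition, a small self-contained example (not part of the original article) of the indices KFold yields on a toy array of 6 samples:

    import numpy as np
    from sklearn.model_selection import KFold

    X_demo = np.arange(6).reshape(6, 1)

    # Three consecutive folds of two samples each; no shuffling
    for train_idx, test_idx in KFold(n_splits=3, shuffle=False).split(X_demo):
        print("train:", train_idx, "test:", test_idx)
    # train: [2 3 4 5] test: [0 1]
    # train: [0 1 4 5] test: [2 3]
    # train: [0 1 2 3] test: [4 5]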

    Code:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    # Read the dataset
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # Feature columns
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    
    # Initialize the estimator
    alg = LinearRegression()
    predictions = []
    
    X = titanic[predictors]
    y = titanic["Survived"]
    
    # Split the data into 3 folds with KFold
    kf = KFold(n_splits=3, random_state=None, shuffle=False)
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.loc[train_index], X.loc[test_index]
        y_train, y_test = y.loc[train_index], y.loc[test_index]
        alg.fit(X_train, y_train)
        test_predictions = alg.predict(X_test)
        predictions.append(test_predictions)
        print('R^2 of the linear regression on this fold:', alg.score(X_test, y_test))
    
    # The per-fold predictions are separate numpy arrays; concatenate them
    predictions = np.concatenate(predictions, axis=0)
    # Threshold the regression output at 0.5
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0
    # Note: this expression sums the predicted labels at the matching positions,
    # so it only counts correctly predicted survivors, not all correct predictions
    accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
    print(accuracy)
    

    Output:
    R^2 of the linear regression on this fold: 0.33211124320016117
    R^2 of the linear regression on this fold: 0.39263028818478085
    R^2 of the linear regression on this fold: 0.39930463868948773
    0.2615039281705948

    Analysis:
    The final figure of 0.261 is not a real accuracy: since the expression only counts correctly predicted survivors, it understates how well the model does, and simply "flipping" it to 1 - 0.261 = 0.739 is not a valid correction either. The standard accuracy would be the fraction of all matching predictions, i.e. (predictions == titanic["Survived"]).mean(). Either way, the thresholded linear regression is only a modest baseline.

    3.3 Cross-validation with cross_val_score

    Code:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    
    # Read the dataset
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # Feature columns
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    
    # Initialize the estimator
    alg = LinearRegression()
    predictions = []
    
    X = titanic[predictors]
    y = titanic["Survived"]
    
    # 3-fold cross-validation; for a regressor the default score is R^2
    scores = cross_val_score(alg, X, y, cv=3)
    print(scores.mean())
    

    Output:
    0.3746820566914766

    Analysis:
    As in section 3.1, this 0.375 is the mean R² over the three folds rather than an accuracy, so subtracting it from 1 does not give a meaningful score; it simply confirms that plain linear regression fits this problem poorly.
    Still, cross_val_score is far more convenient than writing the KFold loop by hand, and the metric it reports can be chosen explicitly, as sketched below.
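
    A small sketch, assuming the preprocessed titanic DataFrame, X and y from the script above are already in memory, showing how to request an explicit metric instead of relying on the estimator's default score:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # For regressors scikit-learn offers metrics such as negative mean squared error;
    # 'accuracy' is only defined for classifiers, which is why the random forest
    # scores in the next section are directly comparable to a survival rate.
    mse_scores = cross_val_score(LinearRegression(), X, y, cv=3, scoring="neg_mean_squared_error")
    print("mean negative MSE:", mse_scores.mean())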

    4. Modeling with a Random Forest

    4.1 Random forest with cross-validation

    Code:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    
    # Read the dataset
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # Feature columns
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
    
    X = titanic[predictors]
    y = titanic["Survived"]
    
    # Train a random forest and evaluate it with 3-fold cross-validation
    alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
    kf = KFold(n_splits=3, random_state=None, shuffle=False)
    scores = cross_val_score(alg, X, y, cv=kf)
    
    print(scores.mean())
    

    Output:
    0.7856341189674523

    Analysis:
    The random forest with cross-validation already performs reasonably well, but for a binary classification task an accuracy of 0.786 is not particularly high.

    4.2 Tuning the random forest

    Code:

    # Change the random forest parameters from the previous step to the following:
    alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
    

    Output:
    0.8148148148148148

    Analysis:
    Using more trees (n_estimators=100) and slightly stronger split/leaf constraints (min_samples_split=4, min_samples_leaf=2) raises the cross-validated accuracy from 0.786 to about 0.815; trying different parameter combinations is how the model is tuned, and the search can be automated as sketched below.
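
    One way to automate that search (not used in the original article) is GridSearchCV; the parameter grid below is only an illustrative choice, and X and y are assumed to be the ones built in section 4.1:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, KFold

    param_grid = {
        "n_estimators": [10, 50, 100],
        "min_samples_split": [2, 4, 8],
        "min_samples_leaf": [1, 2, 4],
    }

    # Exhaustively evaluate every combination with the same 3-fold split as above
    grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                        cv=KFold(n_splits=3, shuffle=False), scoring="accuracy")
    grid.fit(X, y)

    print("best parameters:", grid.best_params_)
    print("best cross-validated accuracy:", grid.best_score_)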

    4.3 Adding features

    Code:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    import re
    from sklearn.feature_selection import SelectKBest, f_classif
    import matplotlib.pyplot as plt
    
    # Read the dataset
    titanic = pd.read_csv("E:/file/titanic_train.csv")
    
    # Fill missing Age values with the median age
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    
    # Encode the Sex column as 0/1
    # print(titanic["Sex"].unique())
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
    
    # Encode the Embarked column
    # Missing values are filled with the most common port, 'S'
    # print(titanic["Embarked"].unique())
    titanic["Embarked"] = titanic["Embarked"].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
    
    # New features: family size and name length
    titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
    titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
    
    # Extract the title (Mr, Miss, Mrs, ...) from the name
    def get_title(name):
        # Titles consist of capital and lowercase letters and end with a period
        title_search = re.search(r' ([A-Za-z]+)\.', name)
        # If the title exists, extract and return it
        if title_search:
            return title_search.group(1)
        return ""
    
    titles = titanic["Name"].apply(get_title)
    
    # Map each title to a numeric code; rare titles share a code
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
    for k, v in title_mapping.items():
        titles[titles == k] = v
    
    titanic["Title"] = titles
    
    # Feature columns, now including the engineered ones
    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength", "Title"]
    
    # Build the feature matrix and the label vector
    X = titanic[predictors]
    y = titanic["Survived"]
    
    # Univariate feature scores: SelectKBest with the f_classif (ANOVA F-test) statistic
    selector = SelectKBest(f_classif, k=5)
    selector.fit(X, y)
    
    # Smaller p-value -> larger score
    scores = -np.log10(selector.pvalues_)
    
    plt.bar(range(len(predictors)), scores)
    plt.xticks(range(len(predictors)), predictors, rotation='vertical')
    plt.show()
    
    # Retrain the random forest on the extended feature set
    # predictors = ["Pclass", "Sex", "Fare", "Title"]
    alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
    kf = KFold(n_splits=3, random_state=None, shuffle=False)
    scores = cross_val_score(alg, X, y, cv=kf)
    
    print(scores.mean())
    

    Output:
    0.8350168350168351

    (Bar chart of the SelectKBest scores, -log10 of the p-values, for each of the ten features.)
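
    As a complementary view (not in the original article), the fitted random forest itself exposes impurity-based importances; a minimal sketch, assuming the X, y and predictors built in the script above:

    from sklearn.ensemble import RandomForestClassifier

    alg = RandomForestClassifier(random_state=1, n_estimators=100,
                                 min_samples_split=4, min_samples_leaf=2)
    alg.fit(X, y)

    # Impurity-based importance of each feature, highest first
    for name, importance in sorted(zip(predictors, alg.feature_importances_),
                                   key=lambda t: t[1], reverse=True):
        print(f"{name:12s} {importance:.3f}")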

    References:

    1. https://study.163.com/course/introduction.htm?courseId=1003590004#/courseDetail?tab=1
    2. https://www.pythonf.cn/read/128402
