Python Data Analysis and Machine Learning 25: Random Forest Project in Practice

Author: 只是甲 | Published: 2022-07-23 09:42

1. Dataset overview

We use the Titanic passenger survival dataset.


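Columns:

  1. PassengerId
    Passenger ID

  2. Survived
    Whether the passenger survived: 0 = no, 1 = yes

  3. Pclass
    Ticket class: 1st, 2nd or 3rd

  4. Name
    Passenger name

  5. Sex
    Passenger sex

  6. Age
    Passenger age

  7. SibSp
    Number of siblings and spouses aboard

  8. Parch
    Number of parents and children aboard

  9. Ticket
    Ticket number

  10. Fare
    Ticket fare

  11. Cabin
    Cabin number

  12. Embarked
    Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)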

2. Data preprocessing

2.1 A quick look at the data

Code:

import pandas as pd

# Show all columns when printing
pd.set_option('display.max_columns', None)

titanic = pd.read_csv("E:/file/titanic_train.csv")
# Print the number of rows and columns
print("###############################")
print(titanic.shape)

# Print the first 5 rows
print("###############################")
print(titanic.head(5))

# Print the column names
print("###############################")
print(titanic.columns)

# Print summary statistics for the numeric columns
print("###############################")
print(titanic.describe())

Output:

###############################
(891, 12)
###############################
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
###############################
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
###############################
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  

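From the output above:

  1. The dataset has 891 rows and 12 columns.
  2. Age has missing values: only 714 of the 891 rows have an age.
  3. Several columns hold strings (Name, Sex, Ticket, Cabin, Embarked) and have to be encoded or dropped before modeling.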
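
The two checks behind points 2 and 3 can also be made directly instead of being read off describe(). The sketch below uses a small made-up frame (the values are not from the Titanic file) so that it runs without the CSV; on the real data the same two expressions would be applied to titanic.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [None, "C85", None, None],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()

# Columns that are not numeric and therefore need encoding before modeling.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

print(missing_counts)
print(non_numeric)
```

On this toy frame Age has 2 missing values, Cabin has 3, and the non-numeric columns are Sex and Cabin.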

2.2 Preprocessing

Code:

import pandas as pd

# Show all columns when printing
pd.set_option('display.max_columns', None)
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Print the summary statistics again
print(titanic.describe())

Output:

       PassengerId    Survived      Pclass         Sex         Age  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642    0.352413   29.361582   
std     257.353842    0.486592    0.836071    0.477990   13.019697   
min       1.000000    0.000000    1.000000    0.000000    0.420000   
25%     223.500000    0.000000    2.000000    0.000000   22.000000   
50%     446.000000    0.000000    3.000000    0.000000   28.000000   
75%     668.500000    1.000000    3.000000    1.000000   35.000000   
max     891.000000    1.000000    3.000000    1.000000   80.000000   

            SibSp       Parch        Fare    Embarked  
count  891.000000  891.000000  891.000000  891.000000  
mean     0.523008    0.381594   32.204208    0.361392  
std      1.102743    0.806057   49.693429    0.635673  
min      0.000000    0.000000    0.000000    0.000000  
25%      0.000000    0.000000    7.910400    0.000000  
50%      0.000000    0.000000   14.454200    0.000000  
75%      1.000000    0.000000   31.000000    1.000000  
max      8.000000    6.000000  512.329200    2.000000  
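
The same preprocessing is repeated at the top of every script in the following sections. One way to avoid the repetition is to wrap it in a function. The sketch below is only an illustration: the name preprocess and the toy frame are assumptions, not part of the original scripts.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Age imputed and Sex/Embarked encoded as integers.

    Hypothetical helper; it mirrors the steps of section 2.2.
    """
    out = df.copy()
    # Fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # male -> 0, female -> 1
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Missing ports become S, then S -> 0, C -> 1, Q -> 2.
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})
clean = preprocess(toy)
print(clean)
```

With the real file this would be called as preprocess(pd.read_csv(...)), leaving the raw frame untouched.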

3. Analysis with linear regression

3.1 Simple linear regression

Code:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize the model
alg = LinearRegression()

X = titanic[predictors]
y = titanic["Survived"]

# Split into training and test sets (10% held out)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Train
alg.fit(X_train, y_train)

# Print the model score; for LinearRegression, score() is R^2, not accuracy
print('LinearRegression R^2 on the test set:', alg.score(X_test, y_test))

Output:
LinearRegression R^2 on the test set: 0.5114157737150755

Analysis:
LinearRegression.score returns the coefficient of determination (R^2), not classification accuracy, so 0.51 cannot be compared with the 0.5 chance level of a binary classifier. It says that the linear model explains about half of the variance of Survived on the held-out 10%. To judge the model as a classifier, its continuous predictions have to be thresholded first, which is what section 3.2 does.
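
To make the difference between R^2 and accuracy concrete, here is a minimal sketch on synthetic data (not the Titanic file). The data generation, the 0.5 threshold and all variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic binary target that depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

reg = LinearRegression().fit(X, y)
raw = reg.predict(X)              # continuous values, not class labels

r2 = reg.score(X, y)              # R^2, identical to r2_score(y, raw)
labels = (raw > 0.5).astype(int)  # turn the regression output into 0/1 labels
acc = accuracy_score(y, labels)   # share of correctly classified rows

print("R^2:", r2)
print("accuracy:", acc)
```

The two numbers measure different things on different scales, so a given R^2 says little about how often the thresholded predictions are right.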

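3.2 Cross-validation with KFold

K-fold cross-validation
KFold provides train/test indices for splitting the data. It splits the dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as the validation set, while the remaining k - 1 folds form the training set.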
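
The splitting rule described above can be seen on a tiny example. This sketch uses six made-up samples so that the index sets are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six samples, indexed 0..5.
X = np.arange(6).reshape(-1, 1)

# Three consecutive folds, no shuffling.
kf = KFold(n_splits=3, shuffle=False)
folds = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train, test in folds:
    print("train:", train, "test:", test)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Note that split() yields positions, not index labels, so rows of a DataFrame are best selected with iloc.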

Code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Load the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize the model
alg = LinearRegression()
predictions = []

X = titanic[predictors]
y = titanic["Survived"]

# Split the data into 3 folds with KFold
kf = KFold(n_splits=3, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    # split() yields positions, so select rows with iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    alg.fit(X_train, y_train)
    test_predictions = alg.predict(X_test)
    predictions.append(test_predictions)
    print('LinearRegression R^2 on this fold:', alg.score(X_test, y_test))

# The predictions are in three separate numpy arrays; concatenate them
predictions = np.concatenate(predictions, axis=0)
# Threshold at 0.5
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# This sums the predicted values at the positions where prediction == label,
# so it counts only correctly predicted survivors (true positives).
# Accuracy would be (predictions == titanic["Survived"]).mean().
true_positive_share = sum(predictions[predictions == titanic["Survived"]]) / len(predictions)
print(true_positive_share)

Output:
LinearRegression R^2 on this fold: 0.33211124320016117
LinearRegression R^2 on this fold: 0.39263028818478085
LinearRegression R^2 on this fold: 0.39930463868948773
0.2615039281705948

Analysis:
The three per-fold values are R^2 scores, not accuracies. The final 0.2615 is not an accuracy either: the expression sums the thresholded predictions where they equal the label, so matching 0s contribute nothing and only correctly predicted survivors are counted. It says that 26.2% of all passengers (233 of 891) are survivors the model identified. Since 38.4% of passengers survived (the mean of Survived in section 2.1), that is about 68% of the survivors. Subtracting 0.2615 from 1 does not give the accuracy; accuracy is the share of rows where prediction and label agree.
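
The difference between the two expressions is easy to check on a handful of labels. The arrays below are made up for the illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1])

# Same form as the expression in the script above: sum the predicted values
# at the positions where prediction and label agree. Agreeing 0s add nothing,
# so only true positives are counted.
tp_share = y_pred[y_pred == y_true].sum() / len(y_pred)

# Accuracy: share of positions where prediction and label agree.
accuracy = (y_pred == y_true).mean()

print(tp_share)   # 0.2
print(accuracy)   # 0.6
```

Three of the five predictions are correct (accuracy 0.6), but only one of them is a predicted 1, hence 0.2 for the first expression.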

3.3 Cross-validation with cross_val_score

Code:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize the model
alg = LinearRegression()

X = titanic[predictors]
y = titanic["Survived"]

# Cross-validation; with no scoring argument the estimator's own score() is used (R^2 here)
scores = cross_val_score(alg, X, y, cv=3)
print(scores.mean())

Output:
0.3746820566914766

Analysis:
cross_val_score uses the estimator's default scorer, which for LinearRegression is R^2. The result 0.375 is simply the mean of the three per-fold R^2 values printed in section 3.2. It is not an accuracy, and 1 - 0.375 has no meaning as a score. What the number does show is that a linear regression fits the 0/1 target poorly.
That said, cross_val_score is much more convenient than looping over KFold by hand.
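
For a classification target, a classifier together with scoring="accuracy" gives a score that can be read as a hit rate. The sketch below swaps in LogisticRegression, which the original scripts do not use, and runs on synthetic data from make_classification, so its numbers say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (stand-in for X and y above).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000)

# scoring="accuracy" makes the metric explicit. With an integer cv and a
# classifier, scikit-learn uses stratified folds.
scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")

print(scores)
print(scores.mean())
```

Passing scoring explicitly avoids any doubt about which metric a given estimator reports by default.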

4. Analysis with random forest

4.1 Random forest with cross-validation

Code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Load the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

X = titanic[predictors]
y = titanic["Survived"]

# Evaluate a random forest with 3-fold cross-validation
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, random_state=None, shuffle=False)
scores = cross_val_score(alg, X, y, cv=kf)

print(scores.mean())

Output:
0.7856341189674523

Analysis:
RandomForestClassifier's default score is classification accuracy, so this number is a real accuracy. The result is decent, but 0.786 is not high for a binary classification task: always predicting "did not survive" would already score about 0.616, since 38.4% of the passengers survived.

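4.2 Tuning the random forest

Code:

# Change the random forest parameters from the previous step to:
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)

Output:
0.8148148148148148

Analysis:
Tuning the parameters improves the model. With more trees (n_estimators=100) and larger min_samples_split and min_samples_leaf, the mean cross-validated accuracy rises from 0.786 to 0.815. Other parameter combinations can be tested the same way.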
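
Trying parameter combinations by hand can be automated with GridSearchCV. The following is a sketch under assumptions: the data is synthetic (make_classification), and the grid values are only examples built around the two settings used above, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# Example grid; every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=KFold(n_splits=3),
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

best_score_ is the mean cross-validated accuracy of the best combination. Because it was used to pick the parameters, it tends to be slightly optimistic.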

4.3 Adding features

Code:

import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Load the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")

# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Encode Sex as integers: male -> 0, female -> 1
# print(titanic["Sex"].unique())
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

# Encode the port of embarkation as integers: S -> 0, C -> 1, Q -> 2
# Missing values are filled with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# New columns: family size and length of the name
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
titanic["NameLength"] = titanic["Name"].str.len()

# Extract the title (Mr, Miss, ...) from a name
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

titles = titanic["Name"].apply(get_title)

# Map each title to an integer code; rare titles share a code
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
titanic["Title"] = titles.map(title_mapping)

# Feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength", "Title"]

# Prepare the data
X = titanic[predictors]
y = titanic["Survived"]

# Score each feature with a univariate ANOVA F-test
selector = SelectKBest(f_classif, k=5)
selector.fit(X, y)

# Turn p-values into scores: the smaller the p-value, the taller the bar
scores = -np.log10(selector.pvalues_)

plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Train the random forest again, now with the added features
#predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, random_state=None, shuffle=False)
scores = cross_val_score(alg, X, y, cv=kf)

print(scores.mean())

Output:
0.8350168350168351

[Figure: bar chart of -log10(p-value) for each feature, produced by the plotting code above]
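
Instead of reading the strongest features off the chart, the fitted selector can report them through get_support(). The sketch below uses a made-up frame with one informative column, one weak column and one pure-noise column; the column names and the noise levels are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Synthetic features with decreasing relation to the label.
X = pd.DataFrame({
    "signal": y + 0.3 * rng.normal(size=n),   # strongly related to y
    "weak": 0.2 * y + rng.normal(size=n),     # weakly related to y
    "noise": rng.normal(size=n),              # unrelated to y
})

# Keep the single best feature according to the ANOVA F-test.
selector = SelectKBest(f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask over the columns.
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

The list of selected names can then be used as the predictors list when training the model again.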

