1. Dataset Introduction
The dataset we use is the Titanic passenger survival dataset.
Dataset fields:
- PassengerId: passenger ID
- Survived: survived or not (0 = no, 1 = yes)
- Pclass: ticket class (1st, 2nd, or 3rd)
- Name: passenger name
- Sex: passenger sex
- Age: passenger age
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: ticket fare
- Cabin: cabin number
- Embarked: port of embarkation
2. Data Preprocessing
2.1 A Quick Look at the Data
Code:
import pandas as pd
# Do not limit the number of columns displayed
pd.set_option('display.max_columns', None)
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Print the number of rows and columns
print("###############################")
print(titanic.shape)
# Print the first 5 rows
print("###############################")
print(titanic.head(5))
# Print the column names
print("###############################")
print(titanic.columns)
# Print the summary statistics
print("###############################")
print(titanic.describe())
Output:
###############################
(891, 12)
###############################
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
###############################
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
###############################
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
From the output above we can see:
- The dataset has 891 rows and 12 columns
- Age has missing values (only 714 of the 891 rows are non-null)
- Several columns are strings, which are inconvenient for modeling
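The missing values can also be confirmed directly with isnull().sum(); a minimal sketch on a toy frame (assumed values, standing in for the Titanic columns rather than the real file):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps in the same columns the Titanic data has them
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column -- the quickest way to see which
# columns need imputation before modeling
missing = df.isnull().sum()
print(missing)
```

On the real dataset this immediately shows Age, Cabin, and Embarked as the columns with gaps.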
2.2 Data Preprocessing
Code:
import pandas as pd
# Do not limit the number of columns displayed
pd.set_option('display.max_columns', None)
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
# print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
# print(titanic["Embarked"].unique())
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# Look at the summary statistics again
print(titanic.describe())
Output:
PassengerId Survived Pclass Sex Age \
count 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 0.352413 29.361582
std 257.353842 0.486592 0.836071 0.477990 13.019697
min 1.000000 0.000000 1.000000 0.000000 0.420000
25% 223.500000 0.000000 2.000000 0.000000 22.000000
50% 446.000000 0.000000 3.000000 0.000000 28.000000
75% 668.500000 1.000000 3.000000 1.000000 35.000000
max 891.000000 1.000000 3.000000 1.000000 80.000000
SibSp Parch Fare Embarked
count 891.000000 891.000000 891.000000 891.000000
mean 0.523008 0.381594 32.204208 0.361392
std 1.102743 0.806057 49.693429 0.635673
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 7.910400 0.000000
50% 0.000000 0.000000 14.454200 0.000000
75% 1.000000 0.000000 31.000000 1.000000
max 8.000000 6.000000 512.329200 2.000000
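As a side note, each pair of .loc assignments above can be written in one step with Series.map; a minimal sketch on toy columns (assumed values, not the real file):

```python
import pandas as pd

# Toy column standing in for titanic["Sex"]; map() replaces the
# pair of .loc assignments with a single dictionary lookup
sex = pd.Series(["male", "female", "female", "male"])
sex_encoded = sex.map({"male": 0, "female": 1})

# Same idea for the port: fill missing values with the most common
# port S, then map to integers
embarked = pd.Series(["S", "C", None, "Q"])
embarked_encoded = embarked.fillna("S").map({"S": 0, "C": 1, "Q": 2})

print(sex_encoded.tolist())       # [0, 1, 1, 0]
print(embarked_encoded.tolist())  # [0, 1, 0, 2]
```

Both styles produce the same encoded columns; map() just keeps the mapping in one place.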
3. Analysis with Linear Regression
3.1 A Simple Linear Regression
Code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Read the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# Select the feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the model
alg = LinearRegression()
X = titanic[predictors]
y = titanic["Survived"]
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Train
alg.fit(X_train, y_train)
# For a regressor, score() returns R^2, not classification accuracy
print('LinearRegression R^2 score:', alg.score(X_test, y_test))
Output:
LinearRegression R^2 score: 0.5114157737150755
Analysis:
Since LinearRegression is a regressor, score() returns R² (the coefficient of determination), not a classification accuracy, so 0.51 should not be read as "51% correct". To evaluate this model as a classifier, the continuous predictions must be thresholded (for example at 0.5) and the matches counted, which is what the next section does.
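Thresholding the regression output to get a real accuracy looks like this; a minimal sketch on synthetic data (assumed values, not the Titanic file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Titanic features and 0/1 labels
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

reg = LinearRegression().fit(X, y)
# score() on a regressor is R^2, not classification accuracy
r2 = reg.score(X, y)
# To measure accuracy, threshold the continuous predictions at 0.5
pred = (reg.predict(X) > 0.5).astype(int)
accuracy = (pred == y).mean()
print(r2, accuracy)
```

On data like this the accuracy is typically much higher than the R², which is exactly why the two numbers must not be confused.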
3.2 Cross-Validation with KFold
K-Folds cross-validation
provides train/test indices to split the data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default).
Each fold is then used once as the validation set while the remaining k - 1 folds form the training set.
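Before applying it to the Titanic data, the fold layout is easy to see on a toy array:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6)  # six samples, indices 0..5
kf = KFold(n_splits=3, shuffle=False)
for train_idx, test_idx in kf.split(X):
    print(list(train_idx), list(test_idx))
# With shuffle=False the test folds are consecutive blocks:
# [0, 1], then [2, 3], then [4, 5]
```

Each sample appears in exactly one test fold, so concatenating the per-fold predictions covers the whole dataset once.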
Code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
# Read the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# Select the feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the model
alg = LinearRegression()
predictions = []
X = titanic[predictors]
y = titanic["Survived"]
# Use KFold to split the data into 3 folds
kf = KFold(n_splits=3, shuffle=False)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    alg.fit(X_train, y_train)
    test_predictions = alg.predict(X_test)
    predictions.append(test_predictions)
    # score() on a regressor returns R^2 for the fold
    print('LinearRegression R^2 on fold:', alg.score(X_test, y_test))
# The per-fold predictions are separate numpy arrays; concatenate them
predictions = np.concatenate(predictions, axis=0)
# Threshold at 0.5
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the labels
accuracy = (predictions == titanic["Survived"]).mean()
print(accuracy)
Output:
LinearRegression R^2 on fold: 0.33211124320016117
LinearRegression R^2 on fold: 0.39263028818478085
LinearRegression R^2 on fold: 0.39930463868948773
0.2615039281705948
Analysis:
The 0.2615 above was produced by the accuracy expression in the original version of this code, sum(predictions[predictions == titanic["Survived"]]) / len(predictions), which sums the values of the matching predictions and therefore counts only the correctly predicted survivors. "Reversing" the score as 1 - 0.261 = 0.739 is not a valid fix. Counting matches directly, as the corrected line above does, gives an accuracy of roughly 0.78.
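The difference between summing matched prediction values and counting matches can be seen on a toy pair of arrays (assumed values, not the Titanic data):

```python
import numpy as np

# Toy predictions and labels for illustration
predictions = np.array([1, 0, 1, 1, 0, 0])
actual = np.array([1, 0, 0, 1, 1, 0])

# Summing the *values* of the matching predictions means correctly
# predicted 0s contribute nothing, so the score understates accuracy
buggy = sum(predictions[predictions == actual]) / len(predictions)

# The mean of the boolean comparison counts every match -- this is
# the real accuracy
correct = (predictions == actual).mean()

print(buggy, correct)  # 0.333... vs 0.666...
```

Here 4 of 6 predictions match, but only 2 of those matches are 1s, which is why the first formula reports half the true accuracy.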
3.3 Cross-Validation with cross_val_score
Code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Read the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# Select the feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the model
alg = LinearRegression()
X = titanic[predictors]
y = titanic["Survived"]
# Cross-validation; for a regressor the default score is R^2
scores = cross_val_score(alg, X, y, cv=3)
print(scores.mean())
Output:
0.3746820566914766
Analysis:
Again, 0.375 is a mean R² (the default score for a regressor), not an accuracy, so computing 1 - 0.375 = 0.625 is not meaningful.
Still, cross_val_score is much more convenient than writing the KFold loop by hand.
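The metric cross_val_score reports depends on the estimator: a regressor defaults to R², a classifier to accuracy. A minimal sketch on synthetic data (assumed, not the article's file) using an actual classifier with the metric made explicit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features and labels
rng = np.random.RandomState(1)
X = rng.rand(90, 3)
y = (X[:, 0] > 0.5).astype(int)

# With a classifier, cross_val_score defaults to accuracy; the
# scoring argument simply makes the metric explicit
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
print(scores.mean())
```

Passing scoring="accuracy" to a regressor would raise an error, which is another reminder that LinearRegression is the wrong tool for this classification task.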
4. Analysis with Random Forest
4.1 Random Forest + Cross-Validation
Code:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Read the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# Select the feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = titanic[predictors]
y = titanic["Survived"]
# Build the model with a random forest plus cross-validation
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, X, y, cv=kf)
print(scores.mean())
Output:
0.7856341189674523
Analysis:
Random forest with cross-validation gives a decent result, but for a binary classification problem 0.786 is not particularly high.
4.2 Tuning the Random Forest
Code:
# Change the random forest parameters from the previous step to:
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
Output:
0.8148148148148148
Analysis:
By testing different parameter settings, tuning the random forest raises the accuracy from 0.786 to about 0.815.
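Rather than editing parameters by hand, scikit-learn's GridSearchCV tries every combination in a grid under cross-validation. A minimal sketch on synthetic data; the grid values are assumptions chosen around the hand-picked settings above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels
rng = np.random.RandomState(1)
X = rng.rand(120, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hypothetical grid built around the article's hand-picked values
param_grid = {
    "n_estimators": [10, 100],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}
# Fit every combination with 3-fold cross-validation and keep the best
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

best_params_ and best_score_ report the winning combination, so there is no need to rerun the script once per guess.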
4.3 Adding Features
Code:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
# Read the dataset
titanic = pd.read_csv("E:/file/titanic_train.csv")
# Fill missing ages with the median age
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
# Encode the Sex column
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
# Encode the embarkation port
# Fill missing values with the most common port, S
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
# New columns
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
# Extract titles such as Mr and Miss from the name
def get_title(name):
    # Titles consist of capital and lowercase letters and end with a period
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If a title exists, extract and return it
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic["Name"].apply(get_title)
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic["Title"] = titles
# Select the feature columns
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength", "Title"]
# Prepare the dataset
X = titanic[predictors]
y = titanic["Survived"]
# Plot the importance of each feature
selector = SelectKBest(f_classif, k=5)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
# Train the random forest again with the new features
# predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, X, y, cv=kf)
print(scores.mean())
Output:
0.8350168350168351
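Instead of hard-coding a reduced predictor list, SelectKBest.get_support() can pick the top-k features programmatically and feed them back into the forest. A minimal sketch on synthetic data (feature names f0..f4 are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Five synthetic features; only f0 and f1 actually drive the label
X = rng.rand(150, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
names = ["f0", "f1", "f2", "f3", "f4"]

# Keep the k features with the strongest ANOVA F-score
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = [n for n, keep in zip(names, selector.get_support()) if keep]
print(kept)

# Retrain on the reduced feature matrix
X_top = selector.transform(X)
alg = RandomForestClassifier(random_state=1, n_estimators=50)
scores = cross_val_score(alg, X_top, y, cv=3)
print(scores.mean())
```

This is the programmatic version of the commented-out predictor list in the code above: let the selector choose the columns, then cross-validate the forest on just those.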