Machine learning: Titanic Data Analysis (Part 4): Model Training

Author: zhk779 | Published 2020-03-01 20:22

Previous section: Data cleaning and feature selection

I. Model Training

After the data transformations of the previous sections, this problem can be handled by either regression or classification algorithms. All of the following are reasonable candidates (the imports used throughout this section are collected right after the list):

Logistic Regression
KNN or k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forrest
Perceptron
Artificial neural network
RVM or Relevance Vector Machine
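
The snippets below assume pandas plus the standard scikit-learn imports, with train_df and test_df being the cleaned frames from the previous section. A minimal sketch of the imports for the models actually trained below (RVM and the artificial neural network from the list are not covered in this post):

import pandas as pd

# scikit-learn estimators used in this section
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier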

Prepare the training and test data:

X_train = train_df.drop("Survived", axis=1)           # features: everything except the label
Y_train = train_df["Survived"]                        # label: survived (1) or not (0)
X_test  = test_df.drop("PassengerId", axis=1).copy()  # test features; PassengerId stays in test_df for the submission
X_train.shape, Y_train.shape, X_test.shape            # sanity-check the shapes

1. Logistic Regression

Note: the model scores here are computed on the training data (we have no target values for the test set).

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)  # round to two decimal places
acc_log  # output: 80.36
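
Since 80.36 is a training-set score, it is likely optimistic. As an optional sanity check (a sketch using scikit-learn's cross_val_score, not part of the original walkthrough), a 5-fold cross-validated accuracy gives a less biased estimate:

from sklearn.model_selection import cross_val_score

# average accuracy over 5 train/validation splits of the training data
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, Y_train, cv=5)
round(cv_scores.mean() * 100, 2)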

Running the following code lets us check whether the hypotheses from our earlier analysis hold:

coeff_df = pd.DataFrame(train_df.columns.delete(0))   # feature names: all columns except the first (Survived)
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])  # logistic-regression coefficients (log-odds)

coeff_df.sort_values(by='Correlation', ascending=False)
(Output: the features sorted by coefficient, descending)

From the table above we can see:
(1) Sex has the strongest relationship with the target: as Sex increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
(2) Pclass behaves the opposite way: as Pclass increases, the probability of survival decreases.
(3) The engineered Age*Class feature works well; it has the second-highest negative correlation of all features.
(4) Title has the second-highest positive correlation.
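
These "Correlation" values are logistic-regression coefficients, i.e. log-odds. As an optional aside (a small sketch reusing the coeff_df built above; it additionally assumes numpy), exponentiating a coefficient gives the factor by which the odds of Survived=1 are multiplied per one-unit increase in that feature, which can be easier to interpret:

import numpy as np

# exp(coefficient) = multiplicative change in the odds of Survived=1
coeff_df["OddsRatio"] = np.exp(coeff_df["Correlation"])
coeff_df.sort_values(by='OddsRatio', ascending=False)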

2. Support Vector Machines (SVM)

# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc  # output: 83.84

3. k-Nearest Neighbors (KNN)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn  # output: 84.74

4. Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian  # output: 72.28

5. Perceptron

# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron  # output: 78.0

6. Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc  # output: 79.12

7. Stochastic Gradient Descent

# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd  # output: 78.56

8. Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree  # output: 86.76

9. Random Forest

# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest  # output: 86.76
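
Just as the logistic-regression coefficients did, the fitted forest's feature_importances_ attribute offers a quick check of which features drive the predictions (a short optional sketch, not in the original post):

# impurity-based importances, aligned with the training columns
importances = pd.Series(random_forest.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False)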

II. Ranking the Models

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
(Output: the models sorted by Score, descending)

As the ranking shows, Decision Tree and Random Forest tie for the best score (86.76). Keep in mind these are training-set scores; since a single decision tree overfits more easily, Random Forest is the safer pick of the two.
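
To turn the predictions into a Kaggle submission (a minimal sketch, not part of the original post; it assumes Y_pred is the Random Forest output above and that test_df still contains PassengerId):

# two-column CSV in the format the Titanic competition expects
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv("submission.csv", index=False)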

