This project comes from the Kaggle competition "Titanic - Machine Learning from Disaster". This article walks through data cleaning and analysis for that challenge.
- Task: predict whether a passenger survived
- Data: basic passenger information and ticket information
- Evaluation metric: accuracy
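Accuracy is simply the fraction of passengers whose survival is predicted correctly. A minimal check with scikit-learn (the labels below are made up purely to illustrate the metric):

```python
from sklearn.metrics import accuracy_score

# Made-up labels purely to illustrate the metric
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75
```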
The data needs preprocessing:
- Data exploration
- Missing values
- Columns whose values are English strings (e.g. Sex, Embarked)
- Feature selection
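The missing-value and string-column steps above can be sketched as follows (a toy frame with hypothetical values stands in for the real training set):

```python
import pandas as pd

# Toy frame mimicking a few Titanic columns (hypothetical values for illustration)
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", None, "C"],
    "Sex": ["male", "female", "female"],
})

# Fill numeric gaps with the median, categorical gaps with the mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode the English-valued Sex column as integers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(df)
```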
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train_data is the Kaggle training set, e.g. train_data = pd.read_csv("train.csv")

# Visualize the distribution of Survived with a pie chart
train_data["Survived"].value_counts().plot(kind="pie", label="Survived")
plt.show()
# Survival rate by Pclass (bar chart; barplot plots the mean of Survived, i.e. the rate)
sns.barplot(x="Pclass", y="Survived", data=train_data)
plt.show()
# Survival rate by Embarked (bar chart)
sns.barplot(x="Embarked", y="Survived", data=train_data)
plt.show()
Show the pairwise correlation between features:

plt.figure(figsize=(10, 10))
plt.title('Pearson Correlation between Features', y=1.05, size=15)
# One-hot encode the string-valued columns so they can enter the correlation matrix
train_data_hot_encoded = train_features.drop('Embarked', axis=1).join(train_features.Embarked.str.get_dummies())
train_data_hot_encoded = train_data_hot_encoded.drop('Sex', axis=1).join(train_data_hot_encoded.Sex.str.get_dummies())
# Compute the Pearson correlation coefficients between the features
sns.heatmap(train_data_hot_encoded.corr(), linewidths=0.1, vmax=1.0, fmt='.2f', square=True, linecolor='white', annot=True)
plt.show()
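The two `str.get_dummies` joins above can also be done in a single call with `pd.get_dummies` (a sketch on a toy frame with hypothetical values):

```python
import pandas as pd

# Toy frame standing in for the Titanic features (hypothetical values)
df = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"], "Fare": [7.25, 71.28]})

# One-hot encode both string columns in one call
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])
print(list(encoded.columns))
```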
![](https://img.haomeiwen.com/i27579716/5b500d06570d2665.png)
Train the model and plot the importance of each feature:

from sklearn.tree import DecisionTreeClassifier

def train(train_features, train_labels):
    # Build a CART decision tree
    clf = DecisionTreeClassifier()
    # Fit the tree
    clf.fit(train_features, train_labels)
    # feature_importances_ holds the importance of each feature
    coeffs = clf.feature_importances_
    df_co = pd.DataFrame(coeffs, columns=["importance_"])
    # Use the feature names as the index
    df_co.index = train_features.columns
    df_co.sort_values("importance_", ascending=True, inplace=True)
    df_co.importance_.plot(kind="barh")
    plt.title("Feature Importance")
    plt.show()
    return clf
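`feature_importances_` is index-aligned with the training columns and sums to 1, which is why it can be attached to the column names as above. A self-contained sketch on toy data (the columns and labels are made up for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny toy set (hypothetical values) to show feature_importances_ aligns with columns
X = pd.DataFrame({"Pclass": [1, 3, 3, 1], "Sex": [1, 0, 1, 0]})
y = [1, 0, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_)))
```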
![](https://img.haomeiwen.com/i27579716/706f4b98d7c0e0ff.png)
Decision tree visualization

Render the tree with pydotplus + GraphViz (the GraphViz binaries must also be installed on the system):

pip install pydotplus graphviz
import pydotplus
from six import StringIO
from sklearn.tree import export_graphviz

def show_tree(clf):
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("titanic_tree.pdf")

show_tree(clf)  # clf is the fitted DecisionTreeClassifier from the training step
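If installing the GraphViz binaries is inconvenient, scikit-learn's built-in `plot_tree` renders the same structure through matplotlib alone. A sketch on toy data (not the original features; the `Agg` backend is used so it runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy data standing in for (Pclass, Sex) features
clf = DecisionTreeClassifier(random_state=0)
clf.fit([[1, 1], [3, 0], [2, 1], [3, 0]], [1, 0, 1, 0])

plt.figure(figsize=(8, 6))
plot_tree(clf, feature_names=["Pclass", "Sex"], filled=True)
plt.savefig("titanic_tree.png")
```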