This project comes from the Kaggle competition "Titanic - Machine Learning from Disaster". This article walks through data cleaning and analysis for that challenge.
- Task: predict whether a passenger survived
- Data: basic passenger information and ticket information
- Evaluation metric: accuracy
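Accuracy is simply the fraction of passengers whose survival is predicted correctly. A minimal check with scikit-learn (the labels below are made up purely to illustrate the metric):

```python
from sklearn.metrics import accuracy_score

# Made-up labels purely to illustrate the metric
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75
```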
The data needs preprocessing:
- Data exploration
- Missing values
- Columns whose values are English strings (e.g. Sex, Embarked)
- Feature selection
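The missing-value and string-column steps above can be sketched as follows (a toy frame with hypothetical values stands in for the real training set):

```python
import pandas as pd

# Toy frame mimicking a few Titanic columns (hypothetical values for illustration)
df = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", None, "C"],
    "Sex": ["male", "female", "female"],
})

# Fill numeric gaps with the median, categorical gaps with the mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode the English-valued Sex column as integers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(df)
```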
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train_data is the Kaggle training set, e.g. train_data = pd.read_csv("train.csv")

# Visualize the distribution of Survived with a pie chart
train_data["Survived"].value_counts().plot(kind="pie", label="Survived")
plt.show()
# Survival rate by Pclass (bar chart; barplot plots the mean of Survived, i.e. the rate)
sns.barplot(x="Pclass", y="Survived", data=train_data)
plt.show()
# Survival rate by Embarked (bar chart)
sns.barplot(x="Embarked", y="Survived", data=train_data)
plt.show()
Show the pairwise correlation between features:

plt.figure(figsize=(10, 10))
plt.title('Pearson Correlation between Features', y=1.05, size=15)
# One-hot encode the string-valued columns so they can enter the correlation matrix
train_data_hot_encoded = train_features.drop('Embarked', axis=1).join(train_features.Embarked.str.get_dummies())
train_data_hot_encoded = train_data_hot_encoded.drop('Sex', axis=1).join(train_data_hot_encoded.Sex.str.get_dummies())
# Compute the Pearson correlation coefficients between the features
sns.heatmap(train_data_hot_encoded.corr(), linewidths=0.1, vmax=1.0, fmt='.2f', square=True, linecolor='white', annot=True)
plt.show()
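The two `str.get_dummies` joins above can also be done in a single call with `pd.get_dummies` (a sketch on a toy frame with hypothetical values):

```python
import pandas as pd

# Toy frame standing in for the Titanic features (hypothetical values)
df = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"], "Fare": [7.25, 71.28]})

# One-hot encode both string columns in one call
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])
print(list(encoded.columns))
```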
![](https://img.haomeiwen.com/i27579716/5b500d06570d2665.png)
Train the model and plot the importance of each feature:

from sklearn.tree import DecisionTreeClassifier

def train(train_features, train_labels):
    # Build a CART decision tree
    clf = DecisionTreeClassifier()
    # Fit the tree
    clf.fit(train_features, train_labels)
    # feature_importances_ holds the importance of each feature
    coeffs = clf.feature_importances_
    df_co = pd.DataFrame(coeffs, columns=["importance_"])
    # Use the feature names as the index
    df_co.index = train_features.columns
    df_co.sort_values("importance_", ascending=True, inplace=True)
    df_co.importance_.plot(kind="barh")
    plt.title("Feature Importance")
    plt.show()
    return clf
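`feature_importances_` is index-aligned with the training columns and sums to 1, which is why it can be attached to the column names as above. A self-contained sketch on toy data (the columns and labels are made up for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny toy set (hypothetical values) to show feature_importances_ aligns with columns
X = pd.DataFrame({"Pclass": [1, 3, 3, 1], "Sex": [1, 0, 1, 0]})
y = [1, 0, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_)))
```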
![](https://img.haomeiwen.com/i27579716/706f4b98d7c0e0ff.png)
Decision tree visualization

Render the tree with pydotplus + GraphViz (the GraphViz binaries must also be installed on the system):

pip install pydotplus graphviz
import pydotplus
from six import StringIO
from sklearn.tree import export_graphviz

def show_tree(clf):
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("titanic_tree.pdf")

show_tree(clf)  # clf is the fitted DecisionTreeClassifier from the training step
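If installing the GraphViz binaries is inconvenient, scikit-learn's built-in `plot_tree` renders the same structure through matplotlib alone. A sketch on toy data (not the original features; the `Agg` backend is used so it runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy data standing in for (Pclass, Sex) features
clf = DecisionTreeClassifier(random_state=0)
clf.fit([[1, 1], [3, 0], [2, 1], [3, 0]], [1, 0, 1, 0])

plt.figure(figsize=(8, 6))
plot_tree(clf, feature_names=["Pclass", "Sex"], filled=True)
plt.savefig("titanic_tree.png")
```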