At its core, the decision tree algorithm has to solve two problems: finding the best node and the best branch.
- How to find the best node and the best branch from the data table
- How to stop the decision tree from growing, to prevent overfitting
1. Decision trees in sklearn
tree.DecisionTreeClassifier: classification tree
tree.DecisionTreeRegressor: regression tree
tree.export_graphviz: exports a fitted tree in DOT format, for drawing
tree.ExtraTreeClassifier: a highly randomized version of the classification tree
tree.ExtraTreeRegressor: a highly randomized version of the regression tree
The basic modeling workflow:
from sklearn import tree                    # import the module we need
clf = tree.DecisionTreeClassifier()         # instantiate
clf = clf.fit(x_train, y_train)             # fit the model on the training data
result = clf.score(x_test, y_test)          # score the fitted model on the test data
2. Important parameters
2.1 criterion
The criterion parameter decides how impurity is computed; sklearn offers two options (their formulas are given below):
- "entropy": use information entropy
- "gini": use the Gini impurity
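For reference, writing p(i|t) for the fraction of class-i samples at node t and c for the number of classes, the two impurity measures are:

Entropy(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)

Gini(t) = 1 - \sum_{i=0}^{c-1} p(i|t)^2

The two usually produce similar trees in practice; entropy involves a logarithm, so it is slightly slower to compute and somewhat more sensitive to impurity than the Gini measure.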
Import the libraries and modules we need
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
Explore the data
wine = load_wine()
wine.data.shape   # (178, 13): 178 samples, 13 features
wine.target       # three classes, labeled 0, 1, 2
## if wine were a single table, it would look like this
import pandas as pd
pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)
wine.feature_names
wine.target_names
Split into training and test sets
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)
Xtrain.shape   # about (124, 13): 70% of the 178 samples
Xtest.shape    # about (54, 13): the remaining 30%
Train and score the tree
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=30)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)   # accuracy on the test set
Draw the tree
feature_name = ['alcohol', 'malic acid', 'ash', 'alcalinity of ash', 'magnesium', 'total phenols',
                'flavanoids', 'nonflavanoid phenols', 'proanthocyanins', 'color intensity', 'hue',
                'od280/od315 of diluted wines', 'proline']
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None
                                ,feature_names=feature_name
                                ,class_names=["Gin", "Sherry", "Vermouth"]
                                ,filled=True     # fill each node with a color per class
                                ,rounded=True)   # rounded node corners
graph = graphviz.Source(dot_data)
graph
Explore the decision tree
clf.feature_importances_
[*zip(feature_name, clf.feature_importances_)]
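To read the importances more easily, one can sort the (importance, feature) pairs; a minimal follow-up using the clf fitted above:

sorted(zip(clf.feature_importances_, feature_name), reverse=True)   # most important feature first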
2.2 random_state & splitter
random_state sets the seed for the randomness used when branching; fixing it makes the tree and its score reproducible.
splitter also controls randomness in the tree. It takes two values: "best" (prefer the more important features when branching) and "random" (branch more randomly, which grows a deeper tree and can help against overfitting).
clf = tree.DecisionTreeClassifier(criterion='entropy'
                                  ,random_state=30
                                  ,splitter='random')
clf = clf.fit(Xtrain,Ytrain)
score = clf.score(Xtest,Ytest)
score
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None
                                ,feature_names=feature_name
                                ,class_names=["Gin", "Sherry", "Vermouth"]
                                ,filled=True
                                ,rounded=True)
graph = graphviz.Source(dot_data)
graph
2.3 Pruning parameters
To give the decision tree better generalization, we prune it. The pruning strategy has a huge impact on the tree; getting it right is the core of optimizing the decision tree algorithm. sklearn provides several pruning parameters:
- max_depth limits the maximum depth of the tree; all branches deeper than the limit are cut off
- min_samples_leaf requires every child node produced by a split to contain at least min_samples_leaf training samples
- min_samples_split requires a node to contain at least min_samples_split training samples before it may be split
clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,splitter="random"
                                  ,max_depth=3
                                  ,min_samples_leaf=10
                                  ,min_samples_split=10
                                  )
clf = clf.fit(Xtrain, Ytrain)
dot_data = tree.export_graphviz(clf, out_file=None
                                ,feature_names=feature_name
                                ,class_names=["Gin", "Sherry", "Vermouth"]
                                ,filled=True
                                ,rounded=True
                                )
graph = graphviz.Source(dot_data)
graph
clf.score(Xtrain, Ytrain)   # accuracy on the training set
clf.score(Xtest, Ytest)     # accuracy on the test set; a large gap between the two signals overfitting
max_features & min_impurity_decrease
- max_features limits the number of features considered when branching; features beyond the limit are discarded
- min_impurity_decrease puts a floor on the information gain: a node is split only if the split decreases impurity by at least this amount (see the sketch after this list)
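Neither parameter is exercised in the walkthrough above, so here is a minimal sketch of how they slot into the same wine workflow; the specific values (8 features, a 0.01 impurity floor) are illustrative assumptions, not tuned recommendations:

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,max_features=8               # assumption: consider at most 8 of the 13 features per split
                                  ,min_impurity_decrease=0.01   # assumption: split only if impurity falls by at least 0.01
                                  )
clf = clf.fit(Xtrain, Ytrain)
clf.score(Xtest, Ytest)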
Finding the optimal pruning parameters
How do we pick concrete values? Plot a learning curve over the parameter and read off where the test score peaks:
import matplotlib.pyplot as plt
test = []
for i in range(10):                                   # try max_depth = 1 .. 10
    clf = tree.DecisionTreeClassifier(max_depth=i+1
                                      ,criterion="entropy"
                                      ,random_state=30
                                      ,splitter="random"
                                      )
    clf = clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    test.append(score)
plt.plot(range(1,11), test, color="red", label="max_depth")
plt.legend()
plt.show()
2.4 class_weight & min_weight_fraction_leaf
These are the target-weight parameters. class_weight assigns weights to the classes to compensate for imbalanced samples (pass a dict or "balanced"; the default None treats all classes equally). Once class weights are applied, pruning by raw sample counts is no longer meaningful, so min_weight_fraction_leaf, the weight-aware counterpart of min_samples_leaf, requires each leaf to hold at least that fraction of the total sample weight.
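A minimal sketch of turning on class weighting in the same workflow; "balanced" is sklearn's built-in mode that weights classes inversely to their frequencies (whether it helps on this fairly balanced dataset is not claimed):

clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,class_weight="balanced"   # reweight classes inversely to their frequencies
                                  )
clf = clf.fit(Xtrain, Ytrain)
clf.score(Xtest, Ytest)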