【西瓜书】第4章决策树

作者: 一杭oneline | 来源:发表于2019-05-17 10:29 被阅读0次

西瓜书-决策树
【西瓜书】第4章决策树
西瓜书第4章-决策树详解
决策树
西瓜书笔记01：logistic回归、决策树
西瓜书（4.1-4.3）关于决策树
决策树
相亲可以用决策树？
《机器学习》西瓜书学习笔记（三）
【机器学习实战】第3章决策树（Decision Tree）

一、决策树初步认识

叶子节点：存放决策结果

非叶子节点：特征属性，及其对应输出，按照输出选择分支

决策过程：从根节点出发，根据数据的各个属性，计算结果，选择对应的输出分支，直到到达叶子节点，得到结果

决策树的目的：为了产生一棵泛化能力强，处理未见示例能力强的决策树。

二、构建决策树

西瓜书中西瓜分类决策树

通过上述例子，构建过程的关键步骤是选择分裂属性，即纹理、根蒂、触感、色泽等属性的选择先后次序。分裂属性是在某个节点处按照某一特征属性的不同划分构造不同的分支，其目标是让各个分裂子集尽可能的“纯”，即每个子集尽量都属于同一分类项。

分裂属性分3种情况：

属性是离散值且不要求生成二叉树

属性的每个值作为一个分支

属性是离散值且要求生成二叉树

按照“属于”和“不属于”分成2个分支

属性是连续值

注意：决策树使用自顶向下递归分治法，并采用不回溯的贪心策略。分裂属性的选择算法很多，书中介绍3种常用的算法：信息增益（Information gain）、增益比率（gain ratio）、基尼指数（Gini index）

算法

决策树算法

吐槽：在决策树学习过程中，看了很多代码，其中的关键就是递归的构建节点，在这个环节中很重要的是确定叶子节点的数据格式，需要包含该节点的标签、上级节点、下节点，虽然图画起来很简单，在最后返回时也只返回整个树的根节点，但根节点中的数据结构存着所有的叶子节点和分支label。

如：

class Node(object):

'''

definition of decision node class

attr: attribution as parent for a new branching

attr_down: dict: {key, value}

key: categoric: categoric attr_value

continuous: '<= div_value' for small part

'> div_value' for big part

value: children (Node class)

label：class label (the majority of current sample labels)

'''

def __init__(self, attr_init=None, label_init=None, attr_down_init={} ):

self.attr = attr_init

self.label = label_init

self.attr_down = attr_down_init

使用的就是字典结构

第8步，选择属性遍历其中取值，可以理解为选择根蒂为划分属性，蜷缩、稍蜷、硬挺为划分值，这种情况下生成子节点，如果蜷缩的全为同一取值（如好瓜），那么就是叶子节点，如果不是再回到上步，以此类推。

在这个属性选择的过程中，使用的就是信息增益（Information gain）、增益比率（gain ratio）、基尼指数（Gini index）3种算法计算划分的属性，个人感觉这三种算法也是构成不同决策树算法的关键。

西瓜书

以下开始讨论信息增益（Information gain）、增益比率（gain ratio）、基尼指数（Gini index）3种算法

信息熵

信息增益

信息增益率

需要注意的是信息增益率对取值较小的有偏好，所以信息增益和信息增益率都不是最佳的选择

C4.5算法并不是直接选择增益率最大的属性划分，而是使用了启发式的：先从候选属性中找出信息增益高于平均值的属性，再从中选择信息增益率最高的

西瓜书基尼指数

CART树介绍

下面按照书本内容，开始学习剪枝的内容

剪枝

预剪枝在树没有生成之前，使用验证集判断这个叶子节点要不要产生

使用的是验证集对生成树进行验证，在预减枝过程中，一定注意使用的是哪个数据集，不能用训练集来决定是否进行减枝；

划分前和划分后进行比较，在进行下次递归迭代前，进行判断。

数据集合

后剪枝生成树后从下而上进行减枝，这种情况下，虽然泛化能力强，但是时间长

------第二遍读西瓜书更改--------

预剪枝的过拟合风险下降，欠拟合的风险上升。

后剪枝的过拟合风险下降，欠拟合风险不变

泛化能力后剪枝强于预剪枝。

1.常见决策树算法有哪些？它们的划分准则分别是什么，是否有缺陷

常见决策树算法有：ID3、C4.5、CART

划分准则分别是：信息增益、信息增益率、基尼指数

信息增益缺点：对可取值较多的属性有偏好，比如编号、日期

信息增益率缺点：对可取值较少的属性有偏好

2.决策树为什么要剪枝？有几种方法？简述一下，并分析其优缺点

为了防止过拟合。有两种方法：预剪枝、后剪枝。

预剪枝：预剪枝的核心思想是在树中进行扩展之前，先计算当前的划分是否能带来模型泛化能力的提升，如果不能则不再生长子树。此时可能存在不同类别的样本同时存在结点中，按照多数投票原则判断该结点所属类别。优点：思想直接、算法简单、效率高等特点，适合解决大规模问题。缺点：有欠拟合风险

后剪枝：后剪枝的核心思想是让算法生成一棵完全生长的决策树，然后从最底层向上计算是否剪枝。剪枝过程将子树删除，用一个叶子结点代替，该结点的类别同样按照多数投票的原则进行判断。同样，后剪枝也可以通过在测试集的准确率进行判断，如果剪枝后准确率有提升则进行剪枝。优点：相比于预剪枝，泛化能力强。缺点：时间开销大

------------------------------------------------------------2019.09.22---------------------------------------------------------------------------

以下讨论的是连续值和缺失值的处理

连续值

相等的划入左侧（是）

缺失值的处理

可以先看例子

缺失值的例子

色泽属性中正样例6个，负样例8个

个人理解这个权重如果出现在一个节点中，会影响这个节点是否分支，其权重是否作为正样本，参加到信息熵或者基尼指数的计算过程中。

下一步划分树节点的过程中有缺失值的样本权重会进行更新，

下面说明带着权重怎么计算

纹理是模糊的

向下分支的过程

Python 中 CART树

代码示例：

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2)

tree_clf.fit(X, y)

tree_clf.predict_proba([[5, 1.5]]) #通过生成树预测某一个值的分类

CART树计算使用的基尼指数

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters (described in a moment) control additional stopping conditions (min_samples_split, min_sam ples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

——《Hands-on-Machine-Learning-with-Sciki》

So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.

——《Hands-on-Machine-Learning-with-Sciki》

ID3算法使用信息增益来选择特征，信息增益大的优先选择，这种方式会使得特征值较多的特征容易被选择。

在C4.5算法中，采用了信息增益比来选择特征，改进了ID3容易选择特征值多的特征的问题。C4.5也是优先选择较大的。

上述2者都是基于信息论的熵模型的，这里面会涉及对数运算，因此计算成本较大。

CART分类树算法使用基尼系数，既减少了计算成本，又保留了熵这种运算形式的优点。基尼系数代表了模型的不纯度，基尼系数越小，则不纯度越低，特征越好。

选择基尼指数或者信息熵并不会有很大的区别，但是基尼指数计算速度

关于Python中CART参数的说明

The DecisionTreeClassifier class has a few other parameters that similarly restrict the shape of the Decision Tree: min_samples_split (the minimum number of samples a node must have before it can be split),

min_samples_leaf (the minimum number of samples a leaf node must have),

min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighte dinstances),

max_leaf_nodes (maximum number of leaf nodes),

max_features (maximum number of features that are evaluated for splitting at each node).

Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.

使用Python 绘制决策树

from sklearn.tree import export_graphviz

export_graphviz( tree_clf, out_file=image_path("iris_tree.dot"), feature_names=iris.feature_names[2:], class_names=iris.target_names, rounded=True, filled=True )

Then you can convert this .dot file to a variety of formats such as PDF or PNG using the dot command-line tool from the graphviz package.1 This command line converts the .dot file to a .png image file:

$ dot -Tpng iris_tree.dot -o iris_tree.png

CART树的回归功能

代码示例：

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)

tree_reg.fit(X, y)

This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the average target value of the 110 training instances associated to this leaf node. This prediction results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.

回归树是预测一个值，而不是一个类，使用均方差（MSE）划分类别