ID3 C4.5 CART 比较

ID3（以信息增益为准则选择信息增益最大的属性）

缺点

信息增益对==可取值数目较多的属性==有所偏好，比如通过ID号可将每个样本分成一类，但是没有意义。（具体原因请查信息增益的数学公式）
ID3只能对离散属性（标称型数据）的数据集构造决策树，即只能应用于分类问题。
ID3对缺失值敏感

C.5(以信息增益率为准则选择信息增益率最大的属性)

对ID3的改进

对于离散值处理与ID3一致；可以处理连续数值型属性，方法：
C4.5可以允许变量存在缺失值，会把缺失值单独作为一类或者按概率分到每一个子树上。

CART Classification and Regression tree (以基尼不纯度为准则选择基尼增益最大的属性)

决策树分为分类树和回归树，即可以处理分类问题，也可以处理回归问题。
实际场合中CART 用的较多。

决策树的递归构建

终止条件（满足任意一个）：

遍历完所有划分数据集的属性
每个分支下的所有实例都具有相同的分类（熵为0）

注明：遍历完所有划分数据集的属性，类标签仍不唯一，一般采用多数表决的方法决定该叶子节点的分类

决策树过拟合问题

表现：训练误差较低但泛化误差却比较大的情况
原因：出现噪声数据；缺乏有代表性的样本
解决办法：剪枝（预剪枝，后剪枝）
预剪枝：在构造过程中，当某个节点满足剪枝条件，则直接停止此分支的构造

设置阈值：当观察到的不纯性度量的增益（或估计的泛化误差的改进）低于某个确定的阈值时就停止扩展叶节点
设置树的深度限制，设置samles个数限制
==问题==：阈值太高将导致拟合不足的模型，而阈值太低就不能充分的解决过拟合的问题

后剪枝：先构造完整的决策树，再通过某些条件==自底向上的方式==修剪决策树。

若该结点对应的子树替换为叶结点能提升决策树的泛化能力，则将改子树替换为叶结点。

决策树的局限和优势

局限

精确度相比其他方法较低
树可能非常不健壮：训练集出现小的波动，影响决策结果
学习最优决策树的问题被认为是 NP-complete的：利用启发式算法，贪婪算法，容易出现局部最优解，非全局最优解。
容易出现过拟合的问题：决策树学习者可以创建过于复杂的树，从训练数据中不能很好地推广。

优势

易于理解和解释：人们能够通过简单的解释理解决策树模型。树也可以以非专家易于解释的方式以图形显示
能够处理回归和分类问题。
需要很少的数据准备
使用白盒模型：如果一个给定的情况在模型中是可观察的，那么这个条件的解释可以用布尔逻辑来解释
可以使用统计测试来验证模型
用大数据集执行效果很好
比其他方法更能反映人类的决策

python 实现DT

# -*- coding: utf-8 -*-
"""
Created on 2017/12/27

@author: donghao
"""
import numpy as np
from math import log
import operator


def create_dataset():
    dataset = [[1, 1, 'y'],
               [1, 1, 'y'],
               [1, 0, 'n'],
               [0, 1, 'n'],
               [0, 1, 'n']]
    labels = ['no surfacing', 'flippers'] # features_name
    return dataset, labels


def calc_shannon_ent(dataset):
    """
    # 计算数据集的熵
    :param dataset: 待划分数据集
    :return: 熵
    """
    num_entries = len(dataset)
    label_counts = {}
    for line in dataset:
        current_label = line[-1]
        if current_label not in label_counts.keys():
            label_counts[current_label] = 0
        label_counts[current_label] += 1
        shannon_ent = 0.0
    for key in label_counts:
        prob = float(label_counts[key])/num_entries
        shannon_ent -= prob*log(prob, 2)
    return shannon_ent

def split_dataset(dataset, axis, value):
    """
    按照给定的特征划分数据集
    :param dataset: 待划分数据集
    :param axis: 划分数据集的特性（f1 or f2）
    :param value: 需要返回的特征值（比如axis取f1时，是返回f1=0 or f1=1）
    :return: 返回按axis分，属于value的子数据集
    """
    ret_dataset = []
    for line in dataset:
        if line[axis] == value:
            reduced_line = line[:axis]
            reduced_line.extend(line[axis + 1:])
            ret_dataset.append(reduced_line)  # append and extend
    return ret_dataset


def choose_best_feature_to_split(dataset):
    """
    # 分别计算按每个feature的分裂之后的信息增益（ID3），选择最大的
    :param dataset:
    :return:
    """
    num_features = len(dataset[0]) - 1
    base_entropy = calc_shannon_ent(dataset)
    best_info_gain = 0.0
    best_feature = -1
    for i in range(num_features):
        # 将dataset 的第i列放在一个list中
        feat_list = [example[i] for example in dataset]
        # list 中不重复的数据
        unique_vals = set(feat_list)
        new_entropy = 0.0
        for value in unique_vals:
            sub_dataset = split_dataset(dataset, i, value)
            prob = len(sub_dataset)/float(len(dataset))
            new_entropy += prob * calc_shannon_ent(sub_dataset)
        info_gain = base_entropy - new_entropy
        if info_gain>best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature


def majority_cnt(class_list):
    """
    多数判决（所有features都使用完了）
    :param class_list:
    :return:
    """
    class_count = {}
    for vote in class_list:
        if vote not in class_count.keys():
            class_count[vote] = 0
            class_count[vote] += 1
        sorted_class_count = sorted(class_count.items(),
                                    key=operator.itemgetter(1),
                                    reverse=True)
        return sorted_class_count[0][0]


def create_tree(dataset, labels):
    class_list = [example[-1] for example in dataset]
    # 类别完全相同 停止继续划分
    if class_list.count(class_list[0]) == len(class_list):
        return class_list[0]
    # 遍历完所有特征时返回出现次数最多的类别，比如剩('yes'),('yes'),('no')
    if len(dataset[0]) == 1:
        return majority_cnt(class_list)
    best_feature = choose_best_feature_to_split(dataset)
    best_feature_label = labels[best_feature]
    my_tree = {best_feature_label: {}}
    del (labels[best_feature])
    feature_values = [example[best_feature] for example in dataset]
    unique_values = set(feature_values)
    for value in unique_values:
        sub_labels = labels[:]
        my_tree[best_feature_label][value] = create_tree(
            split_dataset(dataset, best_feature, value),
            sub_labels)
    return my_tree


if __name__=='__main__':
    dataset, labels = create_dataset()
    # calc_shannon_ent test
    print(calc_shannon_ent(dataset))

    # split_dataset test
    print(split_dataset(dataset, 1, 1))
    print(split_dataset(dataset, 1, 0))

    # choose_best_feature_to_split test
    print(choose_best_feature_to_split(dataset))

    # create_tree test
    print(create_tree(dataset, labels))