Machine Learning 3: Decision Trees

Author: Re0 | Published: 2018-08-10 10:16

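    Overview

    A decision tree model has a tree structure. In classification problems it represents the process of classifying instances based on their features. It can be viewed as a set of if-then rules, or as a conditional probability distribution defined over the feature space and the class space.

    Decision tree learning usually consists of three steps: feature selection, tree generation, and tree pruning.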

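    Definition of a decision tree

    A classification decision tree is a tree structure that describes how instances are classified. It consists of nodes and directed edges. There are two kinds of node: internal nodes and leaf nodes. An internal node represents a feature (attribute); a leaf node represents a class (label).

    To classify an instance with a decision tree, start at the root node, test one feature of the instance, and send the instance to the child node that matches the result; each child corresponds to one value of that feature. Repeat this test-and-assign step recursively until a leaf node is reached, then assign the instance to the class of that leaf.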
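
    The classification procedure just described can be sketched in a few lines. This is a minimal illustration, not the implementation used later in the article: the nested-dict tree layout, the `classify` function, and the toy weather features are all assumptions made up for the example.

```python
# Hypothetical toy tree: an internal node is a dict holding the feature it
# tests and one child per feature value; a leaf is just a class label (str).
toy_tree = {
    "feature": "outlook",
    "children": {
        "sunny": {
            "feature": "windy",
            "children": {"yes": "stay in", "no": "play"},
        },
        "rainy": "stay in",
        "overcast": "play",
    },
}


def classify(node, instance):
    """Recursively walk from the given node down to a leaf, following the
    branch that matches the instance's value for the tested feature."""
    if not isinstance(node, dict):       # reached a leaf: return its class
        return node
    value = instance[node["feature"]]    # test one feature of the instance
    return classify(node["children"][value], instance)


print(classify(toy_tree, {"outlook": "sunny", "windy": "no"}))
print(classify(toy_tree, {"outlook": "rainy", "windy": "yes"}))
```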

    Figure 1. Example of a decision tree

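    How decision trees work

    Information entropy & information gain

    Entropy: a measure of how disordered a system is. Different disciplines have derived more specific definitions from this idea, and it is an important quantity in many fields.

    Entropy in information theory (Shannon entropy): a measure of information that expresses how disordered the information is. The more ordered the information, the lower its entropy. For example, matches packed neatly in a matchbox have low entropy; matches scattered at random have high entropy.

    Information gain: the change in information (entropy) between the data set before a split and the data set after it.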
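
    To make the two definitions concrete: for class proportions p_i, Shannon entropy is H = -sum(p_i * log2(p_i)), and the information gain of a split is the parent's entropy minus the size-weighted entropy of the subsets. The sketch below computes both for a toy label list; the function names `entropy` and `information_gain` and the data are illustrative assumptions, not code from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels.

    Written as sum(p * log2(1 / p)), which equals -sum(p * log2(p)).
    """
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                           # evenly mixed labels
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # split that separates the classes
print(information_gain(labels, [["yes", "no"], ["yes", "no"]]))  # split that separates nothing
```

    A split that separates the classes completely recovers all of the parent's entropy as gain, while a split that leaves each subset as mixed as the parent gains nothing.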

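    How is a decision tree built?

    def createBranch():
        '''
        This is a recursive procedure (recursion, not iteration). If you are
        interested, look up recursion, or even dynamic programming.
        '''
        Check whether every item in the data set has the same class label:
            If so return the class label
            Else:
                find the best feature to split on (the one with the lowest entropy after the split, i.e. the largest information gain)
                split the data set
                create a branch node
                    for each subset of the split
                        call createBranch (recursively) and add the result to the branch node
                return the branch node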
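
    The pseudocode above can be turned into a small runnable sketch. It assumes nominal features stored in dicts, adds one base case the pseudocode leaves out (no features left, so fall back to a majority vote), and uses names (`create_branch`, `entropy`, `split`) and a toy data set that are illustrative, not taken from the article's sources.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


def split(rows, labels, feature):
    """Group rows and labels by the value each row has for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[feature], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def create_branch(rows, labels, features):
    # Base case 1: every instance has the same class label.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2 (an assumption, not in the pseudocode): no features left,
    # so return the most common label.
    if not features:
        return Counter(labels).most_common(1)[0][0]

    # Best feature = lowest weighted entropy after the split, which is the
    # same as the largest information gain.
    def entropy_after(feature):
        groups = split(rows, labels, feature)
        return sum(len(sub) / len(labels) * entropy(sub)
                   for _, sub in groups.values())

    best = min(features, key=entropy_after)
    remaining = [f for f in features if f != best]

    # One child per value of the chosen feature, built recursively.
    node = {"feature": best, "children": {}}
    for value, (sub_rows, sub_labels) in split(rows, labels, best).items():
        node["children"][value] = create_branch(sub_rows, sub_labels, remaining)
    return node


# Toy data (hypothetical): can the animal survive without surfacing,
# and does it have flippers?
rows = [
    {"surfaces": "yes", "flippers": "yes"},
    {"surfaces": "yes", "flippers": "yes"},
    {"surfaces": "yes", "flippers": "no"},
    {"surfaces": "no", "flippers": "yes"},
    {"surfaces": "no", "flippers": "yes"},
]
labels = ["fish", "fish", "not fish", "not fish", "not fish"]
tree = create_branch(rows, labels, ["surfaces", "flippers"])
print(tree)
```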
    

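    Development workflow

    Collect data: any method can be used.
    Prepare data: the tree-building algorithm used here is ID3, which only works on nominal data, so numeric data must be discretized first. Other tree-building algorithms exist, such as CART.
    Analyze data: any method can be used. Once the tree is built, check that its plot matches expectations.
    Train the algorithm: construct the tree data structure.
    Test the algorithm: compute the error rate using the trained tree.
    Use the algorithm: this step applies to any supervised learning task, and a decision tree makes the underlying structure of the data easier to understand.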
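
    Because ID3 needs nominal values, a numeric column has to be binned before the tree is built. The sketch below shows one simple way to do that with fixed cut points; the `discretize` function, the cut points, and the bin names are assumptions chosen for illustration, and other binning strategies (equal-width, equal-frequency) would work as well.

```python
from bisect import bisect_right


def discretize(values, cut_points, names):
    """Map each numeric value to the name of the bin it falls into.

    `cut_points` must be sorted in ascending order; a value equal to a cut
    point goes to the bin on its right. `names` needs exactly one more
    entry than `cut_points`.
    """
    if len(names) != len(cut_points) + 1:
        raise ValueError("need exactly one more bin name than cut points")
    return [names[bisect_right(cut_points, v)] for v in values]


temperatures = [3.5, 12.0, 18.2, 25.0, 31.4]
binned = discretize(temperatures, [10, 25], ["cold", "mild", "hot"])
print(binned)
```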
    

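    Characteristics

    Pros: low computational cost, easy-to-interpret output, tolerates missing values, and can handle irrelevant features.
    Cons: prone to overfitting.
    Works with: numeric and nominal data.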
    

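    Implementation

    Defining the decision tree classes

    import numpy as np
    # Helper functions used by the classes below. They are not defined in this
    # article; this import assumes they live in the same `utils` module that the
    # demo functions further down import from.
    from utils import divide_on_feature, calculate_entropy, calculate_variance


    class DecisionNode():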
        """Class that represents a decision node or leaf in the decision tree
        """
        def __init__(self, feature_i=None, threshold=None,
                     value=None, true_branch=None, false_branch=None):
            self.feature_i = feature_i          # Index for the feature that is tested
            self.threshold = threshold          # Threshold value for feature
            self.value = value                  # Value if the node is a leaf in the tree
            self.true_branch = true_branch      # 'Left' subtree
            self.false_branch = false_branch    # 'Right' subtree
    
    
    # Super class of RegressionTree and ClassificationTree
    class DecisionTree(object):
        """Super class of RegressionTree and ClassificationTree.
        """
        def __init__(self, min_samples_split=2, min_impurity=1e-7,
                     max_depth=float("inf"), loss=None):
            self.root = None  # Root node in dec. tree
            # Minimum n of samples to justify split
            self.min_samples_split = min_samples_split
            # The minimum impurity to justify split
            self.min_impurity = min_impurity
            # The maximum depth to grow the tree to
            self.max_depth = max_depth
            # Function to calculate impurity (classif.=>info gain, regr=>variance reduct.)
            self._impurity_calculation = None
            # Function to determine prediction of y at leaf
            self._leaf_value_calculation = None
            # If y is one-hot encoded (multi-dim) or not (one-dim)
            self.one_dim = None
            # If Gradient Boost
            self.loss = loss
    
        def fit(self, X, y, loss=None):
            """ Build decision tree """
            self.one_dim = len(np.shape(y)) == 1
            self.root = self._build_tree(X, y)
            self.loss = None
    
        def _build_tree(self, X, y, current_depth=0):
            """ Recursive method which builds out the decision tree and splits X and respective y
            on the feature of X which (based on impurity) best separates the data"""
    
            largest_impurity = 0
            best_criteria = None    # Feature index and threshold
            best_sets = None        # Subsets of the data
    
            # Check if expansion of y is needed
            if len(np.shape(y)) == 1:
                y = np.expand_dims(y, axis=1)
    
            # Add y as last column of X
            Xy = np.concatenate((X, y), axis=1)
    
            n_samples, n_features = np.shape(X)
    
            if n_samples >= self.min_samples_split and current_depth <= self.max_depth:
                # Calculate the impurity for each feature
                for feature_i in range(n_features):
                    # All values of feature_i
                    feature_values = np.expand_dims(X[:, feature_i], axis=1)
                    unique_values = np.unique(feature_values)
    
                    # Iterate through all unique values of feature column i and
                    # calculate the impurity
                    for threshold in unique_values:
                        # Divide X and y depending on if the feature value of X at index feature_i
                        # meets the threshold
                        Xy1, Xy2 = divide_on_feature(Xy, feature_i, threshold)
    
                        if len(Xy1) > 0 and len(Xy2) > 0:
                            # Select the y-values of the two sets
                            y1 = Xy1[:, n_features:]
                            y2 = Xy2[:, n_features:]
    
                            # Calculate impurity
                            impurity = self._impurity_calculation(y, y1, y2)
    
                            # If this threshold resulted in a higher information gain than previously
                            # recorded save the threshold value and the feature
                            # index
                            if impurity > largest_impurity:
                                largest_impurity = impurity
                                best_criteria = {"feature_i": feature_i, "threshold": threshold}
                                best_sets = {
                                    "leftX": Xy1[:, :n_features],   # X of left subtree
                                    "lefty": Xy1[:, n_features:],   # y of left subtree
                                    "rightX": Xy2[:, :n_features],  # X of right subtree
                                    "righty": Xy2[:, n_features:]   # y of right subtree
                                    }
    
            if largest_impurity > self.min_impurity:
                # Build subtrees for the right and left branches
                true_branch = self._build_tree(best_sets["leftX"], best_sets["lefty"], current_depth + 1)
                false_branch = self._build_tree(best_sets["rightX"], best_sets["righty"], current_depth + 1)
                return DecisionNode(feature_i=best_criteria["feature_i"], threshold=best_criteria[
                                    "threshold"], true_branch=true_branch, false_branch=false_branch)
    
            # We're at leaf => determine value
            leaf_value = self._leaf_value_calculation(y)
    
            return DecisionNode(value=leaf_value)
    
    
        def predict_value(self, x, tree=None):
            """ Do a recursive search down the tree and make a prediction of the data sample by the
                value of the leaf that we end up at """
    
            if tree is None:
                tree = self.root
    
            # If we have a value (i.e we're at a leaf) => return value as the prediction
            if tree.value is not None:
                return tree.value
    
            # Choose the feature that we will test
            feature_value = x[tree.feature_i]
    
            # Determine if we will follow left or right branch
            branch = tree.false_branch
            if isinstance(feature_value, int) or isinstance(feature_value, float):
                if feature_value >= tree.threshold:
                    branch = tree.true_branch
            elif feature_value == tree.threshold:
                branch = tree.true_branch
    
            # Test subtree
            return self.predict_value(x, branch)
    
        def predict(self, X):
            """ Classify samples one by one and return the set of labels """
            y_pred = [self.predict_value(sample) for sample in X]
            return y_pred
    
        def print_tree(self, tree=None, indent=" "):
            """ Recursively print the decision tree """
            if not tree:
                tree = self.root
    
            # If we're at leaf => print the label
            if tree.value is not None:
                print (tree.value)
            # Go deeper down the tree
            else:
                # Print test
                print ("%s:%s? " % (tree.feature_i, tree.threshold))
                # Print the true scenario
                print ("%sT->" % (indent), end="")
                self.print_tree(tree.true_branch, indent + indent)
                # Print the false scenario
                print ("%sF->" % (indent), end="")
                self.print_tree(tree.false_branch, indent + indent)
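
    `_build_tree` calls a helper, `divide_on_feature`, that is not shown in this article. A minimal sketch of what such a helper might look like follows. It is an assumption written to match how the helper is called above and how `predict_value` compares values (numeric thresholds split on `>=`, anything else on equality); it is not necessarily the exact code in the ML-From-Scratch `utils` module.

```python
import numpy as np


def divide_on_feature(Xy, feature_i, threshold):
    """Split the rows of Xy in two, depending on whether the value in
    column feature_i meets the threshold.

    Numeric thresholds send rows with value >= threshold to the first
    subset; other thresholds (e.g. strings) use equality instead. The
    isinstance check mirrors the one in predict_value.
    """
    column = Xy[:, feature_i]
    if isinstance(threshold, (int, float)):
        mask = column >= threshold
    else:
        mask = column == threshold
    return Xy[mask], Xy[~mask]


# Toy data: two feature columns plus a label column appended on the right.
Xy = np.array([[1.0, 5.0, 0.0],
               [2.0, 6.0, 0.0],
               [3.0, 7.0, 1.0],
               [4.0, 8.0, 1.0]])
left, right = divide_on_feature(Xy, feature_i=0, threshold=3.0)
print(left)   # rows whose first feature is >= 3.0
print(right)  # the remaining rows
```

    Returning both subsets with the label column still attached is what lets `_build_tree` slice the features and labels back out with `Xy1[:, :n_features]` and `Xy1[:, n_features:]`.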
    

    Decision tree classification

    class ClassificationTree(DecisionTree):
        # Classification
        def _calculate_information_gain(self, y, y1, y2):
            # Calculate information gain
            p = len(y1) / len(y)
            entropy = calculate_entropy(y)
            info_gain = entropy - p * \
                calculate_entropy(y1) - (1 - p) * \
                calculate_entropy(y2)
    
            return info_gain
    
        def _majority_vote(self, y):
            most_common = None
            max_count = 0
            for label in np.unique(y):
                # Count number of occurrences of samples with label
                count = len(y[y == label])
                if count > max_count:
                    most_common = label
                    max_count = count
            return most_common
    
        def fit(self, X, y):
            self._impurity_calculation = self._calculate_information_gain
            self._leaf_value_calculation = self._majority_vote
            super(ClassificationTree, self).fit(X, y)
    
    def decision_tree_classification():
        # Import helper functions
        from utils import train_test_split, standardize, accuracy_score
        from utils import mean_squared_error, calculate_variance, Plot
        from sklearn import datasets
        print ("-- Classification Tree --")
    
        data = datasets.load_iris()
        X = data.data
        y = data.target
    
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
    
        clf = ClassificationTree()
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
    
        accuracy = accuracy_score(y_test, y_pred)
    
        print ("Accuracy:", accuracy)
        # clf.print_tree()
    
        Plot().plot_in_2d(X_test, y_pred,
            title="Decision Tree",
            accuracy=accuracy,
            legend_labels=data.target_names)
    
    

    Decision tree regression

    class RegressionTree(DecisionTree):
        # Regression
        def _calculate_variance_reduction(self, y, y1, y2):
            var_tot = calculate_variance(y)
            var_1 = calculate_variance(y1)
            var_2 = calculate_variance(y2)
            frac_1 = len(y1) / len(y)
            frac_2 = len(y2) / len(y)
    
            # Calculate the variance reduction
            variance_reduction = var_tot - (frac_1 * var_1 + frac_2 * var_2)
    
            return sum(variance_reduction)
    
        def _mean_of_y(self, y):
            value = np.mean(y, axis=0)
            return value if len(value) > 1 else value[0]
    
        def fit(self, X, y):
            self._impurity_calculation = self._calculate_variance_reduction
            self._leaf_value_calculation = self._mean_of_y
            super(RegressionTree, self).fit(X, y)
            
    def decision_tree_regression():
        import numpy as np
        import matplotlib.pyplot as plt
        import pandas as pd
    
        from utils import train_test_split, standardize, accuracy_score
        from utils import mean_squared_error, calculate_variance, Plot
    
        print ("-- Regression Tree --")
    
        # Load temperature data
        data = pd.read_csv('../../data/TempLinkoping2016.txt', sep="\t")
    
        # .values replaces as_matrix(), which was removed in pandas 1.0
        time = np.atleast_2d(data["time"].values).T
        temp = np.atleast_2d(data["temp"].values).T
    
        X = standardize(time)        # Time. Fraction of the year [0, 1]
        y = temp[:, 0]  # Temperature. Reduce to one-dim
    
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
        model = RegressionTree()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
        y_pred_line = model.predict(X)
    
        # Color map
        cmap = plt.get_cmap('viridis')
    
        mse = mean_squared_error(y_test, y_pred)
    
        print ("Mean Squared Error:", mse)
    
        # Plot the results
        m1 = plt.scatter(366 * X_train, y_train, color=cmap(0.9), s=10)
        m2 = plt.scatter(366 * X_test, y_test, color=cmap(0.5), s=10)
        m3 = plt.scatter(366 * X_test, y_pred, color='black', s=10)
        plt.suptitle("Regression Tree")
        plt.title("MSE: %.2f" % mse, fontsize=10)
        plt.xlabel('Day')
        plt.ylabel('Temperature in Celsius')
        plt.legend((m1, m2, m3), ("Training data", "Test data", "Prediction"), loc='lower right')
        plt.show()
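
    The split criterion used by `RegressionTree` can be checked by hand on a tiny example. The sketch below recomputes variance reduction for a one-dimensional target using only the standard library; `variance_reduction` and the numbers are illustrative assumptions, and it uses the population variance (dividing by n), which may differ from what the `calculate_variance` helper in `utils` does.

```python
from statistics import pvariance


def variance_reduction(y, y1, y2):
    """Variance of the parent minus the size-weighted variance of the two
    subsets, using population variance."""
    frac_1 = len(y1) / len(y)
    frac_2 = len(y2) / len(y)
    return pvariance(y) - (frac_1 * pvariance(y1) + frac_2 * pvariance(y2))


y = [1.0, 1.0, 5.0, 5.0]
good_split = variance_reduction(y, [1.0, 1.0], [5.0, 5.0])  # separates the two levels
poor_split = variance_reduction(y, [1.0, 5.0], [1.0, 5.0])  # each half is as spread out as y
print(good_split, poor_split)
```

    The tree keeps whichever candidate split gives the largest reduction, so the first split would be preferred over the second.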
    

    Implementation with sklearn

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    
    iris = load_iris()
    X = iris.data[:, 2:] # petal length and width
    y = iris.target
    
    tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
    tree_clf.fit(X, y)
    
    # Visualization: on the command line, run dot -Tpng iris_tree.dot -o iris_tree.png
    from sklearn.tree import export_graphviz
    export_graphviz(
            tree_clf,
            out_file="iris_tree.dot",
            feature_names=iris.feature_names[2:],
            class_names=iris.target_names,
            rounded=True,
            filled=True
        )
    
    # Regression
    # from sklearn.tree import DecisionTreeRegressor
    # tree_reg = DecisionTreeRegressor(max_depth=2)
    # tree_reg.fit(X, y)
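
    One difference from the hand-written classifier above is worth noting: `DecisionTreeClassifier` measures impurity with the Gini index by default, and passing `criterion="entropy"` switches it to the entropy-based criterion described earlier in this article. The sketch below compares the two measures on toy label lists using only the standard library; the `gini` and `entropy` functions are illustrative, not part of sklearn.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits, written as sum(p * log2(1 / p))."""
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


# From a pure node to an evenly mixed one: both measures start at zero
# and grow as the labels become more mixed.
for labels in (["a", "a", "a", "a"], ["a", "a", "a", "b"], ["a", "a", "b", "b"]):
    print(labels, round(gini(labels), 3), round(entropy(labels), 3))
```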
    

    References

    ApacheCN, Machine Learning in Action: https://github.com/apachecn/MachineLearning/blob/dev/docs/3.%E5%86%B3%E7%AD%96%E6%A0%91.md

    Code, ML-From-Scratch: https://github.com/eriklindernoren/ML-From-Scratch/blob/master/mlfromscratch/supervised_learning/decision_tree.py
