ID3算法

作者: 洛克黄瓜 | 来源:发表于2018-06-17 13:37 被阅读11次

决策树简记
决策树和随机森林
「数据分类」14决策树分类之CART算法
JS简单实现决策树(ID3算法)
决策树Decision Tree
ID3
c4.5
分类决策树算法
十大机器学习算法的优缺点
day10-决策树

ID3算法（决策树）

经常使用决策树处理分类问题
k-近邻算法最大缺点是无法给出数据的内在含义，决策树的主要优势就在于数据形式非常容易理解（当然，决策树缺点是容易产生过度匹配问题）
示例图：
[图片上传失败...(image-a484d9-1529213847436)]

决策树的构造

构造决策树时，需要知道的是，当前数据集上哪些特征在划分数据分类时起决定作用。这就得评估么个特征划分的作用
划分后，如果某个分支下的所有数据属于同一类型，那就得到分类，不用再分割这个分支了
有些决策树算法是用二分法划分，ID3算法则是根据特征数据（有必要的话需要先离散化）分类的枚举来划分，一个节点会有多个分支

信息增益

由构造过程，会思考选择特征划分的时候，该选先选哪个特征？

划分数据集的最大原则是：将无序的数据变得更加有序。获得信息增益最高的特征就是最好的选择
熵定义为信息的期望值，用来表明信息的有序程度。变量的不确定性越大，香农熵就越大。计算公式：

熵公式.png
信息增益其实就是熵的减少量
计算给定数据集的香农熵(数据dataSet的最后一列用来放分类)

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #the the number of unique elements and their occurance
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

划分数据集

按照给定特征划分数据集

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

选择最好的特征划分数据集（实现信息增益最大）

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

递归构建决策树

当划分分支下的数据集分类已经一致就可以终止该分支了，如果不一致就继续划分；不一致而又没有特征可以利用的情况下，使用多数表决：

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

构建决策树的函数代码

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

使用Matplotlib注解绘制树形图

上文的决策树是字典形式展现，不直观；画出树形图就好多了

书中源码如下（大致理解即可，以后可以拿来修改绘制自己的图）

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
    
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes   
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictonary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses 
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]

my_tree = retrieveTree(0)
createPlot(my_tree)

上述代码得到图如下：

树形图.png

使用决策树分类

拿到决策树后，用来分类的函数如下：

def classify(inputTree,featLabels,testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict): 
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

使用pickle模块存储决策树

import pickle
def storeTree(inputTree,filename):
    with open(filename, 'w') as f:
        pickle.dump(inputTree, f)

def grabTree(filename):
    with open(filename, 'r') as f:
        return pickle.load(f)

决策树简记
具有不同划分准则的算法决策树原理剖析及实现(ID3)理解决策树算法(实例详解)-ID3算法与C4.5算法 ID3（...
决策树和随机森林
随机森林和GBDT算法的基础是决策树而建立决策树的算法由很多，ID3，C4.5,CART等， ID3：ID3算法...
「数据分类」14决策树分类之CART算法
1.CART算法与ID3算法对比（1）CART算法解决了ID3算法的不足，既能用于分类问题，又能用于回归问题。 ...
JS简单实现决策树(ID3算法)
推荐阅读:ID3算法 wiki决策树算法及实现完整示例代码:JS简单实现决策树(ID3算法)_demo.html ...
决策树Decision Tree
决策树是一种解决分类问题的算法。常用的决策树算法有： ID3 算法 ID3 是最早提出的决策树算法，他...
ID3
基于信息增益(Information Gain)的ID3算法 ID3算法的核心是在各个结点上应用信息增益准则来进行...
c4.5
C4.5是机器学习算法中的另一个分类决策树算法，它是基于ID3算法进行改进后的一种重要算法，相比于ID3算法，改进...
分类决策树算法
C4.5是机器学习算法中的另一个分类决策树算法，它是基于ID3算法进行改进后的一种重要算法，相比于ID3算法，改进...
十大机器学习算法的优缺点
C4.5算法 C4.5算法的核心思想是ID3算法，是ID3算法的改进：用信息增益率来选择属性，克服了用信息增益来...
day10-决策树
今天学了决策树的基本知识。基于信息论的决策树算法有：ID3, CART, C4.5等算法。 ID3 算法是根...

ID3算法

ID3算法（决策树）

决策树的构造

信息增益

划分数据集

递归构建决策树

使用Matplotlib注解绘制树形图

使用决策树分类

相关文章

决策树简记

决策树和随机森林

「数据分类」14决策树分类之CART算法

JS简单实现决策树(ID3算法)

决策树Decision Tree

ID3

c4.5

分类决策树算法

十大机器学习算法的优缺点

day10-决策树

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

python自学