Decision Trees and Random Forests

Author: 北欧森林 | Published 2021-05-02 18:36

PART I Decision Tree

Decision tree basics

A decision tree stops growing when:
(I) all leaf nodes are pure, with an entropy of zero;
(II) no candidate split achieves the prespecified minimum change in purity;
(III) the number of observations in a leaf node reaches the prespecified minimum.
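
Criterion (I) can be made concrete with a small sketch. The helper `node_entropy` below is a hypothetical name, not part of the party package; it computes the Shannon entropy of the class labels in a node, and a pure node gives zero. Note that `ctree` selects splits with permutation tests rather than entropy, so this only illustrates the purity notion behind the stopping rule.

```r
# Hypothetical helper: Shannon entropy (in bits) of the class labels in one node
node_entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  p <- p[p > 0]                        # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

pure_node  <- c("high", "high", "high", "high")
mixed_node <- c("high", "low", "high", "low")

node_entropy(pure_node)   # 0: the node is pure, nothing is gained by splitting
node_entropy(mixed_node)  # 1: maximally impure for two classes
```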

1. Load and inspect the dataset
    data(airquality)
    str(airquality)
    
    # 'data.frame': 153 obs. of  6 variables:
    #  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
    #  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
    #  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
    #  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
    #  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
    #  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
    
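2. Impute missing values in Ozone
    set.seed(888)
    # Replace each missing Ozone value with a random draw (without replacement) from the observed values
    ozone_missing <- is.na(airquality$Ozone)
    airquality$Ozone[ozone_missing] <- sample(airquality$Ozone[!ozone_missing], sum(ozone_missing))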
    summary(airquality$Ozone)
    
    # Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    # 1.00   16.00   30.00   41.31   63.00  168.00 
    
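3. Fit the model
    library(party)
    # maxsurrogate = 3 allows up to three surrogate splits for observations with missing inputs
    airct <- ctree(Ozone ~ ., data = airquality, controls = ctree_control(maxsurrogate = 3))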
    airct
    
    # Conditional inference tree with 4 terminal nodes
    # 
    # Response:  Ozone 
    # Inputs:  Solar.R, Wind, Temp, Month, Day 
    # Number of observations:  153 
    # 
    # 1) Temp <= 76; criterion = 1, statistic = 47.479
    #   2)*  weights = 61 
    # 1) Temp > 76
    #   3) Wind <= 6.3; criterion = 1, statistic = 24.235
    #     4)*  weights = 21 
    #   3) Wind > 6.3
    #     5) Temp <= 84; criterion = 0.964, statistic = 7.182
    #       6)*  weights = 45 
    #     5) Temp > 84
    #       7)*  weights = 26 
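
The printed tree can be read as nested if/else rules. As a sketch, the hypothetical function `assign_node` below hard-codes the three split points from the output above and returns the terminal node ID for given Temp and Wind values. It ignores missing values, which the fitted tree handles through surrogate splits, so it is a reading aid rather than a replacement for `predict()`.

```r
# Hypothetical helper: terminal node ID implied by the printed splits
assign_node <- function(temp, wind) {
  ifelse(temp <= 76, 2,              # node 1: Temp <= 76  -> leaf 2
    ifelse(wind <= 6.3, 4,           # node 3: Wind <= 6.3 -> leaf 4
      ifelse(temp <= 84, 6, 7)))     # node 5: Temp <= 84  -> leaf 6, otherwise leaf 7
}

# Four example days, one falling into each leaf
assign_node(temp = c(70, 80, 80, 90), wind = c(10, 5, 10, 10))
# 2 4 6 7
```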
    
4. Visualize the result
    plot(airct)

Inspect each node in detail:

    # Draw a boxplot of Ozone at every inner node and hide the edge labels
    plot(airct, inner_panel = node_boxplot, edge_panel = function(...) invisible(), tnex = 1)

    # Inner (non-terminal) nodes of the tree printed above are 1, 3 and 5;
    # node 7 is terminal and has no split to plot.
    inner <- nodes(airct, c(1, 3, 5))
    layout(matrix(seq_along(inner), nrow = 1))  # one panel per inner node
    out <- sapply(inner, function(i) {
      splitstat <- i$psplit$splitstatistic
      # Plot only the entries with a positive split statistic
      x <- airquality[[i$psplit$variableName]][splitstat > 0]
      plot(x, splitstat[splitstat > 0],
           main = paste("Node", i$nodeID),
           xlab = i$psplit$variableName, ylab = "Statistic",
           ylim = c(0, 10), cex.axis = 1.2, cex.lab = 1.2, cex.main = 1.2)
      abline(v = i$psplit$splitpoint, lty = 4)  # selected split point
    })

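A continuous variable can be split at many candidate cutpoints. A test statistic measures how good each candidate split is: the larger the statistic, the better the split. In each panel, the vertical line marks the split point that was selected.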
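
The idea of scanning candidate cutpoints can be sketched in base R. The example below is a simplified illustration, not the statistic that `ctree` computes internally: it uses the absolute Welch t statistic as a stand-in, and the helper name `split_stat` and the minimum group size of 5 are assumptions made for this sketch. It works on a fresh copy of the data with missing Ozone rows dropped, so it does not depend on the imputation step.

```r
# Simplified stand-in for a split statistic; NOT what party::ctree uses internally
aq <- datasets::airquality
aq <- aq[!is.na(aq$Ozone), c("Ozone", "Temp")]

# Candidate cutpoints: every observed Temp value except the largest
cutpoints <- head(sort(unique(aq$Temp)), -1)

# Hypothetical helper: |Welch t| comparing Ozone on the two sides of a cutpoint
split_stat <- function(cut, min_n = 5) {
  left  <- aq$Ozone[aq$Temp <= cut]
  right <- aq$Ozone[aq$Temp >  cut]
  if (length(left) < min_n || length(right) < min_n) return(NA_real_)
  abs(unname(t.test(left, right)$statistic))
}

stats    <- sapply(cutpoints, split_stat)
best_cut <- cutpoints[which.max(stats)]  # which.max skips the NA entries

plot(cutpoints, stats, xlab = "Temp cutpoint", ylab = "Statistic")
abline(v = best_cut, lty = 4)
```

The cutpoint with the largest statistic separates the two groups most clearly, which is the same reading as in the panels described above.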

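5. Predict with the decision tree
    # Draw a random subset of roughly 30% of the rows (ind == 2)
    ind <- sample(2, nrow(airquality), replace = TRUE, prob = c(0.7, 0.3))
    newData <- airquality[ind == 2, ]
    # Note: airct was fitted on all 153 rows, so newData overlaps the training data
    # and this plot shows in-sample agreement rather than held-out performance
    newpred <- predict(airct, newdata = newData)
    plot(newpred, newData$Ozone, xlab = "Ozone value predicted by decision tree",
         ylab = "Observed ozone value")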
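
To gauge performance on rows the tree has not seen, the split has to happen before fitting. The following is a minimal sketch under two assumptions that differ from the steps above: rows with any missing value are dropped instead of imputed, and the object names (`aq`, `train`, `test`, `fit`) are made up for the example. It uses a separate copy of the data, so `airquality` and `airct` are left untouched.

```r
library(party)

# Assumption: complete cases only, unlike the imputation used earlier
aq <- na.omit(datasets::airquality)

set.seed(888)
in_test <- sample(2, nrow(aq), replace = TRUE, prob = c(0.7, 0.3)) == 2
train <- aq[!in_test, ]
test  <- aq[in_test, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- ctree(Ozone ~ ., data = train)
pred <- as.numeric(predict(fit, newdata = test))

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$Ozone - pred)^2))
rmse
```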
    
PART II Random Forest
1. Fit the random forest model
    aircf <- cforest(Ozone ~ ., data = airquality)
    aircf
    
    # Random Forest using Conditional Inference Trees
    # 
    # Number of trees:  500 
    # 
    # Response:  Ozone 
    # Inputs:  Solar.R, Wind, Temp, Month, Day 
    # Number of observations:  153 
    
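2. Evaluate predictive performance
    # newData is a subset of the rows used to fit aircf, so this is an in-sample comparison
    predforest <- predict(aircf, newdata = newData)
    plot(predforest, newData$Ozone, ylab = "Observed ozone value",
         xlab = "Predicted ozone value based on random forest")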
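
A forest also offers an estimate of out-of-sample error without a separate test set, through out-of-bag (OOB) predictions. The sketch below assumes complete cases only and uses 100 trees with `mtry = 2` to keep the run short; these settings and the object names are choices made for the example, not values from the analysis above.

```r
library(party)

# Assumption: complete cases only, and a smaller forest than the default 500 trees
aq <- na.omit(datasets::airquality)

set.seed(888)
forest <- cforest(Ozone ~ ., data = aq,
                  controls = cforest_unbiased(ntree = 100, mtry = 2))

# Each row is predicted only by trees whose subsample did not include it
oob_pred <- as.numeric(predict(forest, OOB = TRUE))

# Root mean squared error of the out-of-bag predictions
oob_rmse <- sqrt(mean((aq$Ozone - oob_pred)^2))
oob_rmse
```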
    
PART III Model-Based Recursive Partitioning
    # Variables before "|" (Temp, Day) enter the model fitted in each node;
    # variables after "|" (Solar.R, Wind, Month) are the partitioning variables.
    airmod <- mob(Ozone ~ Temp + Day | Solar.R + Wind + Month, data = airquality)
    plot(airmod)
    
    airmod
    # 1) Wind <= 6.3; criterion = 0.999, statistic = 25.078
    #   2)*  weights = 23 
    # Terminal node model
    # Gaussian GLM with coefficients:
    # (Intercept)         Temp          Day  
    #    -96.4060       2.0736      -0.1832  
    # 
    # 1) Wind > 6.3
    #   3)*  weights = 123 
    # Terminal node model
    # Gaussian GLM with coefficients:
    # (Intercept)         Temp          Day  
    #    -90.0534       1.5811       0.2236  
    # 
    
    coef(airmod)
    #   (Intercept)     Temp        Day
    # 2   -96.40597 2.073623 -0.1831996
    # 3   -90.05342 1.581143  0.2235920
    
    sctest(airmod,node = 1)
    #             Solar.R         Wind        Month
    # statistic 6.0347325 25.077906381 22.955024196
    # p.value   0.9661175  0.001153747  0.003099691
    # Wind has the smallest p-value, so node 1 is split on Wind
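
Each terminal node carries its own linear model, so a prediction first follows the Wind split and then applies that node's coefficients. The sketch below copies the coefficients and the 6.3 threshold from the output above; `node_coef` and `predict_ozone` are hypothetical names introduced for the illustration, and the input values are made up.

```r
# Coefficients copied from the coef(airmod) output (rows = terminal nodes 2 and 3)
node_coef <- rbind(
  "2" = c("(Intercept)" = -96.40597, Temp = 2.073623, Day = -0.1831996),
  "3" = c("(Intercept)" = -90.05342, Temp = 1.581143, Day =  0.2235920)
)

# Hypothetical helper: choose the node from the Wind split, then apply its linear model
predict_ozone <- function(wind, temp, day) {
  node <- ifelse(wind <= 6.3, "2", "3")
  node_coef[node, "(Intercept)"] +
    node_coef[node, "Temp"] * temp +
    node_coef[node, "Day"]  * day
}

predict_ozone(wind = 5,  temp = 80, day = 15)  # uses the node 2 model
predict_ozone(wind = 10, temp = 80, day = 15)  # uses the node 3 model
```

With the same Temp and Day, the low-wind day gets the higher predicted ozone, which reflects the steeper Temp slope in node 2.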
    

References

1. Prof. Zhongheng Zhang, DXY (丁香园) course: Decision Trees and Random Forests
2. Annals of Translational Medicine: Big-data Clinical Trial Column
3. Zhongheng Zhang. Decision tree modeling using R.
