PART I Decision Tree
Basics of decision trees
When does a decision tree stop growing:
(I) all leaf nodes are pure, with an entropy of zero;
(II) no candidate split can achieve a prespecified minimum improvement in purity;
(III) the number of observations in a leaf node reaches the pre-specified minimum.
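As a sketch, stopping rules (II) and (III) map onto arguments of `ctree_control()` in the party package (the values shown are the package defaults, listed here only for illustration; rule (I) needs no argument, since a pure node offers no further split):

```r
library(party)

# Growth of a conditional inference tree stops when no split passes these checks:
ctl <- ctree_control(
  mincriterion = 0.95, # (II) required test criterion (1 - p-value) before a node is split
  minsplit     = 20,   # minimum sum of weights in a node for a split to be attempted
  minbucket    = 7     # (III) minimum sum of weights in each terminal node
)
# Passed to the fit via: ctree(y ~ ., data = ..., controls = ctl)
```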
- Load and inspect the dataset
data(airquality)
str(airquality)
# 'data.frame': 153 obs. of 6 variables:
# $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
# $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
# $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
# $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
# $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
# $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
- Impute the missing values in the Ozone variable
set.seed(888)
airquality[is.na(airquality$Ozone), 1] <- sample(airquality[!is.na(airquality$Ozone), 1], 37) # impute by sampling from the non-missing values
summary(airquality$Ozone)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 16.00 30.00 41.31 63.00 168.00
- Fit the model
library(party)
airct <- ctree(Ozone ~ ., data = airquality, controls = ctree_control(maxsurrogate = 3))
airct
# Conditional inference tree with 4 terminal nodes
#
# Response: Ozone
# Inputs: Solar.R, Wind, Temp, Month, Day
# Number of observations: 153
#
# 1) Temp <= 76; criterion = 1, statistic = 47.479
# 2)* weights = 61
# 1) Temp > 76
# 3) Wind <= 6.3; criterion = 1, statistic = 24.235
# 4)* weights = 21
# 3) Wind > 6.3
# 5) Temp <= 84; criterion = 0.964, statistic = 7.182
# 6)* weights = 45
# 5) Temp > 84
# 7)* weights = 26
- Visualize the results
plot(airct)
Inspect each node in detail:
plot(airct, inner_panel = node_boxplot, edge_panel = function(...) invisible(), tnex = 1)
inner <- nodes(airct, c(1, 3, 5, 7))
layout(matrix(1:length(inner), ncol = length(inner)/2))
out <- sapply(inner, function(i) {
  splitstat <- i$psplit$splitstatistic
  x <- airquality[[i$psplit$variableName]][splitstat > 0]
  plot(x, splitstat[splitstat > 0], main = paste("Node", i$nodeID),
       xlab = i$psplit$variableName, ylab = "Statistic",
       ylim = c(0, 10), cex.axis = 1.2, cex.lab = 1.2, cex.main = 1.2)
  abline(v = i$psplit$splitpoint, lty = 4)
})
A continuous variable offers many candidate split points. A test statistic measures how good each candidate split is: the larger the statistic, the better the split.
- Prediction with the decision tree
ind <- sample(2, nrow(airquality), replace = TRUE, prob = c(0.7, 0.3))
newData <- airquality[ind==2,]
newpred <- predict(airct, newdata= newData)
plot(newpred, newData$Ozone, xlab = "Ozone value predicted by decision tree",
     ylab = "Observed ozone value")
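Beyond the scatter plot, predictive accuracy can be quantified numerically. A minimal sketch using the objects created above (these metrics are not part of the original analysis and are added for illustration):

```r
# Root mean squared error and correlation between predicted and observed Ozone
rmse <- sqrt(mean((as.numeric(newpred) - newData$Ozone)^2))
r    <- cor(as.numeric(newpred), newData$Ozone)
c(RMSE = rmse, correlation = r)
```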
PART II Random Forest
- Fit a random forest model
aircf <- cforest(Ozone ~ ., data = airquality)
aircf
# Random Forest using Conditional Inference Trees
#
# Number of trees: 500
#
# Response: Ozone
# Inputs: Solar.R, Wind, Temp, Month, Day
# Number of observations: 153
- Evaluate predictive performance
predforest <- predict(aircf, newdata = newData)
plot(predforest, newData$Ozone, ylab = "Observed ozone value",
     xlab = "Predicted ozone value based on random forest")
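The same error metric allows a head-to-head comparison of the single tree and the forest on the same rows. A sketch (keep in mind that both models were fit on all 153 observations, so newData is not a strict hold-out set):

```r
# Compare the two models on the rows in newData
rmse <- function(pred, obs) sqrt(mean((as.numeric(pred) - obs)^2))
rmse(newpred,    newData$Ozone) # single conditional inference tree
rmse(predforest, newData$Ozone) # conditional random forest
```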
PART III Model-Based Recursive Partitioning
airmod <- mob(Ozone ~ Temp + Day | Solar.R + Wind + Month, data = airquality)
# variables after the "|" symbol are the partitioning variables
plot(airmod)
airmod
# 1) Wind <= 6.3; criterion = 0.999, statistic = 25.078
# 2)* weights = 23
# Terminal node model
# Gaussian GLM with coefficients:
# (Intercept) Temp Day
# -96.4060 2.0736 -0.1832
#
# 1) Wind > 6.3
# 3)* weights = 123
# Terminal node model
# Gaussian GLM with coefficients:
# (Intercept) Temp Day
# -90.0534 1.5811 0.2236
#
coef(airmod)
# (Intercept) Temp Day
# 2 -96.40597 2.073623 -0.1831996
# 3 -90.05342 1.581143 0.2235920
sctest(airmod, node = 1)
# Solar.R Wind Month
# statistic 6.0347325 25.077906381 22.955024196
# p.value 0.9661175 0.001153747 0.003099691
References
- Prof. Zhang Zhongheng's DXY (丁香园) course: Decision Trees and Random Forests
- Annals of Translational Medicine: Big-data Clinical Trial Column
- Zhongheng Zhang. Decision tree modeling using R. Annals of Translational Medicine.