It should be week four by now. These two weeks I'm writing up continuous-value handling, mainly looking at k-means.
Some additions on C4.5 (issues found while polishing it).
On handling continuous values:
https://stackoverflow.com/questions/15629398/how-does-the-c4-5-algorithm-handle-continuous-data
In short: sort the values, take the midpoints between adjacent ones, compute the information gain ratio for each, and take the one with the maximum as the threshold.
Candidate split points that cannot improve purity should be discarded.
(e.g. if TEMPERATURE = 72 corresponds to yes and the next value 75 is also yes, then the midpoint of [72, 75] need not be considered as a split point.)
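The rule above (sort, take midpoints, skip splits that cannot improve purity) can be sketched as follows. This is my own minimal sketch, not code from C4.5 or the linked answer, and `candidate_splits` is a name I made up:

```python
# Generate candidate split points for a continuous attribute:
# sort samples by value, then keep only midpoints between adjacent
# samples whose class labels differ -- a midpoint between two
# same-class neighbors cannot improve purity, so it is skipped.
def candidate_splits(values, labels):
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

# TEMPERATURE example from the text: 72 -> yes and 75 -> yes,
# so the midpoint 73.5 is skipped.
print(candidate_splits([64, 72, 75, 80], ["no", "yes", "yes", "no"]))
# -> [68.0, 77.5]
```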
To save the trouble of reaching the link later, the original answer is copied below.
For continuous data C4.5 uses a threshold value where everything less than the threshold is in the left node, and everything greater than the threshold goes in the right node. The question is how to create that threshold value from the data you're given. The trick there is to sort your data by the continuous variable in ascending order. Then iterate over the data picking a threshold between data members. For example if your data for attribute x is:
0.5, 1.2, 3.4, 5.4, 6.0
You first pick a threshold between 0.5 and 1.2. In this case we can just use the average: 0.85. Now compute your impurity:
H(x < 0.85) = H(s) - l/N * H(x<0.85) - r/N * H(x>0.85).
Where l is the number of samples in the left node, r is the number of samples in the right node, and N is the total number of samples in the node being split. In our example above with x>0.85 as our split then l=1, r=4, and N=5.
Remember the computed impurity difference, and now compute it for the split between 2 and 3 (ie x>2.3). Repeat that for every split (ie n-1 splits). Then pick the split that minimized H the most. That means your split should be more pure than not splitting. If you can't increase the purity for the resulting nodes then don't split it. You can also have a minimum node size so you don't end up with the left or right nodes containing only one sample in them.
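The threshold search the quoted answer describes can be sketched like this. It is a minimal Python version assuming plain entropy-based information gain, i.e. the quote's formula read as the gain of a split, ΔH = H(S) − (l/N)·H(left) − (r/N)·H(right); note that real C4.5 uses gain ratio rather than plain gain, and all names here are my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Try the midpoint between every pair of adjacent sorted values
    # and return the threshold with the largest information gain.
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    base = entropy(ys)
    best_t, best_gain = None, 0.0
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue  # identical values cannot be separated
        t = (xs[i] + xs[i + 1]) / 2
        left, right = ys[:i + 1], ys[i + 1:]
        gain = (base
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

On the answer's example data, with the two smallest values labeled differently from the rest, the split at x > 2.3 (the midpoint of 1.2 and 3.4) separates the classes perfectly, so it wins:

```python
print(best_threshold([0.5, 1.2, 3.4, 5.4, 6.0],
                     ["no", "no", "yes", "yes", "yes"]))
```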
https://discuss.analyticsvidhya.com/t/decision-tree-with-continuous-variables/201/6
Is this saying that, in the end, the threshold should fall on a value that actually occurs in the data rather than on a midpoint...?
You can see that if there are N possible values, we would have to consider N-1 possible splits.
And note that we do not choose the mid-point between values as the splitting threshold. (We won’t consider Age <=10.5 as 10.5 never appears in the training set)
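A tiny illustration of that convention (my own sketch, not from the thread): with N distinct observed values there are N−1 candidate splits of the form x <= v, each thresholded at a value that actually appears in the training set.

```python
# Candidate thresholds under the "observed values only" convention:
# every distinct value except the largest, used as "x <= v" splits.
def observed_thresholds(values):
    xs = sorted(set(values))
    return xs[:-1]  # N distinct values -> N-1 candidate splits

print(observed_thresholds([10, 15, 18, 20]))  # -> [10, 15, 18]
```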
(from a commenter's reply in that thread)