数据挖掘实验报告

实验要求：

对样本进行二分类，获取分类概率

采用方法：

随机森林

方法简介

随机森林就是通过集成学习的思想将多棵树集成的一种算法，它的基本单元是决策树，而它的本质属于机器学习的一大分支——集成学习（Ensemble Learning）方法。随机森林方法要求我们训练n棵决策树，
其中每棵决策树仅随机选取一部分样本和一部分特征进行训练。对于分类问题，每棵决策树都是一个分类器，那么对于一个输入样本，N棵树会有N个分类结果。随机森林集成了所有的分类投票结果，将投票次数最多的类别指定为最终的输出。

决策树实现

采用贪心的策略，获取每轮迭代的局部最优解

获取最佳分割
遍历当前节点的训练样本的每一个特征f：
对当前样本的f特征列进行排序
根据排序结果，遍历特征分割点t
计算根据f特征，t分割点的impurity(可采用gini index 或者信息增益）
获得min_impurity，min_impurity对应的分割就是当前最佳分割

for feature_i in range(n_features):
    # All values of feature_i
    feature_values = X[:, feature_i]

    # 预排序
    feature_values = sorted(enumerate(feature_values), key=lambda x:x[1])

    # 树的左儿子的类别统计
    X1typeNum = np.zeros(len(typeNum))
    # 树的右儿子的类别统计
    X2typeNum = copy.deepcopy(typeNum)

    feature_values_i = 0

    while feature_values_i < n_samples:
        threshold = feature_values[feature_values_i][1]
        while feature_values_i < n_samples and feature_values[feature_values_i][1] == threshold:
            X1typeNum[y[feature_values[feature_values_i][0]][0]] += 1
            X2typeNum[y[feature_values[feature_values_i][0]][0]] -= 1
            feature_values_i += 1

        X1gini = self._impurity_calculation(X1typeNum)
        X2gini = self._impurity_calculation(X2typeNum)
        Rgini = (feature_values_i/n_samples)*X1gini+(1-feature_values_i/n_samples)*X2gini


        if Rgini < minRgini:
            minRgini = Rgini
            best_criteria = {"feature_i": feature_i, "threshold": threshold}
            best_feature_values_i = feature_values_i


    if best_criteria["feature_i"] == feature_i:
        X1array = [e[0] for e in feature_values[:best_feature_values_i]]
        X1array = sorted(X1array)
        # print("X1araay:",X1array)
        X1 = X[X1array,:]
        y1 = y[X1array,:]
        X2 = np.delete(X,X1array,axis=0)
        y2 = np.delete(y,X1array,axis=0)
        best_sets = {
            "leftX": X1,  # X of left subtree
            "lefty": y1,  # y of left subtree
            "rightX": X2,  # X of right subtree
            "righty": y2  # y of right subtree
        }

2.停止条件

树深达到设定的最大深度
分割无法获得更小的impurity
节点已不可分割

3.节点构建
叶子节点：
value: 分类lable

非叶子节点：
feature_i: 选择的分割特征
threshold：对应的分割值
true_branch：小于等于分割值，走这个分支
false_branch：大于分割值，走这个分支

class DecisionNode():
    """Class that represents a decision node or leaf in the decision tree

    Parameters:
    -----------
    feature_i: int
        Feature index which we want to use as the threshold measure.
    threshold: float
        The value that we will compare feature values at feature_i against to
        determine the prediction.
    value: 
        The class prediction if classification tree, or float value if regression tree.
    true_branch: DecisionNode
        Next decision node for samples where features value met the threshold.
    false_branch: DecisionNode
        Next decision node for samples where features value did not meet the threshold.
    """

    def __init__(self, feature_i=None, threshold=None,
                 value=None, true_branch=None, false_branch=None):
        self.feature_i = feature_i  # Index for the feature that is tested
        self.threshold = threshold  # Threshold value for feature
        self.value = value  # Value if the node is a leaf in the tree
        self.true_branch = true_branch  # 'Left' subtree
        self.false_branch = false_branch  # 'Right' subtree

并行化的实现

采用python多进程的方法
（由于python多线程只能使用一个核，对CPU密集型的程序，才用多进程更为合适）

随机森林的并行比较简单，要建100棵树的森林，仅需将任务平均分配给各个进程即可

使用Multiprocessing.Pool

    pool = Pool(processes=4)
    result = []

    processesNum = 4

    n_estimators = 100

    for i in range(processesNum):
        msg = "hello %d" %(i)
        # 异步进程
        result.append(pool.apply_async(doRF,(X_train,y_train,
                X_test,int(n_estimators/processesNum), )))
    pool.close()
    pool.join()

    a = []
    for res in result:
        a.append(res.get())

对比

训练样本数	n_estimators	无并行耗时	4进程耗时	2进程耗时
20000	100	80.5810	43.1366	58.0753
100000	100	426.3829	247.0801	285.9107

显然在速度上有很大提升

Cache友好

获取最佳分割时，先对特征的分割值进行预排序，遍历排序后的分割值，计算基尼指数或熵时，可以不用每次都去遍历整个样本，极大的减少了读取内存的开销。

举例：
对于这样一个样本

feature	label
1	0
6	1
0	1
5	0

没有进行预排序时，我们遍历分割值t = 1，6，0，5，每次都需要扫描整个样本，判断每一行的feature<=t 的结果，再查看对应label，一共需要遍历4次样本

预排后，我们遍历t=0，1，5，6，知道对应index=2，0，3，1

 feature_values = sorted(enumerate(feature_values), key=lambda x:x[1])

先假设左子树节点=NULL, 右子树节点=当前节点，知道了现在两边的label分布情况 left: 0:0, 1:0 right: 0:2, 1:2
t = 0: 查询index=2的feature 发现=t，将index=2的节点划分给左子树，记录其label信息 left: 0:0, 1:1 right: 0:2, 1:1
查询index=0的feature 发现>t,over, t=0分割完毕通过label分布计算基尼指数
t = 1: 查询index = 0的feature 发现 = t，重复上述操作
...
如此我们只需遍历一次样本即可对一个特征的所有分割值进行处理，找出最佳分割值