2020 Machine Learning: Ensemble Learning (Part 1)

Author: zidea | Published 2020-01-06 20:09



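Recently, at both the library and the shopping mall, I came across a guide robot. It gives suggestions based on what you need and can even lead you to where you want to go, saving you the time of finding it yourself. Seeing this rekindled my enthusiasm for studying computer vision and machine learning.
For now people mostly find these robots a novelty and treat them as toys, but I think that before long we will be as used to them as we are to the phones we cannot do without today.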

Ensemble Learning

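When we face a decision that is hard to make, we often talk it over with a few friends, listen to their opinions, and only then decide. That is the basic idea behind ensemble learning. The basic approach is to build a large number of models, each trained on the training set, and to combine their outputs into a final decision.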
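To make the "ask several friends" idea concrete, here is a minimal sketch of majority voting in plain Python. The vote table is made up for illustration, and `majority_vote` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from five models for three samples (one row per sample).
votes_per_sample = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
]

# The ensemble label for each sample is whatever most of the models said.
ensemble_labels = [majority_vote(votes) for votes in votes_per_sample]
print(ensemble_labels)
```

An odd number of voters keeps a two-class vote free of ties, which is one reason ensembles are often built with an odd number of members.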

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split

def get_data():
    no_features = 30
    redundant_features = int(0.1*no_features)
    informative_features = int(0.6*no_features)
    repeated_features = int(0.1*no_features)
    print(no_features,redundant_features,informative_features,repeated_features)
    x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03,
                             n_informative=informative_features,n_redundant=redundant_features,
                             n_repeated=repeated_features,random_state=7)
    return x,y

# Build a single KNN model
def build_single_model(x,y):
    model = KNeighborsClassifier()
    model.fit(x,y)
    return model
# Bagging method
def build_bagging_model(x,y):
    bagging = BaggingClassifier(KNeighborsClassifier(),n_estimators=100,random_state=9,
                                max_samples=1.0,max_features=0.7,bootstrap=True,
                                bootstrap_features=True)
    bagging.fit(x,y)
    return bagging

def view_model(model):
    print("\n Sampled attributes in top 10 estimators\n")
    for i,feature_set in enumerate(model.estimators_features_[0:10]):
        print("estimator %d" % (i+1), feature_set)

x,y = get_data()
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size=0.3,random_state=9)

30 3 18 3

x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,
                                             test_size=0.3,random_state=9)

model = build_single_model(x_train,y_train)
predicted_y = model.predict(x_train)
print("\n Single Model Accuracy on training data \n")
print(classification_report(y_train,predicted_y))

 Single Model Accuracy on training data 

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       181
           1       0.87      0.88      0.87       169

   micro avg       0.87      0.87      0.87       350
   macro avg       0.87      0.87      0.87       350
weighted avg       0.87      0.87      0.87       350

bagging = build_bagging_model(x_train,y_train)
predicted_y = bagging.predict(x_train)
print("\n Bagging Model Accuracy on training data\n")
print(classification_report(y_train,predicted_y))
view_model(bagging)

 Bagging Model Accuracy on training data

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       181
           1       0.96      0.92      0.94       169

   micro avg       0.95      0.95      0.95       350
   macro avg       0.95      0.94      0.95       350
weighted avg       0.95      0.95      0.95       350


 Sampled attributes in top 10 estimators

estimator 1 [25 20 10  6 17 18 11 17  9 14  3 10 10 23 22 18 17 11 21 20  1]
estimator 2 [14  3 27 28 20 20 27 25  0 21  1 12 20 21 29  1  0 28 16  4  9]
estimator 3 [29  5 23 19  2 16 21  4 13 27  1 15 24  5 14  1  4 25 22 26 29]
estimator 4 [23 10 16  7 22 11  0 14 14 17  8 17 27 12 13 23  8  7 27  0 27]
estimator 5 [ 3  0 26 13 23  7 27 15 18 11 26 18 26  3 22  6 11 21  6 12 19]
estimator 6 [16  5 24 19 21  2  2 22 12 21 14 28  5 29  9 19 24 14 21  8 11]
estimator 7 [ 7 23  2 17 22  2 12 14 25  5  7 10 25  5 17 16  9  0  9  9 15]
estimator 8 [16 10  7  8  8 18  6  3 12 29 13 17 20  9  2 25  6 28 15  0 16]
estimator 9 [22 29  2  5  6 11 18  4 19 27 17 28 20 15 21 26 14  5 28 15 21]
estimator 10 [29 22 17 10 16 10 27  8  2 18 26  1  3  2  1 17  2 12 10 22 26]
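Each estimator above lists 21 feature indices (max_features=0.7 of the 30 features), and some indices repeat because bootstrap_features=True draws features with replacement. The sketch below imitates that kind of draw with the standard library only. It is not scikit-learn's actual sampler, and the seed and the `sample_feature_indices` helper are my own assumptions for illustration.

```python
import random

def sample_feature_indices(n_features, n_sampled, rng):
    """Draw n_sampled feature indices with replacement, so repeats are possible."""
    return [rng.randrange(n_features) for _ in range(n_sampled)]

rng = random.Random(9)  # arbitrary seed chosen for this illustration
indices = sample_feature_indices(30, 21, rng)

print(indices)
# Fewer distinct values than draws means some features were picked more than once.
print("distinct features:", len(set(indices)))
```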

predicted_y = model.predict(x_dev)
print("\n Single Model Accuracy on Dev Data \n")
print(classification_report(y_dev,predicted_y))

 Single Model Accuracy on Dev Data 

              precision    recall  f1-score   support

           0       0.83      0.84      0.83        51
           1       0.85      0.83      0.84        54

   micro avg       0.84      0.84      0.84       105
   macro avg       0.84      0.84      0.84       105
weighted avg       0.84      0.84      0.84       105

predicted_y = bagging.predict(x_dev)
print("\n Bagging Model Accuracy on Dev Data \n")
print(classification_report(y_dev,predicted_y))

 Bagging Model Accuracy on Dev Data 

              precision    recall  f1-score   support

           0       0.83      0.88      0.86        51
           1       0.88      0.83      0.86        54

   micro avg       0.86      0.86      0.86       105
   macro avg       0.86      0.86      0.86       105
weighted avg       0.86      0.86      0.86       105

AdaBoost

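Boosting is a powerful ensemble technique that is widely used in data science.
$X = \{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}, \quad x_i \in \mathbb{R}^n$

$y_i \in \{0,+1\}$
I won't belabor the data here; by now you should recognize our samples at a glance. The error rate of a classifier $F$ is the fraction of instances it misclassifies, where $\mathbf{1}[\cdot]$ is 1 when the condition holds and 0 otherwise:
$\text{error rate} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\left[y_i \neq F(x_i)\right]$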
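As a quick check of the error rate formula, here is a small sketch in plain Python. The labels are made up, and `error_rate` is a hypothetical helper; it computes the same fraction that zero_one_loss reports later in this post.

```python
def error_rate(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true label."""
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

# Made-up labels: the predictions differ from the truth at positions 2 and 5.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(error_rate(y_true, y_pred))  # 2 mistakes out of 8 instances
```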

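Suppose we build a weak classifier, one whose error rate is only slightly better than random guessing. Boosting builds a sequence of such weak classifiers, each trained on a slightly modified version of the data set, ending with the M-th classifier:

$F_1(X), F_2(X), \dots, F_M(X)$

$F_{final}(X) = \operatorname{sign}\left( \sum_{m=1}^M \alpha_m F_m(X) \right)$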
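The weighted vote can be sketched for a single sample as follows. I am assuming here that every weak classifier outputs -1 or +1, so that the sign is meaningful; labels written as 0 and +1 would first need recoding. The alpha values are invented for illustration.

```python
def sign(value):
    """Return +1 for a positive score and -1 otherwise (a zero score maps to -1 here)."""
    return 1 if value > 0 else -1

def final_prediction(alphas, weak_predictions):
    """Weighted vote for one sample: the sign of sum_m alpha_m * F_m(x)."""
    score = sum(a * f for a, f in zip(alphas, weak_predictions))
    return sign(score)

# Invented example: one confident classifier says -1, two weaker ones say +1.
alphas = [0.9, 0.4, 0.3]
weak_predictions = [-1, 1, 1]

print(final_prediction(alphas, weak_predictions))
```

Because the first classifier carries more weight than the other two combined, it wins the vote even though it is outnumbered. That is the difference from a plain majority vote.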

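The differences from Bagging are the weights alpha and the sequential modeling: AdaBoost builds its classifiers one after another, and each one is given a slightly modified data set.

For the first classifier, m = 1, we start by setting the weight of every instance to 1/N. With 100 samples, for example, each sample gets a weight of 0.01. Writing w for the weights, we now have N of them (100 in this example):
$w_1,w_2,\dots, w_N$
At this point every sample has an equal chance of being selected by the classifier. We build a classifier and test it on the training set to get its misclassification rate. With the weights included, the misclassification rate formula mentioned earlier becomes:

$\text{error rate}_1 = \frac{\sum_{i=1}^N w_i \times \operatorname{abs}(y_i - \hat{y}_i)}{\sum_{i=1}^N w_i}$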
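Here is a small sketch of the weighted error rate, assuming labels coded as 0/1 so that abs(y_i - ŷ_i) is 1 for a mistake and 0 otherwise. The data and the second weight vector are made up, and `weighted_error_rate` is a hypothetical helper.

```python
def weighted_error_rate(weights, y_true, y_pred):
    """sum_i w_i * abs(y_i - yhat_i) / sum_i w_i, for labels coded as 0/1."""
    weighted_mistakes = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    return weighted_mistakes / sum(weights)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]  # positions 1 and 3 are misclassified

# With uniform weights 1/N this is just the plain misclassification rate.
n = len(y_true)
uniform = [1.0 / n] * n
print(weighted_error_rate(uniform, y_true, y_pred))

# If the two misclassified samples carry more weight, the same mistakes cost more.
skewed = [0.1, 0.3, 0.1, 0.4, 0.1]
print(weighted_error_rate(skewed, y_true, y_pred))
```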

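Here abs denotes the absolute value. From the error rate, alpha is computed with the following formula:

$\alpha_1 = 0.5 \times \log \left( \frac{1 - \text{error rate}_1 + \epsilon}{\text{error rate}_1 + \epsilon} \right)$

Here $\epsilon$ is a very small number; adding it to both the numerator and the denominator keeps the logarithm defined when the error rate is exactly 0 or 1. With alpha in hand, the weights are updated:

$w_i \leftarrow w_i \times \exp\left(\alpha_1 \times \operatorname{abs}(y_i - \hat{y}_i)\right)$

As you can see, the weights of the misclassified instances go up, which raises the probability that those records are selected by the next classifier. Each subsequent classifier in the sequence favors the samples with larger weights, so it pays more attention to the instances that the previous classifier misclassified.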
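One round of this bookkeeping can be sketched in plain Python. Two things here are my own assumptions rather than something stated above: $\epsilon$ is added (not subtracted) in the denominator so the ratio stays positive at zero error, and the weights are renormalized to sum to 1 after the update. The helper names and the data are hypothetical.

```python
import math

def classifier_alpha(error_rate, epsilon=1e-10):
    """0.5 * log((1 - err + eps) / (err + eps)); eps guards err = 0 and err = 1."""
    return 0.5 * math.log((1.0 - error_rate + epsilon) / (error_rate + epsilon))

def update_weights(weights, y_true, y_pred, alpha):
    """Scale each weight by exp(alpha * abs(y_i - yhat_i)), then renormalize."""
    raised = [w * math.exp(alpha * abs(t - p))
              for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raised)
    return [w / total for w in raised]  # renormalizing is an added assumption

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]  # positions 1 and 3 are misclassified
weights = [1.0 / len(y_true)] * len(y_true)

# Weighted error of this round, using the 0/1 label coding.
error = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred)) / sum(weights)
alpha = classifier_alpha(error)
new_weights = update_weights(weights, y_true, y_pred, alpha)

print(round(error, 2), round(alpha, 4))
print([round(w, 4) for w in new_weights])
```

An error rate of 0.5 gives an alpha of about zero, so a classifier that is no better than a coin flip gets no say in the final vote and leaves the weights unchanged.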

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import zero_one_loss
import numpy as np
import matplotlib.pyplot as plt
import itertools

def build_single_tree_model(x,y):
    model = DecisionTreeClassifier()
    model.fit(x,y)
    return model

def build_boosting_model(x,y,no_estimators=20):
    boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1,min_samples_leaf=1
                                                        ),random_state=9,
                                 n_estimators=no_estimators,algorithm="SAMME")
    boosting.fit(x,y)
    return boosting

def view_model(model):
    print("\n Estimator Weights and Error \n")
    for i,weight in enumerate(model.estimator_weights_):
        # The fitted attribute is estimator_errors_ (plural); estimator_error_ raises AttributeError
        print("estimator %d weight = %0.4f error = %0.4f"%(i+1,weight,model.estimator_errors_[i]))
    plt.figure(1)
    plt.title("Model weight vs error")
    plt.xlabel("Weight")
    plt.ylabel("Error")
    plt.plot(model.estimator_weights_,model.estimator_errors_)

def number_estimators_vs_error(x,y,x_dev,y_dev):
    no_estimators = range(20,120,10)
    misclassy_rate = []
    misclassy_rate_dev = []
    
    for no_estimator in no_estimators:
        boosting = build_boosting_model(x,y,no_estimators=no_estimator)
        predicted_y = boosting.predict(x)
        predicted_y_dev = boosting.predict(x_dev)
        misclassy_rate.append(zero_one_loss(y,predicted_y))
        misclassy_rate_dev.append(zero_one_loss(y_dev,predicted_y_dev))
        
    # Plot once, after the loop, when both lists hold one entry per ensemble size
    plt.figure(2)
    plt.title("No estimators vs Mis-classification rate")
    plt.xlabel("No of estimators")
    plt.ylabel("Mis-classification rate")
    plt.plot(no_estimators,misclassy_rate,label='Train')
    plt.plot(no_estimators,misclassy_rate_dev,label='Dev')
    plt.legend()  # show the Train/Dev labels
    plt.show()

x,y = get_data()
# plot_data(x,y)
x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size=0.3,random_state=9)
x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9)

30 3 18 3

model = build_single_tree_model(x_train,y_train)
predicted_y = model.predict(x_train)
print("\n Single Model Accuracy on training data \n")
print(classification_report(y_train,predicted_y))
print("Fraction of misclassification = %0.2f"%(zero_one_loss(y_train,predicted_y)*100),"%")

 Single Model Accuracy on training data 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       181
           1       1.00      1.00      1.00       169

   micro avg       1.00      1.00      1.00       350
   macro avg       1.00      1.00      1.00       350
weighted avg       1.00      1.00      1.00       350

Fraction of misclassification = 0.00 %

boosting = build_boosting_model(x_train,y_train,no_estimators=85)
# Predict with the boosted model. The original run called model.predict (the single tree) here,
# so the report shown below repeats the single-tree numbers rather than the boosted model's.
predicted_y = boosting.predict(x_train)
print("\n Boosting Model Accuracy on training data \n")
print(classification_report(y_train,predicted_y))
print("Fraction of misclassification = %0.2f"%(zero_one_loss(y_train,predicted_y)*100),"%")

 Boosting Model Accuracy on training data 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       181
           1       1.00      1.00      1.00       169

   micro avg       1.00      1.00      1.00       350
   macro avg       1.00      1.00      1.00       350
weighted avg       1.00      1.00      1.00       350

Fraction of misclassification = 0.00 %

view_model(boosting)

 Estimator Weights and Error 

AttributeError: 'AdaBoostClassifier' object has no attribute 'estimator_error_'

The original run stopped here because view_model read model.estimator_error_. The fitted attribute is estimator_errors_ (plural), so the loop has to index model.estimator_errors_[i] instead.
