Recently, at both the library and the shopping mall, I came across robots that guide visitors: they make suggestions based on your specific needs and can even lead you to where you want to go, saving you the time of finding it yourself. Seeing this has once again fired up my enthusiasm for studying computer vision and machine learning.
For now most people find these robots a novelty and treat them as toys, but I suspect that before long their presence will feel as natural as the phones we can no longer live without.
Ensemble Learning
When we face a decision that is hard to make, we often like to talk it over with a few friends, hear their opinions, and only then decide. That is the basic idea behind ensemble learning: build a large number of models, each trained on some variation of the training set, and combine their predictions, for example by majority vote.
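A minimal sketch of that voting idea (clf1, clf2, clf3 here are hypothetical stand-ins for any already-fitted scikit-learn classifiers with non-negative integer class labels):

import numpy as np

def majority_vote(classifiers, x):
    # Each row holds one model's predictions: shape (n_models, n_samples)
    votes = np.array([clf.predict(x) for clf in classifiers])
    # For every sample (column), pick the most frequent predicted label
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

# e.g. predictions = majority_vote([clf1, clf2, clf3], x_test)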
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def get_data():
    # 30 features: 60% informative, 10% redundant, 10% repeated, with 3% label noise
    no_features = 30
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    print(no_features, redundant_features, informative_features, repeated_features)
    x, y = make_classification(n_samples=500, n_features=no_features, flip_y=0.03,
                               n_informative=informative_features, n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y

# Build a single KNN model
def build_single_model(x, y):
    model = KNeighborsClassifier()
    model.fit(x, y)
    return model

# Bagging: 100 KNN estimators, each fit on a bootstrap sample
# that uses a random 70% of the features (sampled with replacement)
def build_bagging_model(x, y):
    bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=100, random_state=9,
                                max_samples=1.0, max_features=0.7, bootstrap=True,
                                bootstrap_features=True)
    bagging.fit(x, y)
    return bagging

def view_model(model):
    # bootstrap_features=True draws feature indices with replacement,
    # which is why an index can appear more than once per estimator
    print("\n Sampled attributes in top 10 estimators\n")
    for i, feature_set in enumerate(model.estimators_features_[0:10]):
        print("estimator %d" % (i + 1), feature_set)
x, y = get_data()
30 3 18 3
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)
x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all,
                                                test_size=0.3, random_state=9)
model = build_single_model(x_train,y_train)
predicted_y = model.predict(x_train)
print("\n Single Model Accuracy on training data \n")
print(classification_report(y_train,predicted_y))
Single Model Accuracy on training data
precision recall f1-score support
0 0.88 0.87 0.88 181
1 0.87 0.88 0.87 169
micro avg 0.87 0.87 0.87 350
macro avg 0.87 0.87 0.87 350
weighted avg 0.87 0.87 0.87 350
bagging = build_bagging_model(x_train, y_train)
predicted_y = bagging.predict(x_train)
print("\n Bagging Model Accuracy on training data\n")
print(classification_report(y_train, predicted_y))
view_model(bagging)
Bagging Model Accuracy on training data
precision recall f1-score support
0 0.93 0.97 0.95 181
1 0.96 0.92 0.94 169
micro avg 0.95 0.95 0.95 350
macro avg 0.95 0.94 0.95 350
weighted avg 0.95 0.95 0.95 350
Sampled attributes in top 10 estimators
estimator 1 [25 20 10 6 17 18 11 17 9 14 3 10 10 23 22 18 17 11 21 20 1]
estimator 2 [14 3 27 28 20 20 27 25 0 21 1 12 20 21 29 1 0 28 16 4 9]
estimator 3 [29 5 23 19 2 16 21 4 13 27 1 15 24 5 14 1 4 25 22 26 29]
estimator 4 [23 10 16 7 22 11 0 14 14 17 8 17 27 12 13 23 8 7 27 0 27]
estimator 5 [ 3 0 26 13 23 7 27 15 18 11 26 18 26 3 22 6 11 21 6 12 19]
estimator 6 [16 5 24 19 21 2 2 22 12 21 14 28 5 29 9 19 24 14 21 8 11]
estimator 7 [ 7 23 2 17 22 2 12 14 25 5 7 10 25 5 17 16 9 0 9 9 15]
estimator 8 [16 10 7 8 8 18 6 3 12 29 13 17 20 9 2 25 6 28 15 0 16]
estimator 9 [22 29 2 5 6 11 18 4 19 27 17 28 20 15 21 26 14 5 28 15 21]
estimator 10 [29 22 17 10 16 10 27 8 2 18 26 1 3 2 1 17 2 12 10 22 26]
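Each of those index lists has exactly int(0.7 * 30) = 21 entries, and because bootstrap_features=True they are drawn with replacement, which is why indices repeat within a list. A quick check against the fitted bagging object from above:

print(len(bagging.estimators_features_[0]))   # 21 = int(max_features * n_features)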
predicted_y = model.predict(x_dev)
print("\n Single Model Accuracy on Dev Data \n")
print(classification_report(y_dev,predicted_y))
Single Model Accuracy on Dev Data
precision recall f1-score support
0 0.83 0.84 0.83 51
1 0.85 0.83 0.84 54
micro avg 0.84 0.84 0.84 105
macro avg 0.84 0.84 0.84 105
weighted avg 0.84 0.84 0.84 105
predicted_y = bagging.predict(x_dev)
print("\n Bagging Model Accuracy on Dev Data \n")
print(classification_report(y_dev,predicted_y))
Bagging Model Accuracy on Dev Data
precision recall f1-score support
0 0.83 0.88 0.86 51
1 0.88 0.83 0.86 54
micro avg 0.86 0.86 0.86 105
macro avg 0.86 0.86 0.86 105
weighted avg 0.86 0.86 0.86 105
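The test partition (x_test, y_test) produced by the second split has not been touched so far; once the dev set has done its job of comparing the two models, a final check of the bagging model on the held-out test data follows the same pattern (a sketch; the scores will vary with the run):

predicted_y = bagging.predict(x_test)
print("\n Bagging Model Accuracy on Test Data \n")
print(classification_report(y_test, predicted_y))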
AdaBoost
Boosting is a powerful ensemble technique that is widely used in data science.
I won't dwell on the data here: it comes from the same get_data() function as the bagging example, so by now a glance is enough to know what our samples look like.
Suppose we build a weak classifier whose error rate is only slightly better than random guessing. Boosting constructs a sequence of such weak classifiers on slightly adjusted data sets: each classifier sees data whose weights have been nudged a little, and the sequence ends with the M-th classifier.
What distinguishes this from bagging are the weights alpha and the sequential modeling: AdaBoost builds its classifiers one after another and hands each one a reweighted version of the data set.
Start with the first classifier, m = 1, and initialize the weight of every instance to 1/N; with 100 samples, each sample gets a weight of 0.01. Denoting the weights by w, we now have 100 values:
w_1 = w_2 = ... = w_100 = 0.01
With these weights every sample has an equal chance of being selected by the classifier. We build a classifier and test it on the training set to obtain the misclassification rate, using the formula mentioned earlier:
err_m = sum_i( w_i * abs(y_i - G_m(x_i)) ) / sum_i( w_i )
The abs in the formula means taking the absolute value, so a term is nonzero exactly when instance i is misclassified. From this error rate, the alpha value is computed with the following formula:
alpha_m = log( (1 - err_m) / err_m )
As we all know, the error rate of a usable weak learner is a small number below 0.5, so (1 - err_m)/err_m is greater than 1 and alpha_m is positive. Each misclassified instance then has its weight multiplied by exp(alpha_m):
w_i <- w_i * exp(alpha_m), for every misclassified instance i
As you can see, the weights of the misclassified instances have gone up, which raises the probability that those records are selected by the next classifier; classifiers later in the sequence favor samples with larger weights.
In other words, every subsequent classifier pays more attention to the instances its predecessor misclassified.
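To make these updates concrete, here is one boosting round worked through by hand (a toy sketch in numpy: ten samples and a hypothetical weak learner that misclassifies samples 3 and 7):

import numpy as np

N = 10
w = np.full(N, 1.0 / N)              # step 1: every instance starts at weight 1/N
misclassified = np.zeros(N, dtype=bool)
misclassified[[3, 7]] = True         # pretend the weak learner got these two wrong

err = w[misclassified].sum() / w.sum()   # weighted error rate: 0.2
alpha = np.log((1 - err) / err)          # classifier weight: log(4), about 1.386

w[misclassified] *= np.exp(alpha)    # boost the weights of the misclassified samples
w /= w.sum()                         # renormalize so the weights sum to 1
print(w)   # samples 3 and 7 now carry weight 0.25 each; the rest drop to 0.0625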
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import zero_one_loss
import numpy as np
import matplotlib.pyplot as plt

def build_single_tree_model(x, y):
    # A single, fully grown decision tree for comparison
    model = DecisionTreeClassifier()
    model.fit(x, y)
    return model

def build_boosting_model(x, y, no_estimators=20):
    # AdaBoost over decision stumps (depth-1 trees) using the discrete SAMME algorithm
    boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, min_samples_leaf=1),
                                  random_state=9,
                                  n_estimators=no_estimators, algorithm="SAMME")
    boosting.fit(x, y)
    return boosting

def view_model(model):
    print("\n Estimator Weights and Error \n")
    for i, weight in enumerate(model.estimator_weights_):
        print("estimator %d weight = %0.4f error = %0.4f" % (i + 1, weight, model.estimator_errors_[i]))
    plt.figure(1)
    plt.title("Model weight vs error")
    plt.xlabel("Weight")
    plt.ylabel("Error")
    plt.plot(model.estimator_weights_, model.estimator_errors_)

def number_estimators_vs_error(x, y, x_dev, y_dev):
    # Track the train/dev misclassification rate as the ensemble grows
    no_estimators = range(20, 120, 10)
    misclassy_rate = []
    misclassy_rate_dev = []
    for no_estimator in no_estimators:
        boosting = build_boosting_model(x, y, no_estimators=no_estimator)
        predicted_y = boosting.predict(x)
        predicted_y_dev = boosting.predict(x_dev)
        misclassy_rate.append(zero_one_loss(y, predicted_y))
        misclassy_rate_dev.append(zero_one_loss(y_dev, predicted_y_dev))
    plt.figure(2)
    plt.title("No estimators vs Mis-classification rate")
    plt.xlabel("No of estimators")
    plt.ylabel("Mis-classification rate")
    plt.plot(no_estimators, misclassy_rate, label='Train')
    plt.plot(no_estimators, misclassy_rate_dev, label='Dev')
    plt.legend(loc='best')
    plt.show()
x, y = get_data()
30 3 18 3
# plot_data(x,y)
x_train, x_test_all, y_train, y_test_all = train_test_split(x, y, test_size=0.3, random_state=9)
x_dev, x_test, y_dev, y_test = train_test_split(x_test_all, y_test_all, test_size=0.3, random_state=9)
model = build_single_tree_model(x_train, y_train)
predicted_y = model.predict(x_train)
print("\n Single Model Accuracy on training data \n")
print(classification_report(y_train, predicted_y))
print("Fraction of misclassification = %0.2f" % (zero_one_loss(y_train, predicted_y) * 100), "%")
Single Model Accuracy on training data
precision recall f1-score support
0 1.00 1.00 1.00 181
1 1.00 1.00 1.00 169
micro avg 1.00 1.00 1.00 350
macro avg 1.00 1.00 1.00 350
weighted avg 1.00 1.00 1.00 350
Fraction of misclassification = 0.00 %
A fully grown tree can memorize the training set, so the perfect score above is expected; the interesting question is how the boosted stumps compare.
boosting = build_boosting_model(x_train, y_train, no_estimators=85)
predicted_y = boosting.predict(x_train)
print("\n Boosting Model Accuracy on training data \n")
print(classification_report(y_train, predicted_y))
print("Fraction of misclassification = %0.2f" % (zero_one_loss(y_train, predicted_y) * 100), "%")
Boosting Model Accuracy on training data
precision recall f1-score support
0 1.00 1.00 1.00 181
1 1.00 1.00 1.00 169
micro avg 1.00 1.00 1.00 350
macro avg 1.00 1.00 1.00 350
weighted avg 1.00 1.00 1.00 350
Fraction of misclassification = 0.00 %
view_model(boosting)
Estimator Weights and Error
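The number_estimators_vs_error helper defined above never gets called in the listing; invoking it plots how the train and dev misclassification rates evolve as stumps are added, which is a handy way to choose n_estimators (the exact curves depend on the run):

number_estimators_vs_error(x_train, y_train, x_dev, y_dev)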