A Detailed Guide to Feature Selection and Performance Evaluation for Large-Scale Modeling in Python

Author: 14e61d025165 | Published: 2019-06-17 14:46


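With many feature variables, many candidate models, and many parameters per model, choosing suitable features, a suitable model, and suitable model parameters is important for modeling, and also difficult. There are many ways to search for the best option; this article tries to describe one of them clearly: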

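<bi style="box-sizing: border-box; display: block;">Enumerate every feature combination, fit each one with a simple general-purpose model, and evaluate the performance of each fit to pick the best feature combination. Then use that combination to build other models with various parameter settings, and settle on the best model and parameters.</bi>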

About the data

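We load a dataset bundled with sklearn: X holds 13 feature variables, and y is a one-dimensional discrete class label. The goal is to find the combination of X's features that best fits the classification in y. The code below loads the dataset: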

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
y = wine.target
</pre>
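As a quick check on the description above, the following minimal sketch inspects the shape of the data and the class distribution. The names `n_samples`, `n_features`, and `class_counts` are introduced here for illustration only and do not appear in the original code.

```python
from collections import Counter

from sklearn.datasets import load_wine

wine = load_wine()
X, y = wine.data, wine.target

# Rows are samples, columns are features.
n_samples, n_features = X.shape
# Number of samples in each class label.
class_counts = Counter(y.tolist())

print(n_samples, n_features)
print(list(wine.feature_names))
print(dict(class_counts))
```

The printed feature names make it easier to interpret the integer column indices used in the rest of the article.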


1. Loading the required Python modules

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn import ensemble
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
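# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV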
</pre>

2. Selecting the best features

2.1 Loading the data

Prepare the data: convert X to a pandas DataFrame and list all of X's feature indices.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">wine = load_wine()
X = wine.data
y = wine.target
X = pd.DataFrame(X)
features = [0,1,2,3,4,5,6,7,8,9,10,11,12]
</pre>

2.2 Defining the feature-enumeration function

Define a function, combinations, that enumerates the feature combinations, and store the result in group_combinations.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">def combinations(ls):
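    n = 1 << len(ls)  # 2**len(ls) bitmasks, one per subset
    tmp = []
    for i in range(n):
        # Expand i into len(ls) bits, most significant first;
        # bits[index] == 1 means ls[index] is selected.
        bits = [i >> offset & 1 for offset in range(len(ls) - 1, -1, -1)]
        if np.sum(bits) > 0:  # skip the empty subset
            current = [ls[index] for (index, bit) in enumerate(bits) if bit == 1]
            tmp.append(current)
    return tmp

group_combinations = combinations(features)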
</pre>
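The same enumeration can also be sketched with the standard library's `itertools`, assuming only the set of subsets matters and not the order in which they are produced. The helper name `all_subsets` is hypothetical, and `itertools.combinations` is imported under an alias so that it does not clash with the `combinations` function defined above.

```python
from itertools import chain, combinations as iter_combinations


def all_subsets(items):
    """Return every non-empty subset of items as a list, smallest subsets first."""
    sizes = range(1, len(items) + 1)
    return [list(c) for c in chain.from_iterable(iter_combinations(items, k) for k in sizes)]


small = all_subsets([0, 1, 2])
print(small)       # [[0], [1], [2], [0, 1], [0, 2], [1, 2], [0, 1, 2]]

full = all_subsets(list(range(13)))
print(len(full))   # 8191, i.e. 2**13 - 1
```

The count doubles with every added feature, so this exhaustive approach is only practical for a small number of features.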

group_combinations holds 8191 combinations in total.

[images: excerpts of the contents of group_combinations]

That is, every non-empty subset of the 13 features: 2^13 - 1 = 8191.

2.3 Iterating over the feature combinations

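Define two variables to hold the best feature combination and its accuracy, iterate over every item in group_combinations, and compute each model's score with cross_val_score.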

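We use the simple, widely used logistic regression classifier to find the best features.
We use cross_val_score with 5-fold cross-validation to compute the model's prediction accuracy.
With scoring='accuracy', cross_val_score returns accuracies; cv=5 splits the data into 5 folds, so it returns 5 accuracy values.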
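To make the return value of cross_val_score concrete, here is a minimal sketch on the full feature set. The pipeline with `StandardScaler` and the `max_iter=1000` setting are assumptions added only so the solver converges quickly; the article itself fits a plain `LogisticRegression()` on the raw features. The names `model`, `fold_scores`, and `mean_accuracy` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Assumption: scaling is added here for fast convergence; it is not part of the original code.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy value per fold.
fold_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
# The single number that gets compared across feature combinations.
mean_accuracy = np.mean(fold_scores) * 100

print(fold_scores)
print('%.4f%%' % mean_accuracy)
```

Averaging the five fold scores gives one figure per feature combination, which is what the search loop compares.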

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">lm = linear_model.LogisticRegression()
best_feature = []
best_score = 0
for v in group_combinations:
    x = X[v]
    # Mean 5-fold cross-validated accuracy, as a percentage
    score = np.mean(cross_val_score(lm, x, y, cv=5, scoring='accuracy')) * 100
    if score > best_score:
        best_score = score
        best_feature = v
    print('Features ' + str(v) + ' mean accuracy: ' + '%.4f' % score + '%')
print('Best feature combination: ' + str(best_feature) + ', accuracy: ' + '%.4f' % best_score + '%')
</pre>

Best feature combination: [0, 1, 2, 3, 6, 8, 9, 10, 12], accuracy: 96.7267%

We then redefine the feature set x to use only [0, 1, 2, 3, 6, 8, 9, 10, 12]:

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">features = [0, 1, 2, 3, 6, 8, 9, 10, 12]
x=X[[0, 1, 2, 3, 6, 8, 9, 10, 12]]
</pre>

3. Finding the most suitable model

3.1 Decision tree model

Build a basic decision tree model and see how it performs.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">dtc = tree.DecisionTreeClassifier(random_state=0)
dtc_s = np.mean(cross_val_score(dtc,x,y,cv=5,scoring='accuracy'))*100
print('Mean accuracy: ' + '%.4f' % dtc_s + '%')
</pre>

Mean accuracy: 87.5765%

That is worse than logistic regression, so let's see how the decision tree does with other parameter settings.

Next we tune the parameters with a grid search, covering both values of criterion and a range of max_depth values.

criterion

<bi style="box-sizing: border-box; display: block;">criterion='gini': nodes are split according to the Gini index. </bi><bi style="box-sizing: border-box; display: block;">criterion='entropy': nodes are split according to information gain.</bi>

max_depth

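<bi style="box-sizing: border-box; display: block;">If None, the depth of the tree is unlimited: nodes keep splitting until every leaf is pure, that is, all samples in a leaf belong to the same class. </bi><bi style="box-sizing: border-box; display: block;">Other parameters can be searched as well; they are not listed one by one here.</bi>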

GridSearchCV takes a given model (here dtc), tries every combination of the values in parameters, and reports the score of each.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">parameters={
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8]
}
grid_search = GridSearchCV(dtc, parameters, scoring='accuracy', cv=5)
grid_search.fit(x, y)
print('Best parameters: ' + str(grid_search.best_params_) + ', accuracy: ' + '%.4f' % (grid_search.best_score_ * 100) + '%')
</pre>

Best parameters: {'criterion': 'entropy', 'max_depth': 2}, accuracy: 92.1348%
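Beyond the single best setting, it can help to look at how every candidate in the grid scored. The sketch below assumes a scikit-learn release that provides `cv_results_` (0.18 or later) and rebuilds the selected feature subset directly from the NumPy array; the names `search` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
x = X[:, [0, 1, 2, 3, 6, 8, 9, 10, 12]]  # the feature subset selected earlier

param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [1, 2, 3, 4, 5, 6, 7, 8]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring='accuracy', cv=5)
search.fit(x, y)

# One row per parameter combination, with mean and spread of the fold scores.
columns = ['param_criterion', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values('rank_test_score').head())
```

Looking at the top few rows shows whether the best setting clearly wins or whether several settings score about the same.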

Overall, the decision tree model is worse than the logistic regression model.

3.2 Random forest model

First, build a basic random forest with the default settings.

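In a random forest, random_state fixes the random seed so that the same forest is produced on every run; the individual trees inside it still differ from one another.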

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">rfc = ensemble.RandomForestClassifier(random_state=0)
rfc_s = np.mean(cross_val_score(rfc,x,y,cv=5,scoring='accuracy'))*100
print('Mean accuracy: ' + '%.4f' % rfc_s + '%')
</pre>

Mean accuracy: 96.6658%
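The effect of random_state described above can be checked with a small sketch: two forests built with the same seed give identical predictions, while the trees inside one forest are grown individually. The names `forest_a`, `forest_b`, `same_forest`, and `tree_sizes` are illustrative, and n_estimators=10 is chosen here only to keep the output short.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# Two forests with the same seed, fitted on the same data.
forest_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# The same seed reproduces the same forest, so the predicted probabilities match.
same_forest = np.array_equal(forest_a.predict_proba(X), forest_b.predict_proba(X))

# Node count of each tree in one forest, to compare the individual trees.
tree_sizes = [tree.tree_.node_count for tree in forest_a.estimators_]

print(same_forest)
print(tree_sizes)
```

Fixing the seed is what makes the accuracy figures in this section reproducible from run to run.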

That is a very good accuracy. Next we look for even better parameters.

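n_estimators is the number of trees in the forest, that is, the number of weak learners. In general, too small a value tends to underfit, while too large a value costs a lot of computation, and beyond a certain point adding more trees improves the model very little, so a moderate value is usually chosen. The default is 100 in scikit-learn 0.22 and later (earlier releases defaulted to 10). We search n_estimators over 10, 20, 30, and so on up to 150: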

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">parameters = {'n_estimators':np.arange(10,151,10)}
grid_search=GridSearchCV(rfc,parameters,scoring='accuracy',cv=5)
grid_search.fit(x,y)
print('Best parameters: ' + str(grid_search.best_params_) + ', accuracy: ' + '%.4f' % (grid_search.best_score_ * 100) + '%')
</pre>

Best parameters: {'n_estimators': 40}, accuracy: 98.3146%

This improves our accuracy further.

Let's see whether tuning other parameters can raise the accuracy further. We fix n_estimators=40:

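<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">rfc = ensemble.RandomForestClassifier(n_estimators=40)
parameters = {
    'max_depth': np.arange(3, 11, 2),
    'min_samples_split': np.arange(5, 21, 2)
}
grid_search = GridSearchCV(rfc, parameters, scoring='accuracy', cv=5)
grid_search.fit(x, y)
print('Best parameters: ' + str(grid_search.best_params_) + ', accuracy: ' + '%.4f' % (grid_search.best_score_ * 100) + '%')
</pre>

Best parameters: {'max_depth': 5, 'min_samples_split': 13}, accuracy: 98.3146%, the same as before; nothing better was found.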

Here we fixed n_estimators=40 for the search. We can also try searching all three parameters together, although that runs more slowly.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">rfc = ensemble.RandomForestClassifier(random_state=0)
parameters = {
    'n_estimators': np.arange(10, 71, 10),
    'max_depth': np.arange(3, 11, 2),
    'min_samples_split': np.arange(5, 21, 2)
}
grid_search = GridSearchCV(rfc, parameters, scoring='accuracy', cv=5)
grid_search.fit(x, y)
print('Best parameters: ' + str(grid_search.best_params_) + ', accuracy: ' + '%.4f' % (grid_search.best_score_ * 100) + '%')
</pre>

Best parameters: {'max_depth': 3, 'min_samples_split': 9, 'n_estimators': 40}, accuracy: 97.1910%
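The reason the three-parameter search is slower can be seen by counting the candidates before fitting anything. This sketch uses `ParameterGrid` from scikit-learn to count the combinations in the grid above; the names `n_candidates` and `n_fits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = {
    'n_estimators': np.arange(10, 71, 10),      # 7 values
    'max_depth': np.arange(3, 11, 2),           # 4 values
    'min_samples_split': np.arange(5, 21, 2),   # 8 values
}

# Every combination of the three parameters is one candidate model.
n_candidates = len(ParameterGrid(parameters))
# Each candidate is fitted once per cross-validation fold (cv=5).
n_fits = n_candidates * 5

print(n_candidates, n_fits)   # 224 1120
```

By comparison, the two-parameter search with n_estimators fixed covers only 4 × 8 = 32 candidates, or 160 cross-validation fits.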

This time the search did not find anything better than before.

Now let's keep the other parameters fixed and vary only n_estimators, from 4 to 150, to see how the model's accuracy trends.

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">rfc = ensemble.RandomForestClassifier(random_state=0)
parameters = {'n_estimators':np.arange(4,151,2)}
grid_search=GridSearchCV(rfc,parameters,scoring='accuracy',cv=5)
grid_search.fit(x,y)
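# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same information
ss = pd.DataFrame({
    'parameters': grid_search.cv_results_['params'],
    'mean_validation_score': grid_search.cv_results_['mean_test_score']
})
f = lambda x: x['n_estimators']
ss['n_estimators'] = ss['parameters'].map(f)
plt.plot(ss['n_estimators'], ss['mean_validation_score'], label="RandomForest")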
</pre>

grid_search.grid_scores_ returns the score of each parameter setting in the grid search. This attribute exists only in scikit-learn releases before 0.20; later releases expose the same information as cv_results_.

Let's look at the contents of ss.

[image: contents of ss]

We use map to extract the n_estimators value from the parameters column and write it into the n_estimators column of ss.

[image: ss with the added n_estimators column]

Then we plot ss['n_estimators'] on the x-axis against ss['mean_validation_score'] on the y-axis to see how accuracy changes.

[image: accuracy versus n_estimators]

The curve levels off after 70, so let's look at the earlier part.

[image: accuracy versus n_estimators, earlier part of the range]

We can see that accuracy is best, and stable, with n_estimators between 40 and 50.
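One way to put a number on "stable" is to look at the spread of the fold scores next to their mean. The sketch below is an illustration under assumptions: it uses a coarse grid of four values chosen only to keep it fast, and it requires a scikit-learn release that provides `cv_results_`. The names `search` and `table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)
x = X[:, [0, 1, 2, 3, 6, 8, 9, 10, 12]]  # the feature subset selected earlier

# Assumption: a coarse grid, used here only so the sketch runs quickly.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'n_estimators': [10, 40, 70, 100]},
                      scoring='accuracy', cv=5)
search.fit(x, y)

# Mean and standard deviation of the five fold accuracies for each setting.
table = pd.DataFrame({
    'n_estimators': [p['n_estimators'] for p in search.cv_results_['params']],
    'mean_accuracy': search.cv_results_['mean_test_score'],
    'std_accuracy': search.cv_results_['std_test_score'],
})
print(table)
```

When the differences between mean accuracies are smaller than the standard deviations, the ranking of neighbouring settings should be read with caution.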

In the end, we settle on the random forest with n_estimators=40 as our best model.

Here is what the model looks like with n_estimators=40:

<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='gini',max_leaf_nodes=None,
max_depth=None, max_features='auto',
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=1,
oob_score=False, random_state=0, verbose=0, warm_start=False)
</pre>

The other parameters shown here could be tuned further; that is not demonstrated one by one.
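As a final sanity check, the chosen model can be fitted on one part of the data and scored on the rest, using train_test_split from the imports at the top. This is a sketch under assumptions: the 70/30 stratified split and the names `final_model` and `test_accuracy` are not from the article, which reports only cross-validated scores. Because the features and n_estimators were chosen using all of the data, the resulting figure is likely to be somewhat optimistic.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
x = X[:, [0, 1, 2, 3, 6, 8, 9, 10, 12]]  # the feature subset selected earlier

# Assumption: a 70/30 split, stratified so each class keeps its proportion.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)

# The model chosen above: a random forest with 40 trees.
final_model = RandomForestClassifier(n_estimators=40, random_state=0)
final_model.fit(x_train, y_train)

test_accuracy = accuracy_score(y_test, final_model.predict(x_test))
print('%.4f%%' % (test_accuracy * 100))
```

A stricter evaluation would set the test portion aside before the feature search and grid search are run.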

4. Summary