Model Tuning: Random Forest Hyperparameter Tuning on the Titanic Dataset

Author: YUENFUNGDATA | Published 2020-05-01 20:47

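I. Dataset

Kaggle Titanic dataset, train.csv

II. Model selection

The Titanic dataset poses a binary classification problem. This article tunes a random forest model on it.

III. Data preprocessing

The Titanic dataset has to be preprocessed before it can be used for modeling: the columns Name, Ticket and Cabin are dropped, the columns Sex and Embarked are encoded as integers, missing values in Age are filled with the column mean, rows that still contain missing values are dropped, and the data is split into a feature set and a label set.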

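IV. Tuning workflow

1) Build a simple model and observe how it performs on the dataset
2) Tune n_estimators
3) Tune max_depth
4) Tune min_samples_leaf
5) Tune min_samples_split
6) Tune max_features
7) Tune criterion
8) Determine the best parameter combination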
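
The workflow above can be sketched as a loop that tunes one parameter at a time and keeps a tuned value only when it improves the cross-validated score. This is a minimal sketch, not the exact procedure used later in the article: the data is synthetic (`make_classification` stands in for train.csv), and the search spaces, `n_estimators=20` and `cv=3` are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (an assumption; the article uses Kaggle's train.csv)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Hypothetical, deliberately small search spaces
search_spaces = [
    {"max_depth": [2, 4, 6]},
    {"min_samples_leaf": [1, 3, 5]},
]

fixed_params = {"n_estimators": 20, "random_state": 90}

# Baseline score with untuned parameters
best_score = cross_val_score(
    RandomForestClassifier(**fixed_params), X_demo, y_demo, cv=3
).mean()

for space in search_spaces:
    gs = GridSearchCV(RandomForestClassifier(**fixed_params), space, cv=3)
    gs.fit(X_demo, y_demo)
    # Keep the tuned value only if it beats the best score so far
    if gs.best_score_ > best_score:
        best_score = gs.best_score_
        fixed_params.update(gs.best_params_)

print(fixed_params, best_score)
```

Tuning parameters one by one is cheap but can miss interactions between them; a joint grid over several parameters is the more thorough (and slower) alternative.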

V. Tuning steps in detail

1) Import the required libraries

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2) Inspect the dataset

data=pd.read_csv("C:\\Users\\DRF\\Desktop\\tatanic\\datasets\\train.csv",index_col=0)
data.head()
data.info()

The data contains missing values, among other issues, so it needs preprocessing before modeling.

3) Data preprocessing
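
Besides reading the `data.info()` output, the missing values can be counted per column directly. The sketch below uses a tiny hand-made frame with hypothetical values rather than train.csv, only to show the call.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (hypothetical values) with the same kinds of gaps
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.28, 8.05],
})

# Number of missing entries in each column
missing_counts = demo.isnull().sum()

# Show only the columns that actually have gaps
print(missing_counts[missing_counts > 0])
```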

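# Fill missing Age values with the column mean
data['Age']=SimpleImputer(missing_values=np.nan,strategy='mean').fit_transform(data[['Age']]).ravel()
# Drop the columns that are not used as features
data.drop(['Name','Cabin','Ticket'],axis=1,inplace=True)
# Encode Sex as 1 (male) / 0 (female)
data['Sex']=(data['Sex']=='male').astype('int32')
# Drop the rows that still have missing values (the rows missing Embarked)
data=data.dropna()
# Encode Embarked as integers in order of first appearance
labels=data['Embarked'].unique().tolist()
data['Embarked']=data['Embarked'].apply(lambda x:labels.index(x))
# Separate the features x from the labels y
x=data.iloc[:,data.columns!='Survived']
y=data.iloc[:,data.columns=='Survived']
y=y.values.ravel()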

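The Titanic dataset is now preprocessed, with features x and labels y separated, and is ready for modeling.

4) Build a simple model and observe how it performs on the dataset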
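
For comparison, the same preprocessing steps can be written with pandas alone. This is an alternative sketch on a few hypothetical rows, not the article's code: `fillna` replaces `SimpleImputer`, and `pd.factorize` replaces the `labels.index(x)` lookup (both number the categories in order of first appearance).

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like train.csv after the unused columns are dropped
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 36.0],
    "Embarked": ["S", "C", None, "S"],
})

demo["Age"] = demo["Age"].fillna(demo["Age"].mean())   # mean imputation
demo["Sex"] = (demo["Sex"] == "male").astype("int32")  # male -> 1, female -> 0
demo = demo.dropna().copy()                            # drops the row missing Embarked
# factorize assigns integer codes in order of first appearance
demo["Embarked"] = pd.factorize(demo["Embarked"])[0]

# Separate features and labels
x_demo = demo.drop(columns="Survived")
y_demo = demo["Survived"].values
print(x_demo)
```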

rfc=RandomForestClassifier(n_estimators=100,random_state=90)
score_pre=cross_val_score(rfc,x,y,cv=10).mean()
score_pre

score_pre is 0.809920837589377
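
`cross_val_score` returns one score per fold, and the baseline above is their mean. Looking at the spread across folds as well gives a feel for how large a later gain has to be before it means much. A small sketch on synthetic data (`make_classification` and `n_estimators=20` are assumptions, not the article's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

rfc_demo = RandomForestClassifier(n_estimators=20, random_state=90)

# One accuracy value per fold
fold_scores = cross_val_score(rfc_demo, X_demo, y_demo, cv=10)

# Mean accuracy and its standard deviation across the 10 folds
print(fold_scores.mean(), fold_scores.std())
```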

5) Tune n_estimators

scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,x,y,cv=10).mean()
    scorel.append(score)
print(max(scorel),((scorel.index(max(scorel))*10))+1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()

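Result:

The scores and the learning curve show that, within this coarse search, accuracy peaks at n_estimators=71 and reaches 0.8121807967313586, a gain of about 0.002 over the baseline of 0.809920837589377.

Next, narrow the range and explore n_estimators from 65 to 74.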
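
The expression `scorel.index(max(scorel))*10+1` converts the position of the best score back into the n_estimators value that produced it. Indexing the `range` object directly gives the same result and does not depend on remembering the step and the offset. The scores below are made up purely to show the mapping.

```python
# Candidate values used in the coarse search: 1, 11, 21, ..., 191
candidates = range(1, 201, 10)

# Hypothetical scores, one per candidate, only to illustrate the index mapping
demo_scores = [0.70 + 0.001 * k for k in range(len(candidates))]
demo_scores[7] = 0.90  # pretend the 8th candidate is the best one

best_idx = demo_scores.index(max(demo_scores))

# Both expressions recover the same n_estimators value
print(candidates[best_idx], best_idx * 10 + 1)
```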

scorel=[]
for i in range(65,75):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90)
    score=cross_val_score(rfc,x,y,cv=10).mean()
    scorel.append(score)
print(max(scorel),[*range(65,75)][scorel.index(max(scorel))])
plt.figure(figsize=[20,5])
plt.plot(range(65,75),scorel)
plt.show()

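Result:

After narrowing the range, the best value is again n_estimators=71, with an accuracy of 0.8121807967313586, so n_estimators is fixed at 71. Next comes grid search, which is used to tune the remaining parameters one at a time and to see how moving along the complexity vs. generalization error curve can raise accuracy.

6) Tune max_depth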
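
`best_params_` and `best_score_` report only the winner of a grid search. The fitted `GridSearchCV` object also keeps the mean score of every candidate in `cv_results_`, which shows how close the runners-up were. A minimal sketch on synthetic data (the data, the small grid and `n_estimators=20` are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

gs_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=90),
    {"min_samples_leaf": [1, 3, 5]},
    cv=5,
)
gs_demo.fit(X_demo, y_demo)

# Mean cross-validated accuracy of every candidate, not just the best one
for params, mean_score in zip(gs_demo.cv_results_["params"],
                              gs_demo.cv_results_["mean_test_score"]):
    print(params, round(mean_score, 4))
```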

grid_param={'max_depth':[*np.arange(1,7)]}

rfc=RandomForestClassifier(n_estimators=71,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(x,y)
GS.best_params_
GS.best_score_

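Result:

The grid search returns max_depth=5 as the best value, with a best accuracy of 0.8346456692913385.

Limiting max_depth to 5 lowers model complexity and raises accuracy, which means the model sits on the right side of the curve, that is, to the right of the generalization error minimum. max_depth=5 is the best value from this search. Note that the code in the following steps does not pass max_depth=5 to the estimator, and max_depth does not appear in the final parameter combination; the score of 0.8346456692913385 serves as the reference for the next comparisons.

7) Tune min_samples_leaf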
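
One way to see where a model sits on the complexity vs. generalization error curve is to compare training accuracy with cross-validated accuracy as max_depth grows. The sketch below uses scikit-learn's `validation_curve` on synthetic data; the data, `n_estimators=20` and `cv=5` are assumptions and the numbers it prints say nothing about the Titanic results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the preprocessed Titanic features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

depths = np.arange(1, 7)
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=90),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)

# Rows are max_depth values, columns are folds; average over the folds
for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(d, round(tr, 3), round(cv, 3))
```

A training score that keeps climbing while the cross-validated score flattens or falls suggests the model is past the minimum, on the overfitting side.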

grid_param={'min_samples_leaf':[*np.arange(1,11,1)]}

rfc=RandomForestClassifier(n_estimators=71,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(x,y)
GS.best_params_
GS.best_score_

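Result:

With min_samples_leaf=3 the best accuracy is 0.8335208098987626, lower than the previous 0.8346456692913385. Accuracy has dropped, which places the model on the left side of the curve, that is, to the left of the generalization error minimum. min_samples_leaf is therefore left at its default.

8) Tune min_samples_split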

grid_param={'min_samples_split':[*np.arange(2,22,1)]}

rfc=RandomForestClassifier(n_estimators=71,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(x,y)
GS.best_params_
GS.best_score_

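Result:

With min_samples_split=17 the best accuracy is 0.8357705286839145, higher than the previous 0.8346456692913385. Accuracy rose as model complexity decreased, which means the model sits on the right side of the curve, that is, to the right of the generalization error minimum. min_samples_split is fixed at 17.

9) Tune max_features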

grid_param={'max_features':[*np.arange(2,7,1)]}

rfc=RandomForestClassifier(n_estimators=71,min_samples_split=17,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(x,y)
GS.best_params_
GS.best_score_

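Result:

The grid search returns max_features=2 as the best value, with the same accuracy as before. The default for max_features is the square root of the number of features, which for the 7 features here truncates to 2, so the accuracy does not change. max_features is fixed at 2.

10) Tune criterion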
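
The arithmetic behind that default can be checked by hand. The feature count of 7 is an assumption based on the preprocessing described earlier (Pclass, Sex, Age, SibSp, Parch, Fare and Embarked remain once Survived is split off); the square root is truncated to an integer.

```python
import math

# Feature columns assumed to remain after preprocessing:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
n_features = 7

# Square root of the feature count, truncated, with a floor of 1
default_max_features = max(1, int(math.sqrt(n_features)))
print(default_max_features)  # 2
```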

grid_param={'criterion':['gini','entropy']}

rfc=RandomForestClassifier(n_estimators=71,min_samples_split=17,max_features=2,random_state=90)
GS=GridSearchCV(rfc,grid_param,cv=10)
GS.fit(x,y)
GS.best_params_
GS.best_score_

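Result:

The grid search returns criterion='gini' as the best value, with the same accuracy as before. gini is also the default criterion, so the accuracy does not change.

VI. Tuning complete: the best parameter combination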

RandomForestClassifier(n_estimators=71
                      ,min_samples_split=17
                      ,max_features=2
                      ,criterion='gini'
                      ,random_state=90)

Accuracy before tuning: 0.809920837589377 (80.99%)
Accuracy after tuning: 0.8357705286839145 (83.58%)
Accuracy gain: 0.0258496910945375 (+2.58 percentage points)
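
The gain can be reproduced from the two scores reported in this article. The absolute difference is in percentage points; dividing by the starting score gives the relative improvement.

```python
# Cross-validated accuracies reported in this article
score_before = 0.809920837589377
score_after = 0.8357705286839145

gain = score_after - score_before

# Absolute gain in percentage points
print(round(gain * 100, 2))
# Relative improvement over the baseline, in percent
print(round(gain / score_before * 100, 2))
```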

·································································································································································
Complete code:

# Import the required libraries
from sklearn.ensemble import RandomForestClassifier # random forest ensemble classifier
from sklearn.model_selection import cross_val_score # cross-validation
from sklearn.model_selection import GridSearchCV    # grid search
from sklearn.impute import SimpleImputer            # SimpleImputer fills in missing values
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Inspect the dataset
data=pd.read_csv("C:\\Users\\DRF\\Desktop\\tatanic\\datasets\\train.csv",index_col=0) # load the dataset
data.head()
data.info()

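# Data preprocessing
# Fill missing Age values with the column mean
data['Age']=SimpleImputer(missing_values=np.nan,strategy='mean').fit_transform(data[['Age']]).ravel()
# Drop the columns that are not used as features
data.drop(['Name','Cabin','Ticket'],axis=1,inplace=True)
# Encode Sex as 1 (male) / 0 (female)
data['Sex']=(data['Sex']=='male').astype('int32')
# Drop the rows that still have missing values (the rows missing Embarked)
data=data.dropna()
# Encode Embarked as integers in order of first appearance
labels=data['Embarked'].unique().tolist()
data['Embarked']=data['Embarked'].apply(lambda x:labels.index(x))
# Separate the features x from the labels y
x=data.iloc[:,data.columns!='Survived']
y=data.iloc[:,data.columns=='Survived']
y=y.values.ravel()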

# Build a simple model and observe how it performs on the dataset
rfc=RandomForestClassifier(n_estimators=100,random_state=90)      # instantiate
score_pre=cross_val_score(rfc,x,y,cv=10).mean() # 10-fold cross-validation
score_pre

# Tune n_estimators
scorel=[]
for i in range(1,201,10):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90) # score a model for each n_estimators in 1, 11, ..., 191
    score=cross_val_score(rfc,x,y,cv=10).mean()
    scorel.append(score)
print(max(scorel),((scorel.index(max(scorel))*10))+1) # best score and the n_estimators that produced it
plt.figure(figsize=[20,5]) # plot the learning curve
plt.plot(range(1,201,10),scorel)
plt.show()

scorel=[]
for i in range(65,75):
    rfc=RandomForestClassifier(n_estimators=i,random_state=90) # score a model for each n_estimators from 65 to 74
    score=cross_val_score(rfc,x,y,cv=10).mean()
    scorel.append(score)
print(max(scorel),[*range(65,75)][scorel.index(max(scorel))])
plt.figure(figsize=[20,5]) # plot the learning curve
plt.plot(range(65,75),scorel)
plt.show()

# Tune max_depth: grid search for the best value
grid_param={'max_depth':[*np.arange(1,7)]} # parameter and the range of values to search
rfc=RandomForestClassifier(n_estimators=71,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search
GS.fit(x,y)       # fit the model
GS.best_params_   # best parameter
GS.best_score_    # best score

# Tune min_samples_leaf: grid search for the best value
grid_param={'min_samples_leaf':[*np.arange(1,11,1)]} # parameter and the range of values to search
rfc=RandomForestClassifier(n_estimators=71,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search
GS.fit(x,y)       # fit the model
GS.best_params_   # best parameter
GS.best_score_    # best score

# Tune min_samples_split: grid search for the best value
grid_param={'min_samples_split':[*np.arange(2,22,1)]} # parameter and the range of values to search
rfc=RandomForestClassifier(n_estimators=71,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search
GS.fit(x,y)       # fit the model
GS.best_params_   # best parameter
GS.best_score_    # best score

# Tune max_features: grid search for the best value
grid_param={'max_features':[*np.arange(2,7,1)]} # parameter and the range of values to search
rfc=RandomForestClassifier(n_estimators=71,min_samples_split=17,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search
GS.fit(x,y)      # fit the model
GS.best_params_  # best parameter
GS.best_score_   # best score

# Tune criterion: grid search for the best value
grid_param={'criterion':['gini','entropy']} # parameter and the values to search
rfc=RandomForestClassifier(n_estimators=71,min_samples_split=17,max_features=2,random_state=90) # instantiate
GS=GridSearchCV(rfc,grid_param,cv=10) # grid search
GS.fit(x,y)      # fit the model
GS.best_params_  # best parameter
GS.best_score_   # best score

# Final best parameter combination
RandomForestClassifier(n_estimators=71
                      ,min_samples_split=17
                      ,max_features=2
                      ,criterion='gini'
                      ,random_state=90)

That is all of my approach to tuning a random forest on the Titanic dataset.
