Optimization: a way to make grid-search roughly 10x faster

Author: mrlevo520 | Published 2016-11-25 15:20

    Python 2.7
    IDE: PyCharm 5.0.3
    scikit-learn 0.18.1


    Foreword

    Just a clever little trick, please don't come after me: there is no theoretical proof behind this, it simply turned out to work in my model tests, and the waiting time dropped to (roughly) one tenth of the original!! Exciting, right? Hahaha.


    Principle

    Basic idea: first locate the promising region, then refine it, then refine again; a flexible, shrinking search. This method is referred to below as FCV.

    If you are not familiar with CV, see @MrLevo520 -- Summary: Bias, Error, Variance and CV (cross validation)


    Pseudocode

    The principle is easy to follow, so here is the pseudocode. I was too lazy to type it up, so I posted a handwritten draft.

    [Figure: handwritten FCV pseudocode draft]
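
    Since the handwritten image may not render, here is a minimal sketch of the same coarse-to-fine loop (Python 2 style, matching the rest of this post). Here evaluate is a stand-in for one ordinary GridSearchCV pass over the current window; the window update and the 0.5*Step elastic margin follow the FCV code shown later.

    def fcv(evaluate, lo, hi, step):
        # Coarse-to-fine search: run one grid pass, shrink the window around the
        # winner with a 0.5*step elastic margin, halve the step, and repeat.
        best = None
        while step > 0:
            best = evaluate(range(lo, hi, step))   # one GridSearchCV pass over the window
            lo = best - step - step / 2
            hi = best + step + step / 2
            step = step / 2                        # integer division under Python 2.7
        return best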

    FCV runtime test

    Taking GBDT as the example, I tested n_estimators from 190 to 300 and max_depth from 2 to 9, with CV=3.

    • Plain GridSearchCV fit a total of 110 * 7 * 3 = 2310 times and took 1842 min, i.e. 30.7 hours, giving the best parameters n_estimators=289, max_depth=3.

    • FCV fit a total of 345 times and ran for 166 min, i.e. about 2.8 hours, giving the best parameters n_estimators=256, max_depth=3.

    Time-wise that is roughly an 11x difference. What about accuracy? See the CV scores below.
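
    For the curious, the 345 figure can be reproduced by tallying the fits of each FCV round. The per-round window sizes below are derived from the FlexSearch recursion shown later in this post, assuming cv=3 and that no window runs into a boundary:

    cv = 3
    depth_values = 7                # max_depth in range(2, 9)
    fits = 11 * depth_values * cv   # round 1: n_estimators in range(190, 300, 10)
    fits += 6 * 2 * cv              # round 2: step 5, n_estimators window best+-15, max_depth best+-1
    fits += 7 * 2 * cv              # round 3: step 2, window best+-7
    fits += 6 * 2 * cv              # round 4: step 1, window best+-3
    print fits                      # 345, versus 110 * 7 * 3 = 2310 for the full grid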


    FCV accuracy test

    I compared GBDT, RF, XGBoost, and SVM with cross validation, keeping the parameters identical within each algorithm.

    GBDT test results

    from sklearn import model_selection
    from sklearn.ensemble import GradientBoostingClassifier

    # parameters found by plain GridSearchCV
    clf1 = GradientBoostingClassifier(max_depth=3,n_estimators=289)#.fit(train_data,train_label)
    score1 = model_selection.cross_val_score(clf1,train_data,train_label,cv=5)
    print score1
    # -------------------------------------
    # parameters found by FCV
    clf2 = GradientBoostingClassifier(max_depth=3,n_estimators=256)#.fit(train_data,train_label)
    score2 = model_selection.cross_val_score(clf2,train_data,train_label,cv=5)
    print score2

    # ------------------------------------
    # Compare the cross-validation scores of the two methods

    #Traditional method CV=5: [ 0.79807692  0.82038835  0.80684597  0.76108374  0.78163772]
    #FCV method         CV=5: [ 0.79567308  0.82038835  0.799511    0.76847291  0.78411911]

    # ---------------------------------------

    #Traditional method CV=10: [ 0.83333333  0.78571429  0.84615385  0.7961165   0.81067961  0.80097087   0.77227723  0.78109453  0.785   0.74111675]
    #FCV method         CV=10: [ 0.85238095  0.78095238  0.85096154  0.7961165   0.81553398  0.7961165   0.76732673  0.79104478  0.795   0.75126904]
    
    

    XGBoost test results

    from sklearn import model_selection
    from xgboost import XGBClassifier

    # parameters found by plain GridSearchCV
    clf1 = XGBClassifier(max_depth=6,n_estimators=200)#.fit(train_data,train_label)
    score1 = model_selection.cross_val_score(clf1,train_data,train_label,cv=5)
    print score1

    # parameters found by FCV
    clf2 = XGBClassifier(max_depth=4,n_estimators=292)#.fit(train_data,train_label)
    score2 = model_selection.cross_val_score(clf2,train_data,train_label,cv=5)
    print score2

    # -----------------------------
    #Traditional method CV=5: [ 0.79086538  0.83737864  0.80929095  0.79310345  0.7866005 ]
    #FCV method         CV=5: [ 0.80288462  0.84466019  0.8190709   0.79064039  0.78163772]
    
    

    RF test results

    Note: because of RF's inherent randomness (both the samples and the features are drawn at random), the result is not stable even under cross validation. Running the same program with multiprocessing on a server and again on my laptop, the best value came out as 247 in one case and 253 in the other, so do not expect them to agree. (As if GBDT and XGBoost were not tree-based as well 0.o)
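
    If you want the comparison itself to be repeatable, one option (not used in the runs above, so treat it as an aside) is to pin random_state so that both classifiers draw the same random numbers:

    from sklearn.ensemble import RandomForestClassifier

    # Fixing the seed removes the run-to-run jitter described above; 42 is arbitrary.
    clf1 = RandomForestClassifier(n_estimators=205, random_state=42)
    clf2 = RandomForestClassifier(n_estimators=253, random_state=42)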

    from sklearn import model_selection
    from sklearn.ensemble import RandomForestClassifier

    # parameters found by plain GridSearchCV
    clf1 = RandomForestClassifier(n_estimators=205)#.fit(train_data,train_label)
    score1 = model_selection.cross_val_score(clf1,train_data,train_label,cv=5)
    print score1
    # parameters found by FCV
    clf2 = RandomForestClassifier(n_estimators=253)#.fit(train_data,train_label)
    score2 = model_selection.cross_val_score(clf2,train_data,train_label,cv=5)
    print score2

    # ---------------
    #Traditional method CV=5: [ 0.77884615  0.82038835  0.78239609  0.79310345  0.74937965]
    #FCV method         CV=5: [ 0.75721154  0.83495146  0.79217604  0.79310345  0.75186104]
    
    

    SVM test results

    from sklearn import model_selection
    from sklearn.svm import SVC

    # parameters found by the default grid search
    clf1 = SVC(kernel='rbf', C=16, gamma=0.18)#.fit(train_data,train_label)
    score1 = model_selection.cross_val_score(clf1,train_data,train_label,cv=5)
    print score1

    # parameters found by FCV
    clf2 = SVC(kernel='rbf', C=94.75, gamma=0.17)#.fit(train_data,train_label)
    score2 = model_selection.cross_val_score(clf2,train_data,train_label,cv=5)
    print score2

    # Default grid search over (2^-8, 2^8): [ 0.88468992  0.88942774  0.88487805  0.8856305   0.8745098 ]

    # FCV search: [ 0.90116279  0.9185257   0.90146341  0.90713587  0.89509804]
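
    The post does not show the exact grid behind the "default" search, but a typical log-scale grid over (2^-8, 2^8) for an RBF SVM looks roughly like the sketch below. This is purely illustrative; the grid values are my assumption, not the author's settings.

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # 17 candidate values per parameter on a base-2 log scale
    param_grid = {'C': [2 ** i for i in range(-8, 9)],
                  'gamma': [2 ** i for i in range(-8, 9)]}
    grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    # grid.fit(train_data, train_label) would run 17 * 17 * 5 = 1445 fits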
    
    

    FCV advantages

    • It partitions the search range, greatly improving speed: compared with a plain exhaustive search the speedup is roughly a factor of Step, and the wider the search range, the more pronounced the gain.
    • The elastic coefficient can be set to suit your own dataset (the default is 0.5*Step); a well-chosen elastic coefficient reduces the chance of missing values near the window edge (see the worked numbers right after this list).
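
    As a worked example of the 0.5*Step elastic margin (the numbers are made up for illustration), suppose the coarse round returns a best n_estimators of 250 with Step = 10:

    step, best = 10, 250
    lower = best - step - step / 2    # 235: one full step plus half a step below the winner
    upper = best + step + step / 2    # 265
    next_step = step / 2              # 5 (integer division under Python 2.7)
    print lower, upper, next_step     # 235 265 5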

    FCV disadvantages

    • There is some chance of getting stuck in a local optimum.
    • It takes some experience to choose a sensible search range and step size.

    FCV code

    Taking GBDT as the example (my RF version was rewritten to use multiprocessing), suppose we are searching for two optimal parameters. The concept is exactly the same as above, so if the explanation made sense, the code should pose no problems.

    # -*- coding: utf-8 -*-
    from sklearn import metrics, model_selection
    import LoadSolitData          # the author's own data-loading helper
    import Confusion_Show as cs   # the author's own confusion-matrix helper (unused in this script)
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import GradientBoostingClassifier
    
    
    def gradient_boosting_classifier(train_x, train_y,Nminedge,Nmaxedge,Nstep,Dminedge,Dmaxedge,Dstep):
        # One plain GridSearchCV pass over the current n_estimators / max_depth window.
        model = GradientBoostingClassifier()
        param_grid = {'n_estimators': [i for i in range(Nminedge,Nmaxedge,Nstep)], 'max_depth': [j for j in range(Dminedge,Dmaxedge,Dstep)]}
        grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
        grid_search.fit(train_x, train_y)
        best_parameters = grid_search.best_estimator_.get_params()
        for para, val in best_parameters.items():
            print para, val
        return best_parameters["n_estimators"],best_parameters["max_depth"]
    
    def FlexSearch(train_x, train_y,Nminedge,Nmaxedge,Nstep,Dminedge,Dmaxedge,Dstep):
        # Recursive coarse-to-fine search: shrink the window around the current best
        # (keeping a 0.5*step elastic margin) and halve the n_estimators step each round.
        NmedEdge,DmedEdge = gradient_boosting_classifier(train_x, train_y,Nminedge,Nmaxedge,Nstep,Dminedge,Dmaxedge,Dstep)
        Nminedge = NmedEdge - Nstep-Nstep/2
        Nmaxedge = NmedEdge + Nstep+Nstep/2
        Dminedge = DmedEdge - Dstep
        Dmaxedge = DmedEdge + Dstep

        Nstep = Nstep/2    # integer division under Python 2.7; eventually reaches 0 and stops
        print "Current bestPara:N-%.2f,D-%.2f"%(NmedEdge,DmedEdge)
        print "Next Range:N-%d-%d,D--%d-%d"%(Nminedge,Nmaxedge,Dminedge,Dmaxedge)
        print "Next step:N-%.2f,D-%.2f"%(Nstep,Dstep)
        if Nstep > 0 :
            return FlexSearch(train_x, train_y,Nminedge,Nmaxedge,Nstep,Dminedge,Dmaxedge,Dstep)
        else:
            print "The bestPara:N-%.2f,D-%.2f"%(NmedEdge,DmedEdge)
            return NmedEdge,DmedEdge
    
    if __name__ == '__main__':
    
        train_data,train_label,test_data,test_label = LoadSolitData.TrainSize(1,0.2,[i for i in range(1,14)],"outclass")
        # Load your own data here; mine is wrapped in a separate helper module.
        NmedEdge,DmedEdge = FlexSearch(train_data, train_label, 190, 300, 10, 2, 9, 1)
        Classifier = GradientBoostingClassifier(n_estimators=NmedEdge,max_depth=DmedEdge)
        Classifier.fit(train_data,train_label)
        predict_label = Classifier.predict(test_data)
        report = metrics.classification_report(test_label, predict_label)
        print report
        
    

    What's More

    With multiprocessing on top of FCV, the speed improves by another order of magnitude. I will tentatively call this the MFCV method.

    If you are not sure what multiprocessing and multithreading are, see
    @MrLevo520 -- A Python beginner's first look at multiprocessing
    @MrLevo520 -- A Python beginner's first look at multithreading


    MFCV principle

    On top of FCV, split the search range into chunks of gap = (maxedge - minedge) / k, where k is the number of processes. Each chunk is then searched with FCV, which yields k locally best values, and one final selection among those k candidates gives the answer.
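
    A tiny illustration of the split, using the same numbers as the main block further below (minedge=180, maxedge=330, 3 processes):

    minedge, maxedge, k = 180, 330, 3
    gap = (maxedge - minedge) / k    # 50
    chunks = [(minedge + i * gap, minedge + (i + 1) * gap) for i in range(k)]
    print chunks                     # [(180, 230), (230, 280), (280, 330)]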


    MFCV advantages

    • Faster still, thanks to concurrent processes.
    • Because FCV runs on each slice separately and the winners are then cross-validated together, it partly avoids FCV's tendency to get stuck in a local optimum.

    MFCV disadvantages

    • Unlike FCV, it can only parallelize the search over a single parameter; I have not yet figured out how to do a multiprocess search over two parameters, and will revisit that when I do.

    MFCV runtime test

    Since only a single parameter is supported, I used RF as the example, with n_estimators from 190 to 330 and CV=10. Plain GridSearchCV took 97.2 min, FCV took 18 min, and MFCV took 14 min. The total number of fits here is small, so the effect is not dramatic, but you can imagine what happens when a single parameter spans a very wide range...


    MFCV accuracy test

    from sklearn import model_selection
    from sklearn.ensemble import RandomForestClassifier

    # parameters found by plain GridSearchCV
    clf1 = RandomForestClassifier(n_estimators=205)#.fit(train_data,train_label)
    score1 = model_selection.cross_val_score(clf1,train_data,train_label,cv=10)
    print score1
    # parameters found by MFCV
    clf2 = RandomForestClassifier(n_estimators=239)#.fit(train_data,train_label)
    score2 = model_selection.cross_val_score(clf2,train_data,train_label,cv=10)
    print score2


    #Traditional method CV=10: [ 0.84169884  0.85328185  0.85907336  0.84912959  0.82101167  0.8540856 0.84015595  0.87968442  0.84980237  0.83003953]
    #MFCV method        CV=10: [ 0.84169884  0.86100386  0.85521236  0.86266925  0.82101167  0.85019455  0.83625731  0.87771203  0.85177866  0.83201581]
    

    Compared with the traditional method the scores are essentially on par, while the speed goes up by an order of magnitude, so in my view it is worth adopting.


    MFCV code

    #! /usr/bin/python
    # -*- coding: utf-8 -*-
    from multiprocessing import Process
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import metrics
    import Confusion_Show as cs   # the author's own confusion-matrix plotting helper
    import LoadSolitData          # the author's own data-loading helper
    from sklearn.model_selection import GridSearchCV
    
    def write2txt(data,storename):
        # Persist the per-process best parameter so the parent process can collect it.
        f = open(storename,"w")
        f.write(str(data))
        f.close()
    
    def V_candidate(candidatelist,train_x, train_y):
        # Final selection: cross-validate the k per-chunk winners against each other.
        model = RandomForestClassifier()
        param_grid = {'n_estimators': [i for i in candidatelist]}
        grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1,cv=10)
        grid_search.fit(train_x, train_y)
        best_parameters = grid_search.best_estimator_.get_params()
        for para, val in best_parameters.items():
            print para, val
        return best_parameters["n_estimators"]
    
    
    def random_forest_classifier(train_x, train_y,minedge,maxedge,step):
        # One plain GridSearchCV pass over the current n_estimators window.
        model = RandomForestClassifier()
        param_grid = {'n_estimators': [i for i in range(minedge,maxedge,step)]}
        grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1,cv=10)
        grid_search.fit(train_x, train_y)
        best_parameters = grid_search.best_estimator_.get_params()
        for para, val in best_parameters.items():
            print para, val
        return best_parameters["n_estimators"]
    
    
    def FlexSearch(train_x, train_y,minedge,maxedge,step,storename):
        # Single-parameter FCV within one chunk; writes its winner to a text file.
        mededge = random_forest_classifier(train_x, train_y,minedge,maxedge,step)
        minedge = mededge - step-step/2
        maxedge = mededge + step+step/2
        step = step/2
        print "Current bestPara:",mededge
        print "Next Range%d-%d"%(minedge,maxedge)
        print "Next step:",step
        if step > 0:
            return FlexSearch(train_x, train_y,minedge,maxedge,step,storename)
        elif step==0:
            print "The bestPara:",mededge
            write2txt(mededge,"RF_%s.txt"%storename)
    
    
    def Calc(classifier):
        # Fit the final model and report accuracy, average accuracy, kappa and the confusion matrix.
        classifier.fit(train_data,train_label)
        predict_label = classifier.predict(test_data)
    
        total_cor_num = 0.0
        dictTotalLabel = {}
        dictCorrectLabel = {}
        for label_i in range(len(predict_label)):
    
            if predict_label[label_i] == test_label[label_i]:
                total_cor_num += 1
                if predict_label[label_i] not in dictCorrectLabel: dictCorrectLabel[predict_label[label_i]] = 0
                dictCorrectLabel[predict_label[label_i]] += 1.0
            if test_label[label_i] not in dictTotalLabel: dictTotalLabel[test_label[label_i]] = 0
            dictTotalLabel[test_label[label_i]] += 1.0
    
        accuracy = metrics.accuracy_score(test_label, predict_label) * 100
        kappa_score = metrics.cohen_kappa_score(test_label, predict_label)
        average_accuracy = 0.0
        label_num = 0
    
        for key_i in dictTotalLabel:
            try:
                average_accuracy += (dictCorrectLabel[key_i] / dictTotalLabel[key_i]) * 100
                label_num += 1
            except KeyError:
                pass  # this class had no correct predictions
    
        average_accuracy = average_accuracy / label_num
    
        result = "OA:%.4f;AA:%.4f;KAPPA:%.4f"%(accuracy,average_accuracy,kappa_score)
        print result
        report = metrics.classification_report(test_label, predict_label)
        print report
        cm = metrics.confusion_matrix(test_label, predict_label)
        cs.ConfusionMatrixPng(cm,['1','2','3','4','5','6','7','8','9','10','11','12','13'])
    
    
    if __name__ == '__main__':
    
        train_data,train_label,test_data,test_label = LoadSolitData.TrainSize(1,0.5,[i for i in range(1,14)],"outclass") # load your own dataset here
    
        minedge = 180
        maxedge = 330
        step = 10
        gap = (maxedge-minedge)/3    # split the range into 3 equal chunks
        subprocess = []
        for i in range(1,4):    # 3 worker processes, one chunk each
            p = Process(target=FlexSearch,args=(train_data,train_label,minedge+(i-1)*gap,minedge+i*gap,step,i))
            subprocess.append(p)
        for j in subprocess:
            j.start()
        for k in subprocess:
            k.join()
    
        candidatelist = []
        for i in range(1,4):
            with open("RF_%s.txt"%i) as f:
                candidatelist.append(int(f.readlines()[0].strip()))
    
        print "candidatelist:",candidatelist
    
        best_n_estimators = V_candidate(candidatelist, train_data,train_label)
        print "best_n_estimators",best_n_estimators
        Classifier = RandomForestClassifier(n_estimators=best_n_estimators)
        Calc(Classifier)
    
    

    Summary

    Use MFCV for single-parameter tuning and FCV for multi-parameter tuning. The CV scores show that the results stay in line with a plain exhaustive search, while the speedup is substantial, and the wider the search range, the bigger the gain. ISO9001 certified, perfectly safe to consume...... The algorithm is extremely simple, but who knows, it might catch on someday, so I reserve all rights; please give credit when reposting. (May I point out how Bootstrap, an equally simple idea, came to dominate statistics? BTW, the idea behind GridSearch is pretty simple too.)


    Appendix 1 -- FCV handwritten drafts


    Draft 1

    [Figure: handwritten draft 1]

    Draft 2

    [Figure: handwritten draft 2]

    Appendix 2

    If you also want to try xgboost, a very handy classifier, I strongly recommend tuning it with FCV: thanks to OpenMP, xgboost automatically uses every core of a single machine in parallel, so the moment it runs the machine goes into roast-chicken mode, just like this. Keeping the number of runs during tuning as low as possible is therefore essential!

    [Figure: screenshot of all CPU cores maxed out during an xgboost run]
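
    If you do need to keep the machine usable while tuning, you can cap the number of OpenMP threads. A hedged sketch, not from the original post; in xgboost versions of that era the keyword is nthread, while newer releases call it n_jobs:

    from xgboost import XGBClassifier

    # limit xgboost to two cores while tuning
    clf = XGBClassifier(max_depth=4, n_estimators=292, nthread=2)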

    For how to install xgboost, see @MrLevo520 -- Solved: issues installing xgboost for Python on win10_x64


    Acknowledgements

    @abcjennifer -- Parallel parameter tuning in Python: scikit-learn grid_search
    @MrLevo520 -- Solved: issues installing xgboost for Python on win10_x64
    @wzmsltw -- A complete XGBoost tuning guide in Python: parameter explanations
    @u010414589 -- xgboost tuning experience
    @MrLevo520 -- Summary: Bias, Error, Variance and CV (cross validation)
    @MrLevo520 -- A Python beginner's first look at multiprocessing
    @MrLevo520 -- A Python beginner's first look at multithreading


      Reader comments

      • QQ追梦人: Thanks for sharing. Two small questions about

        Nminedge = NmedEdge - Nstep-Nstep/2
        Nmaxedge = NmedEdge + Nstep+Nstep/2
        Dminedge = DmedEdge - Dstep
        Dmaxedge = DmedEdge + Dstep

        1. minedge may well become negative; 2. with step = step/2, step would always stay greater than 0, so did you mean step = step//2?

      • mrlevo520: @QQ_28e8 You're right, I did not handle the case where the edge becomes negative, but as long as the first search round does not pick a value right on the boundary it is fine; you can simply start with a fairly wide search range. As for the second point, the step is the search granularity: as the search window shrinks, the granularity shrinks with it to fit the smaller region, and the step used for searching is of course always greater than 0.
