7.10 Parameter Tuning

Author: 操作系统 | Published: 2017-06-28 16:20

    7.10.1 The grid_search.GridSearchCV class

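    The sklearn library provides the grid_search module for tuning model parameters. Grid search selects parameters by cross-validating every combination on a grid of candidate values, which makes the choice systematic rather than ad hoc.
      The GridSearchCV class in this module implements fit, predict, predict_proba, and related methods, and uses cross-validation to search the parameter space for the best combination. Note that sklearn.grid_search was deprecated in scikit-learn 0.18 and removed in 0.20; in current releases the same class is imported from sklearn.model_selection. Its signature in the grid_search module is: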

    sklearn.grid_search.GridSearchCV(estimator,
        param_grid, scoring=None, fit_params=None,
        n_jobs=1, iid=True, refit=True, cv=None,
        verbose=0, pre_dispatch='2*n_jobs', error_score='raise')
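
    As a rough illustration of the call above, the following minimal sketch runs a small search end to end. The iris dataset, the DecisionTreeClassifier, and the grid values are assumptions chosen only to keep the example runnable; the import uses sklearn.model_selection, which is where GridSearchCV lives in current scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data, used only so that the sketch is runnable (an assumption).
X, y = load_iris(return_X_y=True)

# Fixed parameters go into the estimator; searched ones go into param_grid.
estimator = DecisionTreeClassifier(random_state=0)
param_grid = {"max_depth": [1, 2, 3, 4], "min_samples_leaf": [1, 5]}

search = GridSearchCV(estimator, param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

# After fitting (with refit=True), the search object predicts with the best model.
predictions = search.predict(X[:5])
print(search.best_params_)
```

    With refit left at its default, predict is delegated to the estimator refitted with the best parameter combination.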
    

    7.10.2 Common parameters

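    estimator: the estimator to tune, for example estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10). Pass every parameter that is not being searched directly to the estimator. The estimator must either provide a score method, or a scoring argument must be passed to GridSearchCV.
      param_grid: a dict, or a list of dicts, mapping each parameter to be optimized to its candidate values, for example param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
      scoring: the evaluation criterion, default None. It can be a string naming a predefined scorer, such as scoring='roc_auc', or a callable with the signature scorer(estimator, X, y). If None, the estimator's own score method is used. The appropriate criterion depends on the model being tuned. The predefined values are: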

    Scoring                      Function                           Comment
    accuracy                     metrics.accuracy_score
    average_precision            metrics.average_precision_score
    f1                           metrics.f1_score                   for binary targets
    f1_micro                     metrics.f1_score                   micro-averaged
    f1_macro                     metrics.f1_score                   macro-averaged
    f1_weighted                  metrics.f1_score                   weighted average
    f1_samples                   metrics.f1_score                   by multilabel sample
    neg_log_loss                 metrics.log_loss                   requires predict_proba support
    precision                    metrics.precision_score            suffixes apply as with 'f1'
    recall                       metrics.recall_score               suffixes apply as with 'f1'
    roc_auc                      metrics.roc_auc_score
    adjusted_rand_score          metrics.adjusted_rand_score
    neg_mean_absolute_error      metrics.mean_absolute_error
    neg_mean_squared_error       metrics.mean_squared_error
    neg_median_absolute_error    metrics.median_absolute_error
    r2                           metrics.r2_score
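
    The three ways of specifying scoring can be sketched as follows. The dataset, the decision tree, and the error_rate_scorer helper are hypothetical choices made for illustration; make_scorer is scikit-learn's standard wrapper for turning a metric function into a scorer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Binary toy problem (an assumption, chosen so that roc_auc and f1 apply).
X, y = load_breast_cancer(return_X_y=True)
param_grid = {"max_depth": [2, 4, 6]}
tree = DecisionTreeClassifier(random_state=0)

# 1) A predefined scorer name from the table above.
by_name = GridSearchCV(tree, param_grid, scoring="roc_auc", cv=3).fit(X, y)

# 2) A metric function wrapped into a scorer with make_scorer.
by_metric = GridSearchCV(tree, param_grid, scoring=make_scorer(f1_score), cv=3).fit(X, y)


# 3) Any callable with the signature scorer(estimator, X, y).
def error_rate_scorer(estimator, X, y):
    """Hypothetical custom scorer: negated error rate, so that higher is better."""
    return -float((estimator.predict(X) != y).mean())


by_callable = GridSearchCV(tree, param_grid, scoring=error_rate_scorer, cv=3).fit(X, y)

print(by_name.best_score_, by_metric.best_score_, by_callable.best_score_)
```

    GridSearchCV always maximizes the score, which is why error-style metrics are negated (compare the neg_ prefix on the predefined names).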

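    cv: the cross-validation strategy. The default None uses 3-fold cross-validation (5-fold from scikit-learn 0.22 on); an integer sets the number of folds, and an iterable yielding train/test splits can also be passed.
      refit: default True. After the search finishes, the estimator is fitted once more on the entire dataset passed to fit (training and validation folds together) using the best parameters found, and this refitted model is the final model.
      iid: default True. When True, the data is assumed to be identically distributed across folds, and the loss minimized is the total loss per sample rather than the mean loss across folds.
      verbose: verbosity of the log output, an int. 0: no output during training; 1: occasional output; >1: output for every sub-model.
      n_jobs: number of parallel jobs, an int. The default is 1; -1 uses as many jobs as there are CPU cores.
      pre_dispatch: controls how many jobs are dispatched during parallel execution. When n_jobs is greater than 1, the data is copied for each dispatched job, which can cause out-of-memory (OOM) errors; setting pre_dispatch bounds the number of jobs dispatched in advance, so the data is copied at most pre_dispatch times.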
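
    Several of these parameters can be combined as in the sketch below: param_grid given as a list of dicts, cv given as an explicit splitter object instead of an integer, and n_jobs, refit, and verbose set explicitly. The SVC estimator, the grid values, and the iris data are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A list of dicts defines separate sub-grids: gamma is only searched for the RBF kernel.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
]

# cv accepts an int, a splitter object, or an iterable of (train, test) index pairs.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(), param_grid, cv=cv, n_jobs=1, refit=True, verbose=0)
search.fit(X, y)

# 3 linear candidates + 3 * 2 RBF candidates
n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

    Splitting the grid this way avoids evaluating combinations that make no sense, such as gamma with a linear kernel.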

    7.10.3 Common methods and attributes

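    grid.fit(): runs the grid search
      grid_scores_: the evaluation results for every parameter combination (replaced by cv_results_ in sklearn.model_selection)
      best_params_: the parameter combination that achieved the best result
      best_score_: the best score observed during the search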
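
    A short sketch of reading these results after fit, using the cv_results_ dict that sklearn.model_selection exposes in place of grid_scores_. The KNeighborsClassifier and the n_neighbors values are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=3)
search.fit(X, y)

# cv_results_ holds one entry per parameter combination.
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print("%r: %.3f (+/- %.3f)" % (params, mean, std))

# best_index_ points at the winning row of cv_results_.
best_index = search.best_index_
print(search.best_params_, search.best_score_)
```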

    7.10.4 Code examples

    Example 1:

    # -*- coding: utf-8 -*-
    import psutil
    from lightgbm.sklearn import LGBMRegressor
    # sklearn.grid_search was removed in scikit-learn 0.20; use model_selection.
    from sklearn.model_selection import GridSearchCV
    
    # Project-specific helper from the author's own code base (not a public package).
    from featureProject.ly_features import make_train_set
    
    
    def print_best_score(gsearch, param_test):
        # Print the best cross-validated score.
        print("Best score: %0.3f" % gsearch.best_score_)
        print("Best parameters set:")
        # Print the value the best estimator uses for each searched parameter.
        best_parameters = gsearch.best_estimator_.get_params()
        for param_name in sorted(param_test.keys()):
            print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
    def lightGBM_CV():
        print('Memory usage: ' + str(psutil.virtual_memory().percent) + '%')
        data, labels = make_train_set(24000000, 25000000)
        values = data.values
        # Parameters to search; these override the estimator's values below.
        param_test = {
            'max_depth': range(5, 15, 2),
            'num_leaves': range(10, 40, 5),
        }
        estimator = LGBMRegressor(
            num_leaves=50,  # 50 was the best value found by an earlier CV run
            max_depth=13,
            learning_rate=0.1,
            n_estimators=1000,
            objective='regression',
            min_child_weight=1,
            subsample=0.8,
            colsample_bytree=0.8,
            nthread=7,
        )
        gsearch = GridSearchCV(estimator, param_grid=param_test, scoring='roc_auc', cv=5)
        gsearch.fit(values, labels)
        # Mean cross-validated score of every parameter combination.
        print(gsearch.cv_results_['mean_test_score'])
        print_best_score(gsearch, param_test)
    
    
    if __name__ == '__main__':
        lightGBM_CV()
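
    Example 1 depends on the author's featureProject module and on LightGBM, so it cannot be run as is. The following sketch shows the same search pattern in self-contained form, under clearly different assumptions: synthetic data from make_regression, scikit-learn's GradientBoostingRegressor swapped in for LGBMRegressor, and neg_mean_squared_error as the scoring criterion. The grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for make_train_set (an assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Searched parameters; everything else is fixed on the estimator.
param_test = {"max_depth": [2, 3], "n_estimators": [20, 40]}
estimator = GradientBoostingRegressor(learning_rate=0.1, subsample=0.8, random_state=0)

gsearch = GridSearchCV(estimator, param_grid=param_test,
                       scoring="neg_mean_squared_error", cv=3)
gsearch.fit(X, y)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters:", gsearch.best_params_)
```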
    

    Example 2:

    from __future__ import print_function
    from pprint import pprint
    from time import time
    import logging
    
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20
    from sklearn.pipeline import Pipeline
    
    
    # Display progress logs on stdout
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    
    
    ###############################################################################
    # Load some categories from the training set
    categories = [
        'alt.atheism',
        'talk.religion.misc',
    ]
    # Uncomment the following to do the analysis on all the categories
    #categories = None
    
    print("Loading 20 newsgroups dataset for categories:")
    print(categories)
    
    data = fetch_20newsgroups(subset='train', categories=categories)
    print("%d documents" % len(data.filenames))
    print("%d categories" % len(data.target_names))
    print()
    
    ###############################################################################
    # Define a typical text-classification workflow as a pipeline: vectorization followed by a simple classifier
    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier()),
    ])
    
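    # Parameter space:
    # the values to search exhaustively for each pipeline step, keyed as <step>__<parameter>.
    # For example, 'clf__penalty': ('l2', 'elasticnet') makes the search try both L2 and
    # elastic-net regularization for the SGDClassifier and keep whichever scores best.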
    parameters = {
        'vect__max_df': (0.5, 0.75, 1.0),
        #'vect__max_features': (None, 5000, 10000, 50000),
        'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
        #'tfidf__use_idf': (True, False),
        #'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (0.00001, 0.000001),
        'clf__penalty': ('l2', 'elasticnet'),
        #'clf__n_iter': (10, 50, 80),
    }
    
    if __name__ == "__main__":
    
        # Use GridSearchCV to find the best point in the parameter space
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    
        print("Performing grid search...")
        print("pipeline:", [name for name, _ in pipeline.steps])
        print("parameters:")
        pprint(parameters)
        t0 = time()
    
        # A single call to fit runs the whole search
        grid_search.fit(data.data, data.target)
        print("done in %0.3fs" % (time() - t0))
        print()
    
        # Print the best score
        print("Best score: %0.3f" % grid_search.best_score_)
        print("Best parameters set:")
        # Print the parameter values used by the best estimator
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print("\t%s: %r" % (param_name, best_parameters[param_name]))
    

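    Output (produced with a larger parameter grid than the listing above, which is why it contains extra parameter names such as clf__n_iter and vect__max_features):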

    Loading 20 newsgroups dataset for categories:
    ['alt.atheism', 'talk.religion.misc']
    1427 documents
    2 categories
    
    Performing grid search...
    pipeline: ['vect', 'tfidf', 'clf']
    parameters:
    {'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07),
     'clf__n_iter': (10, 50, 80),
     'clf__penalty': ('l2', 'elasticnet'),
     'tfidf__use_idf': (True, False),
     'vect__max_n': (1, 2),
     'vect__max_df': (0.5, 0.75, 1.0),
     'vect__max_features': (None, 5000, 10000, 50000)}
    done in 1737.030s
    
    Best score: 0.940
    Best parameters set:
        clf__alpha: 9.9999999999999995e-07
        clf__n_iter: 50
        clf__penalty: 'elasticnet'
        tfidf__use_idf: True
        vect__max_n: 2
        vect__max_df: 0.75
        vect__max_features: 50000
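
    The long running time reflects how quickly a grid grows. As a rough sketch, the number of candidates in the Example 2 listing can be counted up front with ParameterGrid from sklearn.model_selection. The fold count of 3 is an assumption matching the default described in section 7.10.2.

```python
from sklearn.model_selection import ParameterGrid

# The active (uncommented) entries of the grid in the Example 2 listing.
parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)          # 3 * 2 * 2 * 2 combinations
n_folds = 3                       # assumed number of cross-validation folds
n_fits = n_candidates * n_folds   # model fits before the final refit

# Each candidate is a plain dict of <step>__<parameter> keys.
first_candidate = list(grid)[0]
print(n_candidates, n_fits, first_candidate)
```

    Every additional parameter multiplies the count, so uncommenting the remaining entries in the listing enlarges the search considerably.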
    
