Feature Selection Based on xgboost

Author: 樱桃小丸子zz | Published: 2018-06-13 18:59

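In data mining competitions, choosing the right model matters, but feature engineering is just as important, so much so that people often joke that if your feature engineering is good enough, the championship is not far off.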

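Feature selection is a central part of feature engineering. Competition datasets typically come with a very large number of features, and training a model on all of them directly is slow and often gives poor results. Feature selection therefore picks out the features that are actually useful and passes only those to the model trained afterwards.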

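This article covers feature selection based on xgboost. Most people know xgboost as a very popular choice at the modeling stage, but it is also quite useful earlier on, at the feature selection stage.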

The code for xgboost-based feature selection is as follows:

import os
import pickle
import random

import pandas as pd
import xgboost as xgb

# Directory for the per-run feature scores; exist_ok avoids an error on re-runs.
os.makedirs('featurescore1', exist_ok=True)


def pipeline(iteration, random_seed, gamma, max_depth, lambd, subsample, colsample_bytree, min_child_weight):
    # Uses the module-level train_y and dtrain defined in the __main__ block below.
    params = {
        'booster': 'gbtree',
        'objective': 'rank:pairwise',
        # Ratio of negative to positive samples, to offset class imbalance.
        'scale_pos_weight': float(len(train_y) - sum(train_y)) / float(sum(train_y)),
        'eval_metric': 'auc',
        'gamma': gamma,
        'max_depth': max_depth,
        'lambda': lambd,
        'subsample': subsample,
        'colsample_bytree': colsample_bytree,
        'min_child_weight': min_child_weight,
        'eta': 0.2,
        'seed': random_seed,
        'nthread': 8
    }

    watchlist = [(dtrain, 'train')]
    model = xgb.train(params, dtrain, num_boost_round=700, evals=watchlist)

    # Save feature scores (how many splits use each feature), highest first.
    feature_score = model.get_fscore()
    feature_score = sorted(feature_score.items(), key=lambda x: x[1], reverse=True)
    fs = ["{0},{1}\n".format(key, value) for key, value in feature_score]

    with open('./featurescore1/feature_score_{0}.csv'.format(iteration), 'w') as f:
        f.write("feature,score\n")
        f.writelines(fs)


if __name__ == "__main__":
    train = pd.read_csv('./data/data_train_merge.csv')
    print(train.shape)
    train_y = train['label']
    train_x = train.drop(['id', 'label'], axis=1)
    dtrain = xgb.DMatrix(train_x, label=train_y)

    # Candidate parameter values; lists (not ranges) so they can be shuffled in place.
    random_seed = list(range(10000, 20000, 100))
    gamma = [i / 1000.0 for i in range(0, 300, 3)]
    max_depth = [5, 6, 7]
    lambd = list(range(400, 600, 2))
    subsample = [i / 1000.0 for i in range(500, 700, 2)]
    colsample_bytree = [i / 1000.0 for i in range(550, 750, 4)]
    min_child_weight = [i / 1000.0 for i in range(250, 550, 3)]

    random.shuffle(random_seed)
    random.shuffle(gamma)
    random.shuffle(max_depth)
    random.shuffle(lambd)
    random.shuffle(subsample)
    random.shuffle(colsample_bytree)
    random.shuffle(min_child_weight)

    # Record the shuffled parameter lists; pickle needs a binary file handle.
    with open('./featurescore1/params.pkl', 'wb') as f:
        pickle.dump((random_seed, gamma, max_depth, lambd, subsample, colsample_bytree, min_child_weight), f)

    # Train 36 models, each with a different parameter combination.
    for i in range(36):
        pipeline(i, random_seed[i], gamma[i], max_depth[i % 3], lambd[i], subsample[i], colsample_bytree[i], min_child_weight[i])
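
Shuffling each candidate list and then taking the i-th entry of every list amounts to a random search over parameter combinations. The script does not seed Python's random module, so the combinations change from one execution to the next (which is why it saves them to params.pkl). The sketch below is one possible reproducible variant, not the author's method: the function name sample_param_sets and its seed argument are hypothetical, and the value ranges are copied from the script.

```python
import random


def sample_param_sets(n_runs, seed=0):
    """Draw n_runs parameter combinations from the article's candidate ranges.

    Hypothetical helper: a local, seeded RNG makes the draw repeatable.
    """
    rng = random.Random(seed)
    candidates = {
        "seed": list(range(10000, 20000, 100)),
        "gamma": [i / 1000.0 for i in range(0, 300, 3)],
        "lambda": list(range(400, 600, 2)),
        "subsample": [i / 1000.0 for i in range(500, 700, 2)],
        "colsample_bytree": [i / 1000.0 for i in range(550, 750, 4)],
        "min_child_weight": [i / 1000.0 for i in range(250, 550, 3)],
    }
    shortest = min(len(values) for values in candidates.values())
    if n_runs > shortest:
        raise ValueError("n_runs exceeds the shortest candidate list")

    # Sampling without replacement mirrors "shuffle, then take the first n".
    columns = {name: rng.sample(values, n_runs) for name, values in candidates.items()}

    depths = [5, 6, 7]
    rng.shuffle(depths)

    param_sets = []
    for i in range(n_runs):
        params = {name: column[i] for name, column in columns.items()}
        # Cycle through the three depths, like max_depth[i % 3] in the script.
        params["max_depth"] = depths[i % len(depths)]
        param_sets.append(params)
    return param_sets
```

Each returned dict could be merged with the fixed settings (booster, objective, eta and so on) before calling xgb.train, and the same seed always yields the same 36 combinations.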

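Because xgboost's results depend heavily on its parameters, the candidate parameter values are shuffled so that each run trains with a different combination. The feature scores produced by these different parameter combinations can then be averaged, and the features with the highest average score are selected.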
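
The averaging step is not shown in the article, so here is a minimal sketch of how it might look, using only the standard library. It assumes the feature_score_*.csv files written by the script above, with their "feature,score" header. The function names average_feature_scores and select_top_features are hypothetical. One further assumption: a feature missing from a run's file (get_fscore only lists features that were used in at least one split) is counted as 0 for that run.

```python
import csv
import glob
import os
from collections import defaultdict


def average_feature_scores(score_dir):
    """Average each feature's score over all feature_score_*.csv files.

    Features absent from a file contribute 0 for that run (an assumption
    of this sketch). Returns (feature, mean_score) pairs, highest first.
    """
    paths = sorted(glob.glob(os.path.join(score_dir, "feature_score_*.csv")))
    if not paths:
        raise ValueError("no feature_score_*.csv files found in " + score_dir)

    totals = defaultdict(float)
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["feature"]] += float(row["score"])

    n_runs = len(paths)
    averaged = {name: total / n_runs for name, total in totals.items()}
    # Sort by descending mean score; ties are broken by feature name.
    return sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))


def select_top_features(score_dir, top_k):
    """Return the names of the top_k features by mean score."""
    return [name for name, _ in average_feature_scores(score_dir)[:top_k]]
```

For example, select_top_features('./featurescore1', 100) would return 100 feature names, which could then be used to subset the training columns. The cutoff of 100 is only illustrative; the article does not state how many features to keep.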

Article link: https://www.haomeiwen.com/subject/pihseftx.html
