day04: A quick implementation for a binary classification competition

Author: wenyilab | Published: 2020-02-03 09:57

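    Cross-validation is a statistical method for assessing a classifier's performance. The basic idea is to partition the original data into groups, using one part as the training set and the other as the validation set: the classifier is first trained on the training set, the trained model is then tested on the validation set, and the result serves as the performance metric for the classifier.
    KFold
    The original data is split into k groups (usually of equal size). Each subset serves once as the validation set while the remaining k-1 subsets form the training set, which yields k models. The mean of the k validation-set classification accuracies is the classifier's performance metric under this k-fold CV.
    StratifiedKFold is a variant of k-fold that returns stratified folds: in each fold, the class proportions are roughly the same as in the full dataset.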
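    The difference between the two splitters is easiest to see on imbalanced labels. The following is a minimal sketch on made-up toy labels (10% positives, not the competition data) that prints the share of positives in each validation fold for both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (made up for illustration): 100 samples, 10% positives.
y = np.array([1] * 10 + [0] * 90)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature; only the labels matter here


def positive_rates(splitter):
    """Return the share of positives in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]


kfold_rates = positive_rates(KFold(n_splits=5, shuffle=True, random_state=42))
strat_rates = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold          :", kfold_rates)  # may differ from fold to fold
print("StratifiedKFold:", strat_rates)  # every fold keeps the 10% positive rate
```

    With a positive rate near 12%, as in the bank data used later, plain KFold can produce folds whose class ratio drifts from the full dataset, which is why the stratified variant is often preferred for imbalanced binary targets.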

    Installing LightGBM on macOS
    git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
    export CXX=g++-7 CC=gcc-7
    mkdir build ; cd build
    cmake ..
    make -j4
    
    brew install libomp
    or
    brew install open-mpi
    

    Case 1: a precision-marketing solution for a bank
    https://www.kesci.com/home/competition/5c234c6626ba91002bfdfdd3/leaderboard
    Import packages:

    from sklearn.model_selection import train_test_split
    import lightgbm as lgb
    import numpy as np
    import pandas as pd
    # precision and recall
    from sklearn.metrics import precision_score,recall_score
    

    Read the data:

    path = './'
    train = pd.read_csv(path+'input/train_set.csv')
    test = pd.read_csv(path+'input/test_set.csv')
    print(train.info())
    print(test.info())
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 25317 entries, 0 to 25316
    Data columns (total 18 columns):
    ID           25317 non-null int64
    age          25317 non-null int64
    job          25317 non-null object
    marital      25317 non-null object
    education    25317 non-null object
    default      25317 non-null object
    balance      25317 non-null int64
    housing      25317 non-null object
    loan         25317 non-null object
    contact      25317 non-null object
    day          25317 non-null int64
    month        25317 non-null object
    duration     25317 non-null int64
    campaign     25317 non-null int64
    pdays        25317 non-null int64
    previous     25317 non-null int64
    poutcome     25317 non-null object
    y            25317 non-null int64
    dtypes: int64(9), object(9)
    memory usage: 3.5+ MB
    None
    ...
    

    Descriptive statistics for the training set:

    train.describe()
    

    Output:

                      ID            age        balance            day       duration       campaign          pdays       previous              y
    count   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000
    mean    12659.000000      40.935379    1357.555082      15.835289     257.732393       2.772050      40.248766       0.591737       0.116957
    std      7308.532719      10.634289    2999.822811       8.319480     256.975151       3.136097     100.213541       2.568313       0.321375
    min         1.000000      18.000000   -8019.000000       1.000000       0.000000       1.000000      -1.000000       0.000000       0.000000
    25%      6330.000000      33.000000      73.000000       8.000000     103.000000       1.000000      -1.000000       0.000000       0.000000
    50%     12659.000000      39.000000     448.000000      16.000000     181.000000       2.000000      -1.000000       0.000000       0.000000
    75%     18988.000000      48.000000    1435.000000      21.000000     317.000000       3.000000      -1.000000       0.000000       0.000000
    max     25317.000000      95.000000  102127.000000      31.000000    3881.000000      55.000000     854.000000     275.000000       1.000000
    

    Value counts for the job feature:

    train.job.value_counts()
    

    Output:

    blue-collar      5456
    management       5296
    technician       4241
    admin.           2909
    services         2342
    retired          1273
    self-employed     884
    entrepreneur      856
    unemployed        701
    housemaid         663
    student           533
    unknown           163
    Name: job, dtype: int64
    

    Test set (label it with a placeholder and concatenate it with the training set):

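    test['y'] = -1  # placeholder label that marks the test rows
    print(len(test.columns))
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.concat([train, test]).reset_index(drop=True)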
    

    Output:

    18
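
    The idea behind the placeholder label is that train and test can share one preprocessing pass and still be separated afterwards. A minimal sketch on tiny made-up frames (the values are illustrative, not taken from train_set.csv or test_set.csv):

```python
import pandas as pd

# Tiny made-up frames standing in for the real train and test files.
train = pd.DataFrame({"ID": [1, 2, 3], "job": ["admin.", "student", "admin."], "y": [0, 1, 0]})
test = pd.DataFrame({"ID": [4, 5], "job": ["student", "retired"]})

test["y"] = -1  # sentinel: -1 never occurs as a real label
data = pd.concat([train, test]).reset_index(drop=True)

# After shared preprocessing, the sentinel recovers the two parts.
train_part = data[data["y"] != -1]
test_part = data[data["y"] == -1]
print(len(train_part), len(test_part))
```

    This only works when the sentinel value cannot collide with a genuine label, which holds here because y is 0 or 1.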
    

    Encoding and feature construction:

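    from tqdm.auto import tqdm  # tqdm_notebook is deprecated in current tqdm
    from sklearn.preprocessing import LabelEncoder
    # categorical (object-dtype) columns
    cat_col = [i for i in data.select_dtypes(object).columns if i not in ['ID','y']]
    for i in tqdm(cat_col):
        lbl = LabelEncoder()
        # count feature: number of rows that share this category value
        data['count_'+i] = data.groupby([i])[i].transform('count')
        # replace the string categories with integer codes
        data[i] = lbl.fit_transform(data[i].astype(str))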
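
    To see what the two transformations in the loop produce, here is a minimal sketch on a made-up five-row column (the column `job_code` is a hypothetical name used so the original strings stay visible; the loop above overwrites the column in place instead).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column for illustration.
df = pd.DataFrame({"job": ["admin.", "student", "admin.", "retired", "admin."]})

# Count (frequency) encoding: each row gets the number of rows sharing its category.
df["count_job"] = df.groupby("job")["job"].transform("count")

# Label encoding: the sorted categories are mapped to 0..n_classes-1.
encoder = LabelEncoder()
df["job_code"] = encoder.fit_transform(df["job"].astype(str))

print(df)
print(list(encoder.classes_))
```

    Because train and test were concatenated beforehand, both the counts and the integer codes are computed over the combined data, so a category gets the same code in both sets.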
    

    Features:

    feats = [i for i in data.columns if i not in ['ID','y']]
    feats
    

    Output:

    ['age',
     'job',
     'marital',
     'education',
     'default',
     'balance',
     'housing',
     'loan',
     'contact',
     'day',
     'month',
     'duration',
     'campaign',
     'pdays',
     'previous',
     'poutcome',
     'count_job',
     'count_marital',
     'count_education',
     'count_default',
     'count_housing',
     'count_loan',
     'count_contact',
     'count_month',
     'count_poutcome']
    

    Build the model:

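    from xgboost import XGBClassifier
    model = XGBClassifier(
            learning_rate=0.01,  # learning rate
            n_estimators=3000,  # number of boosting rounds (trees)
            max_depth=4,  # maximum tree depth
            objective='binary:logistic',
            random_state=27  # random seed (replaces the deprecated seed argument)
        )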
    

    Train the model:

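    # rows with a real label are training data; rows with y == -1 are the test set
    train_x = data[data['y'] != -1][feats]
    train_y = data[data['y'] != -1]['y']
    testx = data[data['y'] == -1][feats]
    model.fit(train_x, train_y)
    # predicted probability of the positive class
    test_pre = model.predict_proba(testx)[:, 1]
    # copy() avoids assigning into a view of data
    pre = data[data['y'] == -1][['ID']].copy()
    pre['pred'] = test_pre
    pre.head()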
    

    Output:

              ID      pred
    25317  25318  0.040049
    25318  25319  0.008510
    25319  25320  0.001712
    25320  25321  0.673808
    25321  25322  0.035965
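
    The model above is fitted on all labelled rows, so nothing in that step indicates how good the predictions are. One common way to check is a hold-out split with train_test_split, scored with the precision_score and recall_score functions imported at the top. The following is a minimal sketch under loud assumptions: the data is synthetic, GradientBoostingClassifier from scikit-learn stands in for XGBClassifier so the example needs no extra dependency, and the 0.5 threshold is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (roughly 12% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.88], random_state=0)

# Hold out 20% of the labelled rows, keeping the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in for the XGBClassifier used in the article.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=27)
clf.fit(X_tr, y_tr)

val_prob = clf.predict_proba(X_val)[:, 1]
val_label = (val_prob >= 0.5).astype(int)  # arbitrary threshold for hard labels

auc = roc_auc_score(y_val, val_prob)
precision = precision_score(y_val, val_label, zero_division=0)
recall = recall_score(y_val, val_label, zero_division=0)
print("AUC      :", auc)
print("precision:", precision)
print("recall   :", recall)
```

    AUC is computed from the probabilities, while precision and recall need hard labels and therefore depend on the chosen threshold.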
    

    Ten-fold cross-validation:

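    from sklearn.model_selection import KFold
    n_split = 10
    kfold = KFold(n_splits=n_split, shuffle=True, random_state=42)
    train_x = data[data['y'] != -1][feats]
    train_y = data[data['y'] != -1]['y']
    res = data[data['y'] == -1][['ID']].copy()
    test_x = data[data['y'] == -1][feats]
    res['pred'] = 0.0
    
    for fold, (train_idx, val_idx) in enumerate(kfold.split(train_x), start=1):
        # use a different seed in each fold
        model.set_params(random_state=fold)
        # split() yields positional indices, so select rows with iloc
        train_x1 = train_x.iloc[train_idx]
        train_y1 = train_y.iloc[train_idx]
        test_x1 = train_x.iloc[val_idx]
        test_y1 = train_y.iloc[val_idx]
        # note: recent xgboost releases expect eval_metric and early_stopping_rounds
        # on the estimator (constructor or set_params) rather than in fit()
        model.fit(train_x1, train_y1, eval_set=[(train_x1, train_y1), (test_x1, test_y1)],
                  eval_metric='auc', early_stopping_rounds=100)
        res['pred'] += model.predict_proba(test_x)[:, 1]
    
    # average the test predictions over the folds
    res['pred'] = res['pred'] / n_split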
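
    The loop averages the test predictions over the folds, but it does not record the per-fold validation score whose mean the opening section describes as the performance metric. A minimal sketch of that bookkeeping, again under stated assumptions: the data is synthetic, LogisticRegression stands in for the boosted model to keep the example fast, and StratifiedKFold with 5 folds replaces the plain 10-fold KFold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: the first 1000 rows play the labelled training set,
# the remaining 200 rows play the unlabelled test set.
X_all, y_all = make_classification(n_samples=1200, n_features=10, weights=[0.88], random_state=0)
X, y, X_test = X_all[:1000], y_all[:1000], X_all[1000:]

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

fold_auc = []
test_pred = np.zeros(len(X_test))

for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)  # stand-in for the boosted model
    clf.fit(X[train_idx], y[train_idx])
    # Score on the held-out fold only.
    val_prob = clf.predict_proba(X[val_idx])[:, 1]
    fold_auc.append(roc_auc_score(y[val_idx], val_prob))
    # Accumulate an equal-weight average of the test predictions.
    test_pred += clf.predict_proba(X_test)[:, 1] / n_splits

print("AUC per fold:", np.round(fold_auc, 4))
print("mean AUC    :", float(np.mean(fold_auc)))
```

    The mean AUC gives a rough estimate of how the averaged predictions might score, and the spread across folds hints at how stable that estimate is.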
    
