Cross-validation is a statistical method for assessing a classifier's performance. The basic idea is to partition the original data into groups, using one part as the training set and the other as the validation set: the classifier is first trained on the training set, then the trained model is tested on the validation set, and the validation score serves as the performance estimate for the classifier.
K-fold
The original data is split into k groups (usually of equal size). Each subset in turn serves as the validation set while the remaining k-1 subsets form the training set, yielding k models; the mean validation accuracy of these k models is taken as the classifier's performance under k-fold CV.
StratifiedKFold is a variant of k-fold that returns stratified folds: in each fold, the proportion of samples from each class is roughly the same as in the full dataset.
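A minimal sketch contrasting the two (standard scikit-learn API; the toy data is made up purely for illustration):
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
# Toy data: 10 samples with an imbalanced label vector (6 zeros, 4 ones)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# Plain k-fold: folds ignore the label distribution
for train_idx, val_idx in KFold(n_splits=2, shuffle=True, random_state=42).split(X):
    print('KFold val labels:          ', y[val_idx])
# Stratified k-fold: each fold keeps roughly the full-data class ratio
for train_idx, val_idx in StratifiedKFold(n_splits=2, shuffle=True, random_state=42).split(X, y):
    print('StratifiedKFold val labels:', y[val_idx])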
Installing LightGBM on macOS
git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
export CXX=g++-7 CC=gcc-7
mkdir build ; cd build
cmake ..
make -j4
If you build with the default Apple Clang instead of gcc, install OpenMP support first:
brew install libomp
Or, for the MPI version:
brew install open-mpi
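Once built, a quick smoke test confirms the Python package is usable (this assumes the Python bindings were installed as well, e.g. with pip install lightgbm):
import numpy as np
import lightgbm as lgb
# Fit a tiny model on random data just to verify the installation
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
clf = lgb.LGBMClassifier(n_estimators=10)
clf.fit(X, y)
print(lgb.__version__, clf.predict_proba(X[:3]))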
Case 1: A precision marketing solution for a bank
https://www.kesci.com/home/competition/5c234c6626ba91002bfdfdd3/leaderboard
Import packages:
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import numpy as np
import pandas as pd
# precision and recall metrics
from sklearn.metrics import precision_score,recall_score
Read the data:
path = './'
train = pd.read_csv(path+'input/train_set.csv')
test = pd.read_csv(path+'input/test_set.csv')
print(train.info())
print(test.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25317 entries, 0 to 25316
Data columns (total 18 columns):
ID 25317 non-null int64
age 25317 non-null int64
job 25317 non-null object
marital 25317 non-null object
education 25317 non-null object
default 25317 non-null object
balance 25317 non-null int64
housing 25317 non-null object
loan 25317 non-null object
contact 25317 non-null object
day 25317 non-null int64
month 25317 non-null object
duration 25317 non-null int64
campaign 25317 non-null int64
pdays 25317 non-null int64
previous 25317 non-null int64
poutcome 25317 non-null object
y 25317 non-null int64
dtypes: int64(9), object(9)
memory usage: 3.5+ MB
None
...
Descriptive statistics for the training set:
train.describe()
Output:
ID age balance day duration campaign pdays previous y
count 25317.000000 25317.000000 25317.000000 25317.000000 25317.000000 25317.000000 25317.000000 25317.000000 25317.000000
mean 12659.000000 40.935379 1357.555082 15.835289 257.732393 2.772050 40.248766 0.591737 0.116957
std 7308.532719 10.634289 2999.822811 8.319480 256.975151 3.136097 100.213541 2.568313 0.321375
min 1.000000 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000 0.000000
25% 6330.000000 33.000000 73.000000 8.000000 103.000000 1.000000 -1.000000 0.000000 0.000000
50% 12659.000000 39.000000 448.000000 16.000000 181.000000 2.000000 -1.000000 0.000000 0.000000
75% 18988.000000 48.000000 1435.000000 21.000000 317.000000 3.000000 -1.000000 0.000000 0.000000
max 25317.000000 95.000000 102127.000000 31.000000 3881.000000 55.000000 854.000000 275.000000 1.000000
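The mean of y (about 0.117) shows the positive class is rare, which is worth keeping in mind when choosing evaluation metrics; a quick check:
print(train['y'].value_counts(normalize=True))  # roughly 88% zeros vs 12% ones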
Value counts for the job feature:
train.job.value_counts()
Output:
blue-collar 5456
management 5296
technician 4241
admin. 2909
services 2342
retired 1273
self-employed 884
entrepreneur 856
unemployed 701
housemaid 663
student 533
unknown 163
Name: job, dtype: int64
Mark the test set with a sentinel label and concatenate it with the training set:
test['y'] = -1
print(len(test.columns))
# train.append(test) was removed in pandas >= 2.0; use pd.concat instead
data = pd.concat([train, test]).reset_index(drop=True)
Output:
18
Encode and construct features:
from tqdm.auto import tqdm  # tqdm_notebook is deprecated; tqdm.auto picks the right frontend
from sklearn.preprocessing import LabelEncoder
cat_col = [i for i in data.select_dtypes(object).columns if i not in ['ID', 'y']]
for i in tqdm(cat_col):
    lbl = LabelEncoder()
    # Count encoding: frequency of each category value over the combined data
    data['count_' + i] = data.groupby([i])[i].transform('count')
    # Label encoding: map each category value to an integer code
    data[i] = lbl.fit_transform(data[i].astype(str))
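To see what the loop produces, here is the same pair of transforms applied to a toy column (hypothetical values, purely for illustration):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
toy = pd.DataFrame({'job': ['admin.', 'technician', 'admin.', 'unknown']})
toy['count_job'] = toy.groupby(['job'])['job'].transform('count')  # [2, 1, 2, 1]
toy['job'] = LabelEncoder().fit_transform(toy['job'].astype(str))  # [0, 1, 0, 2]
print(toy)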
Feature list (note that count_* features were built only for the categorical columns):
feats = [i for i in data.columns if i not in ['ID','y']]
feats
Output:
['age',
'job',
'marital',
'education',
'default',
'balance',
'housing',
'loan',
'contact',
'day',
'month',
'duration',
'campaign',
'pdays',
'previous',
'poutcome',
'count_job',
'count_marital',
'count_education',
'count_default',
'count_housing',
'count_loan',
'count_contact',
'count_month',
'count_poutcome']
Build the model:
from xgboost import XGBClassifier
model = XGBClassifier(
    learning_rate=0.01,           # learning rate (shrinkage per boosting round)
    n_estimators=3000,            # number of boosting rounds (trees)
    max_depth=4,                  # maximum tree depth
    objective='binary:logistic',  # binary classification with probability output
    random_state=27               # seed= is a deprecated alias for random_state
)
Train the model and predict on the test set:
train_x = data[data['y'] != -1][feats]
train_y = data[data['y'] != -1]['y']
test_x = data[data['y'] == -1][feats]
model.fit(train_x, train_y)
test_pre = model.predict_proba(test_x)[:, 1]  # probability of the positive class
pre = data[data['y'] == -1][['ID']]
pre['pred'] = test_pre
pre.head()
Output:
ID pred
25317 25318 0.040049
25318 25319 0.008510
25319 25320 0.001712
25320 25321 0.673808
25321 25322 0.035965
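To turn these predictions into a submission file (the file name here is an assumption; check the exact format the competition expects):
pre.to_csv('submission.csv', index=False)  # hypothetical file name/format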
10-fold cross-validation:
from sklearn.model_selection import KFold
n_split = 10
kfold = KFold(n_splits=n_split, shuffle=True, random_state=42)
train_x = data[data['y'] != -1][feats]
train_y = data[data['y'] != -1]['y']
res = data[data['y'] == -1][['ID']]
test_x = data[data['y'] == -1][feats]
res['pred'] = 0
for train_idx, val_idx in kfold.split(train_x):
    # Vary the seed per fold so the trees differ across the k models
    model.set_params(random_state=model.random_state + 1)
    # KFold returns positional indices, so use .iloc rather than .loc
    train_x1 = train_x.iloc[train_idx]
    train_y1 = train_y.iloc[train_idx]
    val_x1 = train_x.iloc[val_idx]
    val_y1 = train_y.iloc[val_idx]
    # Note: in xgboost >= 2.0, eval_metric and early_stopping_rounds are
    # constructor arguments rather than fit() arguments
    model.fit(train_x1, train_y1,
              eval_set=[(train_x1, train_y1), (val_x1, val_y1)],
              eval_metric='auc', early_stopping_rounds=100)
    # Accumulate the test predictions of each fold's model
    res['pred'] += model.predict_proba(test_x)[:, 1]
res['pred'] = res['pred'] / n_split
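The loop above averages the test-set predictions but never scores the model on held-out data. A common companion is out-of-fold (OOF) evaluation; a minimal sketch reusing the variables defined above (it refits each fold, and the 0.5 threshold is an arbitrary assumption):
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score
oof = np.zeros(len(train_x))
for train_idx, val_idx in kfold.split(train_x):
    model.fit(train_x.iloc[train_idx], train_y.iloc[train_idx])
    # Store predictions for the rows this fold's model never trained on
    oof[val_idx] = model.predict_proba(train_x.iloc[val_idx])[:, 1]
print('out-of-fold AUC:', roc_auc_score(train_y, oof))
# precision/recall need a hard threshold; 0.5 is an arbitrary choice
print('precision:', precision_score(train_y, oof > 0.5))
print('recall:   ', recall_score(train_y, oof > 0.5))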