This post is a write-up of the 2019 iFLYTEK Advertising Anti-Fraud Challenge (科大讯飞广告营销反作弊挑战赛). Our team finished 9th in the semi-final; the full modeling process is described below.
- Problem Overview
  1.1 Background
  1.2 Data
  1.3 Evaluation Metric
- EDA
  2.1 Imports & Data Loading
  2.2 Overall Data Distribution
  2.3 Unique Values
  2.4 Missing Values
  2.5 Value Distributions per Variable
- Data Preprocessing
  3.1 Missing Value Handling
  3.2 Time Features
  3.3 IP Features
  3.4 Device Features
- Feature Construction
  4.1 Transformation Features
  4.2 Aggregation Features
  4.3 Feature Encoding
- Model Training
I. Problem Overview
1. Background
Ad fraud is a major challenge for digital marketing, and as the underlying techniques mature, fraud increasingly operates at scale and in organized groups.
The competition uses live traffic data from the iFLYTEK AI Marketing Cloud as training samples. The task is to build a model on five groups of fields (basic data, media information, time, IP information, and device information) and predict whether each piece of traffic is fraudulent.
2. Data
3. Evaluation Metric
F1-score is used as the evaluation metric.
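For reference, F1 is the harmonic mean of precision and recall:
F1 = 2 * Precision * Recall / (Precision + Recall)
Because predictions are binarized before scoring, the decision threshold matters as much as the raw probabilities (see the threshold note at the end of the training section).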
II. EDA
1. Imports & Data Loading
import numpy as np
import pandas as pd
import catboost as cbt
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import f1_score
from datetime import datetime,timedelta
import time
import scipy.spatial.distance as dist
from scipy import stats
from collections import Counter
import math
from tqdm import tqdm
import os
import re
import gc
# Load train/test and stack them for joint preprocessing
train = pd.read_table('round1_iflyad_anticheat_traindata.txt')
test = pd.read_table('round1_iflyad_anticheat_testdata_feature.txt')
data = pd.concat([train,test],ignore_index=True,sort=False)
print('train shape:{},test shape:{},data shape:{}'.format(train.shape,test.shape,data.shape))
2. Overall Data Distribution
2.1 Preview the first 5 rows
data.head()
2.2 Column dtypes
print(data.dtypes)
2.3 Descriptive statistics for numeric columns
print(data.describe())
2.4 Label distribution
count_classes = train['label'].value_counts()
count_classes.plot(kind='bar')
plt.title('Class Histogram')
plt.xlabel('Class')
plt.ylabel('Frequency')
As the chart below shows, positive and negative samples are roughly balanced, so no resampling or other imbalance handling is needed.
3. Unique Values
unique_train = train.nunique(dropna=False)
unique_test = test.nunique(dropna=False)
unique_data = data.nunique(dropna=False)
unique_same = unique_train+unique_test-unique_data
unique_table = pd.concat([unique_train,unique_test,unique_same],axis=1,join='outer',sort=False)
unique_table = unique_table.rename(columns={0:'unique_train',1:'unique_test',2:'unique_same'})
unique_table = unique_table.sort_values(by='unique_train',ascending=False)
unique_table.drop(['label','sid'],axis=0,inplace=True)
overlap_train = {}
overlap_test = {}
# For each column, the fraction of train/test rows whose value also appears in the other set
for column in data.columns.drop(['sid','label','nginxtime']):
    intersection = pd.DataFrame(set(train[column]).intersection(set(test[column]))).reset_index()
    intersection.rename(columns={0:column},inplace=True)
    overlap_train_ = train[column].reset_index()
    overlap_test_ = test[column].reset_index()
    overlap_train[column] = len(overlap_train_.merge(intersection,on=column,how='inner'))/len(train)
    overlap_test[column] = len(overlap_test_.merge(intersection,on=column,how='inner'))/len(test)
overlap_train = pd.Series(overlap_train)
overlap_test = pd.Series(overlap_test)
unique_table = pd.concat([unique_table,overlap_train,overlap_test],axis=1,join='outer',sort=False)
unique_table.rename(columns={0:'overlap_train',1:'overlap_test'},inplace=True)
The columns of the table above: (1) number of unique values of each variable in the train set; (2) number of unique values in the test set; (3) number of unique values shared by train and test; (4) the fraction of train rows covered by those shared values, and the fraction of test rows covered by them.
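The same overlap ratios can also be computed more directly with isin; a minimal sketch (not the original code; 'pkgname' is just an example column):
shared = set(train['pkgname']) & set(test['pkgname'])
print(train['pkgname'].isin(shared).mean(),test['pkgname'].isin(shared).mean())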
4. Missing Values
mis_train = train.isnull().sum()
mis_test = test.isnull().sum()
mis_data = data.isnull().sum()
mis_train_percent = 100*mis_train/len(train)
mis_test_percent = 100*mis_test/len(test)
mis_data_percent = 100*mis_data/len(data)
mis_table = pd.concat([mis_train,mis_train_percent,mis_test,mis_test_percent,mis_data,mis_data_percent],axis=1,join='outer',
sort=False)
mis_table = mis_table.rename(columns={0:'mis_train',1:'mis_train_percent',2:'mis_test',3:'mis_test_percent',
4:'mis_data',5:'mis_data_percent'})
mis_table = mis_table.sort_values(by='mis_train',ascending=False)
mis_table.drop('label',axis=0,inplace=True)
mis_table = mis_table[mis_table['mis_data']!=0]
5. Value Distributions per Variable
This part plots, for each of the 17 categorical variables, the number of rows per value and the fraction of those rows with label=1.
for column in data.select_dtypes('object').columns.drop('sid'):
    bar_table = train[column].value_counts().reset_index()
    bar_table = bar_table.rename(columns={'index':column,column:'count'})
    label1_table = data[data['label']==1].groupby([column]).size().reset_index()
    label1_table = label1_table.rename(columns={0:'label_count'})
    bar_table = bar_table.merge(label1_table,on=column,how='left')
    bar_table.sort_values(by='count',ascending=True,inplace=True)
    bar_table['label_count'] = bar_table['label_count']/bar_table['count']
    bar_table = bar_table.tail(20)
    bar_table.fillna(0,inplace=True)
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.barh(bar_table[column],bar_table['label_count'],align='center',color='c',
             tick_label=bar_table[column])
    plt.xlabel('{} % of label'.format(column))
    for a,b,c in zip(bar_table[column],bar_table['count'],bar_table['label_count']):
        print(a,b,c)
        plt.text(c+0.01,a,'%d'%(b),ha='center',va='bottom')
    plt.savefig('{}.jpg'.format(column))
    plt.show()
5.1 pkgname
The chart makes it clear that many pkgname values lean strongly toward one label, so it should be a useful feature. Note, however, that the value 'empty' has by far the most records; it presumably stands for missing values.
5.2 ver
5.3 adunitshowid
5.4 mediashowid
5.5 ip
5.6 city
5.7 reqrealip
5.8 adidmd5
5.9 imeimd5
5.10 idfamd5
Apart from 'empty', almost every value occurs exactly once, i.e., close to one value per row. With no clustering or dispersion structure to exploit, this column can be dropped.
5.11 openudidmd5
This variable has little discriminative power and can be dropped as well.
5.12 macmd5
5.13 model
5.14 make
5.15 os
5.16 osv
5.17 lan
III. Data Preprocessing
3.1 Missing Value Handling
# Fill categorical NaNs with the sentinel 'empty' and numeric NaNs with 9999 (label excluded)
cols = data.columns
for col in cols:
    if data[col].dtypes==object:
        data[col] = data[col].fillna('empty')
    elif col!='label':
        data[col] = data[col].fillna(9999)
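A one-line check that nothing apart from the test-set label is left unfilled (assuming the fill ran on the stacked data):
assert data.drop(columns=['label']).isnull().sum().sum() == 0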
3.2 Time Features
# nginxtime is a Unix timestamp in milliseconds; *1e6 turns it into nanoseconds, +8h shifts to Beijing time
data['time'] = pd.to_datetime(data['nginxtime']*1e+6)+timedelta(hours=8)
data['day'] = data['time'].dt.day
data['hour'] = data['time'].dt.hour
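A quick sanity check of the conversion on a made-up timestamp:
print(pd.to_datetime(1559836800000*1e+6)+timedelta(hours=8))  # 2019-06-07 00:00:00 (Beijing time)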
3.3 IP Features
- ip reqrealip
data['ip_0'] = data['ip'].astype(str).map(lambda x:'.'.join(x.split('.')[:1]))
data['ip_1'] = data['ip'].astype(str).map(lambda x:'.'.join(x.split('.')[0:2]))
data['ip_2'] = data['ip'].astype(str).map(lambda x:'.'.join(x.split('.')[0:3]))
data['reqrealip_0'] = data['reqrealip'].astype(str).map(lambda x:'.'.join(x.split('.')[:1]))
data['reqrealip_1'] = data['reqrealip'].astype(str).map(lambda x:'.'.join(x.split('.')[:2]))
data['reqrealip_2'] = data['reqrealip'].astype(str).map(lambda x:'.'.join(x.split('.')[:3]))
data['ip_equal'] = (data['ip'].astype(str)==data['reqrealip'].astype(str)).astype(int)
ip_feat = ['ip_0','ip_1','ip_2','reqrealip_0','reqrealip_1','reqrealip_2','ip_equal']
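For a made-up address, the derived prefixes look like this:
# Hypothetical example: '117.136.79.101' -> '117', '117.136', '117.136.79'
x = '117.136.79.101'
print('.'.join(x.split('.')[:1]),'.'.join(x.split('.')[:2]),'.'.join(x.split('.')[:3]))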
- city province
province_city = pd.read_excel('city_province.xlsx',encoding='gb18030')
data = data.merge(province_city,on='city',how='left')
# Patch cities that the lookup table misses (mostly autonomous prefectures, leagues and regions)
city2province = {
    '迪庆藏族自治州':'云南省','巴音郭楞蒙古自治州':'新疆','襄阳市':'湖北省','甘孜藏族自治州':'四川省',
    '昌吉回族自治州':'新疆','伊犁哈萨克自治州':'新疆','凉山彝族自治州':'四川省','黔南布依族苗族自治州':'贵州省',
    '楚雄彝族自治州':'云南省','红河哈尼族彝族自治州':'云南省','黔东南苗族侗族自治州':'贵州省','黔西南布依族苗族自治州':'贵州省',
    '延边朝鲜族自治州':'吉林省','恩施土家族苗族自治州':'湖北省','大理白族自治州':'云南省','大兴安岭地区':'黑龙江省',
    '文山壮族苗族自治州':'云南省','湘西土家族苗族自治州':'湖南省','和田地区':'新疆','塔城地区':'新疆',
    '锡林郭勒盟':'内蒙古','喀什地区':'新疆','兴安盟':'内蒙古','阿克苏地区':'新疆',
    '临夏回族自治州':'甘肃省','西双版纳傣族自治州':'云南省','甘南藏族自治州':'甘肃省','香港':'香港',
    '阿勒泰地区':'新疆','博尔塔拉蒙古自治州':'新疆','黄南藏族自治州':'青海省','海西蒙古族藏族自治州':'青海省',
    '阿拉善盟':'内蒙古','台湾':'台湾','自治区直辖县级行政区划':'emp222','澳门':'澳门',
    '克孜勒苏柯尔克孜自治州':'新疆','海北藏族自治州':'青海省','海南藏族自治州':'青海省','玉树藏族自治州':'青海省',
    '果洛藏族自治州':'青海省','阿里地区':'西藏',
}
mask = data['city'].isin(city2province)
data.loc[mask,'new_province'] = data.loc[mask,'city'].map(city2province)
3.4 Device Features
- model
data['model'] = data['model'].str.lower()
data['make'] = data['make'].str.lower()
data['model1'] = data['model']
data['make1'] = data['make']
# Strip brand substrings out of model1, leaving only the model identifier
brand_strs = ['lg-','oppo','vivo','oneplus','samsung-','zte','letv','huawei','lenovo','coolpad',
              'gionee','htc','samsung__samsung__','nokia','hisense','samsung_','xiaomi__',
              'meizu','lephone','meitu']
for s in brand_strs:
    data['model1'].replace(s,'',inplace=True,regex=True)
# Clean URL-encoded and non-printable characters out of model, make, lan, model1, make1
from urllib.parse import unquote
def url_clean(x):
    x = unquote(x,'utf-8').replace('%2B',' ').replace('%20',' ').replace('%2F','/')\
        .replace('%3F','?').replace('%25','%').replace('%23','#').replace('.',' ').replace('??',' ')\
        .replace('%26',' ').replace('%3D','=').replace('%22','').replace('_',' ')\
        .replace('+',' ').replace('-',' ').replace('__',' ').replace('  ',' ').replace(',',' ')
    return x
for fea in ['model','make','lan','model1','make1']:
    data[fea] = data[fea].astype('str').str.lower().map(url_clean)
- make
# Clean make: if it contains one of these manufacturer strings, collapse it to that string
xinghao = ['oppo','vivo','coolpad','lenovo','xiaomi','koobee','meizu','huawei','nokia','hw','samsung',
           'lephone','letv','nubia','zte','oneplus','smartisan','redmi','honor','htc','meitu','360',
           'hisense','realme','mi','gionee','sm','lemobile','sony','iphone']
for i in xinghao:
    data['make1'] = data['make1'].astype(str).map(lambda x:i if i in x else x)
# Clean make: strings containing Xiaomi model codes m1..m10 are mapped to 'mi'
for s in ['m'+str(k) for k in range(1,11)]:
    data['make1'] = data['make1'].astype(str).map(lambda x:'mi' if s in x else x)
# Clean make: unify manufacturers written in English with their Chinese names
# (identity mappings such as 'lephone', 'htc', '360' and 'lemobile' are dropped as no-ops)
make_map = {'coolpad':'酷派','lenovo':'联想','xiaomi':'小米','koobee':'酷比','meizu':'魅族',
            'huawei':'华为','nokia':'诺基亚','hw':'华为','samsung':'三星','letv':'乐视',
            'nubia':'努比亚','zte':'中兴','smartisan':'锤子','redmi':'小米','honor':'华为',
            'meitu':'美图','hisense':'海信','mi':'小米','sm':'三星','sony':'索尼'}
data['make1'].replace(make_map,inplace=True)
- osv
data['osv1'] = data['osv']
data['osv1'].replace('Android_',"",inplace=True,regex=True)
data['osv1'].replace('Android',"",inplace=True,regex=True)
data['osv1'] = data['osv1'].str.split('_',expand=True,n=0)[0]
data['osv1'] = data['osv1'].astype(str).map(lambda x:'.'.join(x.split('.')[0:2]))
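A quick illustration of this cleaning on a made-up value:
# Hypothetical example: 'Android_8.1.0' -> '8.1.0' -> major.minor '8.1'
s = pd.Series(['Android_8.1.0'])
s = s.replace('Android_','',regex=True).replace('Android','',regex=True)
s = s.str.split('_',expand=True,n=0)[0]
print(s.astype(str).map(lambda x:'.'.join(x.split('.')[0:2])))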
IV. Feature Construction
4.1 Transformation Features
- h w ppi
data['size'] = (np.sqrt(data['h']**2+data['w']**2)/2.54)/1000  # screen diagonal (pixels, rescaled)
data['ratio'] = data['h']/data['w']                            # aspect ratio
data['px'] = data['ppi']*data['size']
data['mj'] = data['h']*data['w']                               # screen area in pixels
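For intuition, with a hypothetical 1080x1920 screen:
h,w = 1920,1080
print(np.sqrt(h**2+w**2)/2.54/1000,h/w)  # ~0.867 and ~1.78 (the familiar 16:9)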
- Concatenate every pairwise combination of the variables in col1 as strings
col1 = ['model','adunitshowid','h','make','osv','imeimd5','ratio','ver']
com_fat = []
for i,col in enumerate(col1):
    for j,col3 in enumerate(col1):
        if j>i:
            data[col+'_'+col3] = data[col].astype(str)+data[col3].astype(str)
            com_fat.append(col+'_'+col3)
- osv make model
data['make_model_osv'] = data['make']+data['model']+data['osv']
data['osv2'] = data['osv1'].astype(str).map(lambda x:'.'.join(x.split('.')[0:1]))
data['hasv'] = data['osv2'].astype(str).map(lambda x:1 if 'V' in x else 0)
- orientation
# 1 if the reported orientation is consistent with the h/w relationship
data['ishlhw'] = data.apply(lambda x:1 if (x.orientation==0 and x.h>=x.w) or (x.orientation==1 and x.w>=x.h) else 0,axis=1)
- imeimd5 ntt
data['imeimd5_ntt'] = data['imeimd5']+data['ntt'].astype(str)
data['shebeiid'] = data['adidmd5']+data['imeimd5']+data['idfamd5']+data['openudidmd5']  # combined device-ID fingerprint
4.2 Aggregation Features
def merge_nunique(df,columns_groupby,column,new_column_name,type='uint64'):
    # Number of distinct `column` values within each group
    add = pd.DataFrame(df.groupby(columns_groupby)[column].nunique()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    return df

def merge_mean_dist(df,columns_groupby,column,new_column_name,type='float64'):
    # Each row's distance from its group mean of `column`
    add = pd.DataFrame(df.groupby(columns_groupby)[column].mean()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    df[new_column_name] = df.apply(lambda x:x[column]-x[new_column_name],axis=1)
    return df

def entropy(pr):
    # Shannon entropy of the value distribution in pr
    cate = Counter(pr)
    log2 = math.log2
    total = len(pr)
    ent = 0
    for i in cate:
        p = float(cate[i]/total)
        if p==0:  # cannot occur for Counter counts; kept as a guard
            continue
        ent = ent-p*(log2(p))
    return ent

def merge_min(df,columns_groupby,column,new_column_name,type='float64'):
    # Group-wise minimum of `column`
    add = pd.DataFrame(df.groupby(columns_groupby)[column].min()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    return df

def merge_count(df,columns_groupby,new_column_name,type='uint64'):
    # Group size (row count)
    add = pd.DataFrame(df.groupby(columns_groupby).size()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    return df

def merge_max(df,columns_groupby,column,new_column_name,type='float64'):
    # Group-wise maximum of `column`
    add = pd.DataFrame(df.groupby(columns_groupby)[column].max()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    return df

def merge_mean(df,columns_groupby,column,new_column_name,type='float64'):
    # Group-wise mean of `column`
    add = pd.DataFrame(df.groupby(columns_groupby)[column].mean()).reset_index()
    add.columns = columns_groupby+[new_column_name]
    df = df.merge(add,on=columns_groupby,how='left')
    df[new_column_name] = df[new_column_name].astype(type)
    return df
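As an example of the entropy helper: a device that always reports the same model has entropy 0, while one alternating evenly between two models has entropy 1 (the values below are made up):
print(entropy(['mi 8','mi 8','mi 8']))                   # 0.0
print(entropy(['mi 8','galaxy s9','mi 8','galaxy s9']))  # 1.0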
# Entropy of the pkgname/model/make distributions seen per device identifier
data['imeimd5_entropy_pkgname'] = data.groupby(['imeimd5'])['pkgname'].transform(entropy)
data['adidmd5_entropy_pkgname'] = data.groupby(['adidmd5'])['pkgname'].transform(entropy)
data['imeimd5_entropy_model1'] = data.groupby(['imeimd5'])['model1'].transform(entropy)
data['imeimd5_entropy_make1'] = data.groupby(['imeimd5'])['make1'].transform(entropy)
data['adidmd5_entropy_model1'] = data.groupby(['adidmd5'])['model1'].transform(entropy)
data['adidmd5_entropy_make1'] = data.groupby(['adidmd5'])['make1'].transform(entropy)
data['macmd5_entropy_model1'] = data.groupby(['macmd5'])['model1'].transform(entropy)
data['macmd5_entropy_make1'] = data.groupby(['macmd5'])['make1'].transform(entropy)
data = merge_mean_dist(data,['model1','make1'],'dvctype','meandist_dvctype_model1_make1')
data = merge_nunique(data,['model1'],'make1','nunique_make1_model1')
data = merge_nunique(data,['model1','make1'],'ntt','nunique_ntt_model1_make1')
data = merge_nunique(data,['shebeiid','pkgname'],'ver','nunique_ver_shebeiid_pkgname')
data = merge_nunique(data,['shebeiid'],'ip_2','nunique_ip_2_shebeiid')
data = merge_nunique(data,['shebeiid'],'city','nunique_city_shebeiid')
data = merge_nunique(data,['shebeiid'],'dvctype','nunique_dvctype_shebeiid')
data = merge_nunique(data,['imeimd5'],'adidmd5','nunique_adidmd5_imeimd5')
data = merge_nunique(data,['adidmd5'],'imeimd5','nunique_imeimd5_adidmd5')
data = merge_nunique(data,['macmd5'],'imeimd5','nunique_imeimd5_macmd5')
data = merge_nunique(data,['imeimd5'],'macmd5','nunique_macmd5_imeimd5')
data = merge_count(data,['pkgname','day'],'cntpkgnameday')
data = merge_count(data,['imeimd5','day'],'cntimeimd5day')
data = merge_count(data,['model1','make1'],'cntmodel1make1')
data = merge_max(data,['model1'],'cntmodel1make1','maxmodel1cntmm',type='uint64')
data['dictmodelnum'] = data['maxmodel1cntmm']-data['cntmodel1make1']
data = merge_count(data,['model1','make1','osv2'],'cntmodel1make1osv2')
data = merge_max(data,['model1','make1'],'cntmodel1make1osv2','maxosv2cntmm',type='uint64')
data['dictosv2um'] = data['maxosv2cntmm']-data['cntmodel1make1osv2']
data = merge_count(data,['imeimd5','city'],'cntimeimd5city')
data = merge_max(data,['imeimd5'],'cntimeimd5city','maxcitycntimeimd5',type='uint64')
data['dictimeicitynum'] = data['maxcitycntimeimd5']-data['cntimeimd5city']
data = merge_count(data,['adidmd5','city'],'cntadidmd5city')
data = merge_max(data,['adidmd5'],'cntadidmd5city','maxcitycntadidmd5',type='uint64')
data['dictadidcitynum'] = data['maxcitycntadidmd5']-data['cntadidmd5city']
data = merge_min(data,['pkgname'],'day','pkgnameminday')
data = merge_min(data,['adunitshowid'],'day','adunitshowidminday')
data = merge_min(data,['adidmd5'],'day','adidmd5minday')
data = merge_min(data,['imeimd5'],'day','imeimd5minday')
data = merge_min(data,['macmd5'],'day','macmd5minday')
data['dictpkgnameminday'] = data['day']-data['pkgnameminday']
data['dictadunitshowidminday'] = data['day']-data['adunitshowidminday']
data['dictadidmd5minday'] = data['day']-data['adidmd5minday']
data['dictimeimd5minday'] = data['day']-data['imeimd5minday']
data['dictmacmd5minday'] = data['day']-data['macmd5minday']
del data['cntmodel1make1']
del data['maxmodel1cntmm']
del data['cntmodel1make1osv2']
del data['maxosv2cntmm']
del data['pkgnameminday']
del data['adunitshowidminday']
del data['adidmd5minday']
del data['imeimd5minday']
del data['macmd5minday']
del data['osv2']
del data['hasv']
del data['shebeiid']
4.3 Feature Encoding
object_col = [i for i in data.select_dtypes(object).columns if i not in ['sid','label']]
for i in tqdm(object_col):
    lbl = LabelEncoder()
    data[i] = lbl.fit_transform(data[i].astype(str))
cat_list = [i for i in train.columns if i not in ['sid','label','nginxtime']]+['hour']+['make1','model1']+ip_feat
for i in tqdm(cat_list):
    data['{}_count'.format(i)] = data.groupby(['{}'.format(i)])['sid'].transform('count')
# The last field of sid appears to be a client-side timestamp; its gap to nginxtime is a feature
data['begin_time'] = data['sid'].apply(lambda x:int(x.split('-')[-1]))
data['nginxtime-begin_time'] = data['nginxtime']-data['begin_time']
gc.collect()
cat_list = cat_list+com_fat
feature_name = [i for i in data.columns if i not in ['sid','label','time','day','begin_time']]
V. Model Training
tr_index = ~data['label'].isnull()
X_train = data[tr_index][list(set(feature_name))+['sid']].reset_index(drop=True)
y = data[tr_index]['label'].reset_index(drop=True).astype(int)
X_test = data[~tr_index][list(set(feature_name))+['sid']].reset_index(drop=True)
print(X_train.shape,X_test.shape)
oof_cat = np.zeros(X_train.shape[0])
prediction_cat = np.zeros(X_test.shape[0])
skf = StratifiedKFold(n_splits=5,random_state=2019,shuffle=True)
for index,(train_index,test_index) in enumerate(skf.split(X_train,y)):
    train_x,test_x = X_train[feature_name].iloc[train_index],X_train[feature_name].iloc[test_index]
    train_y,test_y = y.iloc[train_index],y.iloc[test_index]
    # Uniform sample weights: a leftover hook for re-weighting experiments, currently a no-op
    train_weight = [1 for i in train_y]
    cbt_model = cbt.CatBoostClassifier(iterations=3500,
                                       learning_rate=0.1,
                                       max_depth=9,
                                       random_seed=42,
                                       verbose=100,
                                       early_stopping_rounds=500,
                                       task_type='GPU',
                                       eval_metric='F1',
                                       cat_features=cat_list)
    cbt_model.fit(train_x,train_y,eval_set=(test_x,test_y),sample_weight=train_weight)
    oof_cat[test_index] += np.array(cbt_model.predict_proba(test_x)[:,1])
    prediction_cat += np.array(cbt_model.predict_proba(X_test[feature_name])[:,1])/5
    del cbt_model
    gc.collect()
print('F1',f1_score(y,np.round(oof_cat)))
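Since F1 is threshold-sensitive, a natural refinement (not part of our original pipeline) is to search for the threshold that maximizes F1 on the out-of-fold predictions before applying it to the test set; a minimal sketch using the oof_cat and y from above:
best_t,best_f1 = 0.5,0
for t in np.arange(0.30,0.70,0.01):
    score = f1_score(y,(oof_cat>=t).astype(int))
    if score>best_f1:
        best_t,best_f1 = t,score
print('best threshold {:.2f}, F1 {:.4f}'.format(best_t,best_f1))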
# Write out the submission
submit = test[['sid']]
submit['label'] = (prediction_cat>=0.499).astype(int)
print(submit['label'].value_counts())
submit.to_csv('submission.csv',index=False)