文章目录
Amazon Employee Access 数据分析报告
报告摘要
一、问题描述
二、数据加载
三、数据探索
3.1 描述性统计
3.2 变量间的对应关系探索
3.2.1 ROLE_TITLE与ROLE_CODE
3.2.2 ROLE_ROLLUP_1与ROLE_DEPTNAME
3.2.3 ROLE_ROLLUP_2与ROLE_DEPTNAME
3.2.4 ROLE_ROLLUP_1与ROLE_ROLLUP_2
3.2.5 ROLE_FAMILY与ROLE_FAMILY_DESC
3.2.6 ROLE_TITLE和ROLE_FAMILY
3.3 变量分布探索
四、特征工程
4.1 降维
4.2 新增单变量的频率
4.3 新增双变量的频率
4.4 新增三变量的频率
4.5 新增四变量的频率
4.6 新增各变量出现频率的条件概率
五、模型建立
六、模型预测与评价
Amazon Employee Access 数据分析报告
报告摘要
目标:本分析旨在利用Amazon的员工编号相关信息,来分析和预测当员工申请访问某个编号的资源时,是否被允许访问。
方法: 在原有部分变量的基础上,利用特征工程的方法,新增了单变量、双变量、三变量、四变量出现的频率和变量出现的条件概率等变量,利用随机森林模型,对目标变量进行预测。
结论:
一、对于训练集数据分析发现,各变量之间存在着一定的联系,其中ROLE_TITLE变量和ROLE_RODE变量存在一对一的关系,ROLE_TITLE变量和ROLE_FAMILY变量存在多对一的关系,其他变量之间也存在较强的对应关系。
二、根据这种方法建模,发现模型具有一定的预测效果。
目录
问题描述
数据加载
数据探索
描述性统计
变量间的对应关系探索
变量分布探索
特征工程
降维
新增单变量频率
新增双变量频率
新增三变量频率
新增四变量频率
新增各变量出现的条件概率
模型建立
模型预测与评价
一、问题描述
利用Amazon员工的编号信息,包括员工经理的编号、员工所在分类的编号、员工所在部门编号、员工职位编号、员工类别编号等信息,来预测当员工申请访问某个编号的资源时,是否被允许访问。
变量名含义
ACTION1代表资源被授权访问,0代表资源未被授权访问
RESOURCE资源编号
MGR_ID员工经理的编号
ROLE_ROLLUP_1公司员工分类1,如美国工程
ROLE_ROLLUP_2公司员工分类2,如美国零售
ROLE_DEPTNAME公司部门描述,如零售
ROLE_TITLE职位名称,如高级工程零售经理
ROLE_FAMILY_DESC公司员工类别扩展描述,如零售经理,软件工程
ROLE_FAMILY公司员工类别描述,如零售经理
ROLE_CODE员工角色编号
二、数据加载
加载所需的python库
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.graphics.api as smg
import patsy
get_ipython().magic('matplotlib inline')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
from scipy import stats
import seaborn as sns
载入train数据集
amazon = pd.read_csv("C:/Users/cs/Desktop/Amazon/train.csv")
data =amazon
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY |ROLE_CODE---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 290919 | 1179081 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 308574 | 1185392 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 19721 | 1178803 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 290919 | 1183224 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 19793 | 119325
三、数据探索
3.1 描述性统计
train数据集共有32769个样本,不存在缺失值
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32769 entries, 0 to 32768
Data columns (total 10 columns):
ACTION 32769 non-null int64
RESOURCE 32769 non-null int64
MGR_ID 32769 non-null int64
ROLE_ROLLUP_1 32769 non-null int64
ROLE_ROLLUP_2 32769 non-null int64
ROLE_DEPTNAME 32769 non-null int64
ROLE_TITLE 32769 non-null int64
ROLE_FAMILY_DESC 32769 non-null int64
ROLE_FAMILY 32769 non-null int64
ROLE_CODE 32769 non-null int64
dtypes: int64(10)
memory usage: 2.5 MB
样本中,约有5.8%的员工授权申请没有通过,除ACTION外,各变量编号从个位数到六位数不等。
data.describe()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY |ROLE_CODE---|---|---|---|---|---|---|---|---|---|---count | 32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 |32769.000000 | 32769.000000 | 32769.000000 | 32769.000000 |32769.000000 | 32769.000000mean | 0.942110 | 42923.916171 | 25988.957979 | 116952.627788 |118301.823156 | 118912.779914 | 125916.152644 | 170178.369648 |183703.408893 | 119789.430132std | 0.233539 | 34173.892702 | 35928.031650 | 10875.563591 |4551.588572 | 18961.322917 | 31036.465825 | 69509.462130 |100488.407413 | 5784.275516min | 0.000000 | 0.000000 | 25.000000 | 4292.000000 | 23779.000000| 4674.000000 | 117879.000000 | 4673.000000 | 3130.000000 |117880.00000025% | 1.000000 | 20299.000000 | 4566.000000 | 117961.000000 |118102.000000 | 118395.000000 | 118274.000000 | 117906.000000 |118363.000000 | 118232.00000050% | 1.000000 | 35376.000000 | 13545.000000 | 117961.000000 |118300.000000 | 118921.000000 | 118568.000000 | 128696.000000 |119006.000000 | 118570.00000075% | 1.000000 | 74189.000000 | 42034.000000 | 117961.000000 |118386.000000 | 120535.000000 | 120006.000000 | 235280.000000 |290919.000000 | 119348.000000max | 1.000000 | 312153.000000 | 311696.000000 | 311178.000000 |286791.000000 | 286792.000000 | 311867.000000 | 311867.000000 |308574.000000 | 270691.000000
查看各变量上不同编号的种类数。可以发现,在30000多个样本中,RESOURCE、MGR_ID和ROLE_FAMILY上编号种类数较多,其他变量上编号种类数较少。值得注意的是,ROLE_TITLE和ROLE_CODE种类数一致。
f = lambda x: x.unique().size
data.apply(f)
ACTION 2
RESOURCE 7518
MGR_ID 4243
ROLE_ROLLUP_1 128
ROLE_ROLLUP_2 177
ROLE_DEPTNAME 449
ROLE_TITLE 343
ROLE_FAMILY_DESC 2358
ROLE_FAMILY 67
ROLE_CODE 343
dtype: int64
3.2 变量间的对应关系探索
3.2.1 ROLE_TITLE与ROLE_CODE
画出ROLE_TITLE和ROLE_CODE变量的散点图,存在明显的正相关关系。
fig,ax = plt.subplots(nrows=1,ncols=1,figsize=(8,5))
plt.scatter(data.ROLE_TITLE,data.ROLE_CODE)
<matplotlib.collections.PathCollection at 0xabb0c50>
将两个变量的值合并,编号的种类数目仍为343,
TITLE_CODE = data.ROLE_TITLE*1000000+data.ROLE_CODE
TITLE_CODE.unique().size
343
# 定义f2,用来计算交叉表每一行或每一列中非0值的个数
f2 = lambda x: x[x!=0].count()
# 画出两个变量间的交叉表
TICO = pd.crosstab(data.ROLE_TITLE,data.ROLE_CODE)
# 观察交叉表中ROLE_CODE变量对应的ROLE_TITLE变量个数
TICO.apply(f2).plot()
# 在变量ROLE_CODE上,对应的ROLE_TITLE个数为0,说明两个变量间至少存在一对多的对应关系
TICO.apply(f2)[TICO.apply(f2)>1]
Series([], dtype: int64)
观察交叉表中ROLE_TITLE变量对应的ROLE_CODE变量个数,也为0,说明两个变量间存在一一对应的关系
TICO.apply(f2,axis=1).plot()
TICO.apply(f2,axis=1)[TICO.apply(f2,axis=1)>1]
Series([], dtype: int64)
3.2.2 ROLE_ROLLUP_1与ROLE_DEPTNAME
# 将两个变量的值合并,编号的种类数目发生了较大的变化,但仍可发现,存在一定的对应关系
RO1_DEP= data.ROLE_ROLLUP_1*10000000+data.ROLE_DEPTNAME
data.ROLE_ROLLUP_1.unique().size, data.ROLE_DEPTNAME.unique().size, RO1_DEP.unique().size
# ctRO1DEP = pd.crosstab(data.ROLE_ROLLUP_1,data.ROLE_DEPTNAME)
(128, 449, 1185)
3.2.3 ROLE_ROLLUP_2与ROLE_DEPTNAME
# 将两个变量的值合并,编号的种类数目发生了较大的变化,但仍可发现,存在一定的对应关系
RO2_DEP= data.ROLE_ROLLUP_2*10000000+data.ROLE_DEPTNAME
data.ROLE_ROLLUP_2.unique().size, data.ROLE_DEPTNAME.unique().size, RO2_DEP.unique().size
(177, 449, 1398)
3.2.4 ROLE_ROLLUP_1与ROLE_ROLLUP_2
# 将两个变量合并,编号的唯一值数目变化不大,说明两者之间存在很强的对应关系
RO1_RO2= data.ROLE_ROLLUP_1*10000000+data.ROLE_ROLLUP_2
data.ROLE_ROLLUP_1.unique().size, data.ROLE_ROLLUP_2.unique().size, RO1_RO2.unique().size
(128, 177, 187)
#画出两个变量间的交叉表
ctRO12 = pd.crosstab(data.ROLE_ROLLUP_1,data.ROLE_ROLLUP_2)
# 观察交叉表中ROLE_ROLLUP_2变量对应的ROLE_ROLLUP_1变量个数
ctRO12.apply(f2).plot()
# 在变量ROLE_ROLLUP_2上,只有三个值对应的ROLE_ROLLUP_1个数大于1(非一一对应关系),说明两个变量间有很强的一对多的对应关系
ctRO12.apply(f2)[ctRO12.apply(f2)>1]
ROLE_ROLLUP_2
118164 2
118178 2
119256 9
dtype: int64
# 统计ROLE_ROLLUP_2编号为118164、118178和119356样本的数目,样本数目的变量并不多,但总体上,未通过授权的比率比平均高
a = data.ROLE_ROLLUP_2[(data.ROLE_ROLLUP_2==118164) | (data.ROLE_ROLLUP_2==118178)| (data.ROLE_ROLLUP_2==119256)].count()
b = data.ACTION[(data.ROLE_ROLLUP_2==118164) | (data.ROLE_ROLLUP_2==118178)| (data.ROLE_ROLLUP_2==119256)].value_counts()
b,a
(1 380
0 36
Name: ACTION, dtype: int64, 416)
# 观察交叉表中ROLE_ROLLUP_1变量对应的ROLE_ROLLUP_2变量个数
# ctRO12.apply(f,axis=1).plot()
# 在变量ROLE_ROLLUP_1上,有32个值对应的ROLE_ROLLUP_2个数大于1
ctRO12.apply(f2,axis=1)[ctRO12.apply(f2,axis=1)>1].count()
32
3.2.5 ROLE_FAMILY与ROLE_FAMILY_DESC
# 将两个变量合并,编号的唯一值数目变化不大,说明两者之间存在很强的对应关系
FA_DESC= data.ROLE_FAMILY_DESC*1000000+data.ROLE_FAMILY
data.ROLE_FAMILY_DESC.unique().size,data.ROLE_FAMILY.unique().size, FA_DESC.unique().size
(2358, 67, 2586)
#画出两个变量间的交叉表
ctFAFA = pd.crosstab(data.ROLE_FAMILY,data.ROLE_FAMILY_DESC)
# 在变量ROLE_FAMILY_DESC上,有170个值对应的ROLE_FAMILY个数大于1,
# 在变量ROLE_FAMILY上,有59个值对应的ROLE_FAMILY_DESC个数大于1,说明两个变量间有较强的一对多的对应关系
3.2.6 ROLE_TITLE和ROLE_FAMILY
# 将两个变量合并,唯一值没有发生变化,说明两者之间可能存在一对多关系
TIFA = data.ROLE_TITLE*1000000+data.ROLE_FAMILY
data.ROLE_TITLE.unique().size, data.ROLE_FAMILY.unique().size, TIFA.unique().size
(343, 67, 343)
#画出两个变量间的交叉表
ctTIFA = pd.crosstab(data.ROLE_TITLE,data.ROLE_FAMILY)
# 观察交叉表中ROLE_TITLE变量对应的ROLE_FAMILY变量个数
ctTIFA.apply(f2,axis=1).plot()
# 可以发现,ROLE_TITLE 与ROLE_FAMILY之间存在着一对多的关系,
ctTIFA.apply(f2,axis=1)[ctTIFA.apply(f2,axis=1)>1].count()
0
3.3 变量分布探索
# 画出变量ACTION的条形图,大部分的申请都被授权
fig,ax = plt.subplots(figsize=(8,5))
data.ACTION.value_counts().plot(kind="bar",color="lightblue")
ax.set_xticklabels(("Accessed","Not Accessed"), rotation= "horizontal" )
ax.set_title("Bar plot of Action")
<matplotlib.text.Text at 0xd173080>
# 画出其余变量的分布直方图,RESOURCE和MGR_ID变量的编号大多分布在0-1000000上,且分布相对离散,其余变量分布都集中在一定的值和区域内。
# 如变量ROLE_ROLLUP_1上,有21407个样本编号为117961;在ROLE_FAMILY上有10980个样本的编号为290919。
# data.ROLE_ROLLUP_1.value_counts(),data.ROLE_FAMILY.value_counts()
fig,ax = plt.subplots(nrows=4,ncols=2,figsize=(20,40))
data.RESOURCE.hist(ax=ax[0,0],bins=100)
ax[0,0].set_title("Hist plot of RESOURCE")
data.MGR_ID.hist(ax=ax[0,1],bins=100)
ax[0,1].set_title("Hist plot of MGR_ID")
data.ROLE_ROLLUP_1.hist(ax=ax[1,0],bins=100)
ax[1,0].set_title("Hist plot of ROLE_ROLLUP_1")
data.ROLE_ROLLUP_2.hist(ax=ax[1,1],bins=100)
ax[1,1].set_title("Hist plot of ROLE_ROLLUP_2")
data.ROLE_DEPTNAME.hist(ax=ax[2,0],bins=100)
ax[2,0].set_title("ROLE_DEPTNAME")
data.ROLE_TITLE.hist(ax=ax[2,1],bins=100)
ax[2,1].set_title("Hist plot of ROLE_TITLE")
data.ROLE_FAMILY_DESC.hist(ax=ax[3,0],bins=100)
ax[3,0].set_title("Hist plot of ROLE_FAMILY_DESC")
data.ROLE_FAMILY.hist(ax=ax[3,1],bins=100)
ax[3,1].set_title("Hist plot of ROLE_FAMILY")
<matplotlib.text.Text at 0xdb2fa90>
# 画出变量间相关系数矩阵图,变量编号的值之间并没有明显的线性关系
cm = np.corrcoef(data.values.T)
sns.set(font_scale=1)
cols = data.columns
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 10},
yticklabels=cols,
xticklabels=cols)
plt.tight_layout()
plt.show()
四、特征工程
4.1 降维
# 由于ROLE_CODE和ROLE_FAMILY与ROLE_TITLE存在一对一和一对多的关系,认为他不能包含更多的信息,删去这两个变量
data = amazon
del data["ROLE_CODE"]
del data["ROLE_FAMILY"]
amazon = pd.read_csv("C:/Users/cs/Desktop/Amazon/train.csv")
4.2 新增单变量的频率
# 利用循环,得到每个自变量出现的频率,赋值到新的列中。
one= ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(one)):
a=data[one[i]]
b=data[one[i]].value_counts()/32769
a=a.map(b)
data[one[i]+"_prob"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ROLE_ROLLUP_1_prob | ROLE_ROLLUP_2_prob |
ROLE_DEPTNAME_probROLE_TITLE_probROLE_FAMILY_DESC_prob
0139353
1179060.0000920.001678
0.1093410.210443
1117183
1185360.0009150.000305
0.0024720.000366
2136724
2679520.0000610.000092
0.0383290.001007
3136135
2409830.0000310.001892
0.1418720.037963
4142680
1239320.0002440.000275
0.0022890.000580
4.3 新增双变量的频率
# 利用循环,得到每两个自变量同时出现的频率,赋值到新的列中。
two = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(two)):
for j in range(i+1,len(two)):
a=data[two[i]]+data[two[j]]*1000000
b=a.value_counts()/32769
a=a.map(b)
data[two[i]+"_"+two[j]+"_prob"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob |ROLE_ROLLUP_1_ROLE_DEPTNAME_prob | ROLE_ROLLUP_1_ROLE_TITLE_prob |ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob | ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |ROLE_ROLLUP_2_ROLE_TITLE_prob | ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |ROLE_DEPTNAME_ROLE_TITLE_prob | ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |ROLE_TITLE_ROLE_FAMILY_DESC_prob---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 0.135006 | 0.002014 |0.089200 | 0.180659 | 0.002014 | 0.013855 | 0.033233 | 0.000671 |0.001678 | 0.0795571 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.120388 | 0.003815 |0.002472 | 0.000366 | 0.003754 | 0.000580 | 0.000153 | 0.000153 |0.000153 | 0.0003662 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 0.005615 | 0.000397 |0.001556 | 0.000061 | 0.000397 | 0.001556 | 0.000061 | 0.005615 |0.000061 | 0.0000613 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 0.120388 | 0.005401 |0.125057 | 0.036956 | 0.005035 | 0.022460 | 0.007782 | 0.003052 |0.001770 | 0.0162044 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.004211 | 0.000671 |0.000488 | 0.000305 | 0.000549 | 0.000244 | 0.000244 | 0.000183 |0.000183 | 0.000519
5 rows × 36 columns
4.4 新增三变量的频率
# 利用循环,得到每三个自变量同时出现的频率,赋值到新的列中。
three = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(0,len(three)):
for j in range(i+1,len(three)):
for k in range(j+1,len(three)):
a = data[three[i]]*100000*100000+data[three[j]]*1000000+data[three[k]]
b = a.value_counts()/91690
a = a.map(b)
data[three[i]+"_"+three[j]+"_"+three[k]+"_"+"prob"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob |ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob |ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob |ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob |ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 0.000720 | 0.004951 |0.011877 | 0.000185 | 0.000556 | 0.023220 | 0.000185 | 0.000556 |0.003937 | 0.0002181 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.001341 | 0.000207 |0.000055 | 0.000055 | 0.000055 | 0.000131 | 0.000055 | 0.000055 |0.000055 | 0.0000552 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 0.000142 | 0.000556 |0.000022 | 0.000055 | 0.000022 | 0.000022 | 0.000055 | 0.000022 |0.000022 | 0.0000223 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 0.001800 | 0.008027 |0.002781 | 0.001091 | 0.000633 | 0.005682 | 0.000971 | 0.000534 |0.001451 | 0.0005454 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.000196 | 0.000087 |0.000087 | 0.000044 | 0.000044 | 0.000109 | 0.000022 | 0.000022 |0.000087 | 0.000065
5 rows × 71 columns
4.5 新增四变量的频率
# 利用循环,得到每三个自变量和RESOURCE同时出现的频率,赋值到新的列中。
four = ["RESOURCE","MGR_ID","ROLE_ROLLUP_1","ROLE_ROLLUP_2","ROLE_DEPTNAME","ROLE_TITLE","ROLE_FAMILY_DESC"]
for i in range(1,len(four)):
for j in range(i+1,len(four)):
for k in range(j+1,len(four)):
a = data[four[0]]*100000*100000+data[four[i]]*1000000+data[four[j]]+data[four[k]]*0.000001
b = a.value_counts()/32769
a = a.map(b)
data[four[0]+"_"+four[i]+"_"+four[j]+"_"+four[k]+"_"+"prob"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... |RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob |RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_TITLE_prob |RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_TITLE_prob |RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob |RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 0.000092 | 0.000092 |0.000092 | 0.000031 | 0.000031 | 0.000061 | 0.000031 | 0.000031 |0.000061 | 0.0000311 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.000336 | 0.000305 |0.000153 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 |0.000031 | 0.0000312 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 0.000061 | 0.000061 |0.000031 | 0.000061 | 0.000031 | 0.000031 | 0.000061 | 0.000031 |0.000031 | 0.0000313 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 0.000031 | 0.000031 |0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 | 0.000031 |0.000031 | 0.0000314 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.000061 | 0.000061 |0.000061 | 0.000061 | 0.000061 | 0.000092 | 0.000031 | 0.000031 |0.000061 | 0.000061
5 rows × 91 columns
4.6 新增各变量出现频率的条件概率
# RESOURCE 确定时其他单个变量同时发生的概率
resourcetwo = ['RESOURCE_MGR_ID_prob','RESOURCE_ROLE_ROLLUP_1_prob', 'RESOURCE_ROLE_ROLLUP_2_prob','RESOURCE_ROLE_DEPTNAME_prob',
'RESOURCE_ROLE_TITLE_prob','RESOURCE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcetwo)):
a = data[resourcetwo[i]]/data.RESOURCE_prob
data[resourcetwo[i]+"_"+"probre"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_TITLE_prob| RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_ROLE_FAMILY_DESC_prob |RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_ROLE_FAMILY_DESC_prob |RESOURCE_MGR_ID_prob_probre | RESOURCE_ROLE_ROLLUP_1_prob_probre |RESOURCE_ROLE_ROLLUP_2_prob_probre | RESOURCE_ROLE_DEPTNAME_prob_probre |
RESOURCE_ROLE_TITLE_prob_probreRESOURCE_ROLE_FAMILY_DESC_prob_probre
01
1179060.000092
0.0000610.000031
0.6666671.000000
11
1185360.000915
0.0000310.000031
0.0333330.033333
21
2679520.000061
0.0000310.000031
1.0000000.500000
31
2409830.000031
0.0000310.000031
1.0000001.000000
41
1239320.000244
0.0000610.000061
0.5000000.375000
5 rows × 97 columns
# 其他单个变量确定时RESOURCE变量同时发生的概率
resourcetwo = ['RESOURCE_MGR_ID_prob','RESOURCE_ROLE_ROLLUP_1_prob', 'RESOURCE_ROLE_ROLLUP_2_prob','RESOURCE_ROLE_DEPTNAME_prob',
'RESOURCE_ROLE_TITLE_prob','RESOURCE_ROLE_FAMILY_DESC_prob']
resourceone = [ 'MGR_ID_prob', 'ROLE_ROLLUP_1_prob','ROLE_ROLLUP_2_prob', 'ROLE_DEPTNAME_prob', 'ROLE_TITLE_prob','ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcetwo)):
a = data[resourcetwo[i]]/data[resourceone[i]]
data[resourcetwo[i]+"_"+"proboth"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | RESOURCE_ROLE_ROLLUP_2_prob_probre |RESOURCE_ROLE_DEPTNAME_prob_probre | RESOURCE_ROLE_TITLE_prob_probre |RESOURCE_ROLE_FAMILY_DESC_prob_probre | RESOURCE_MGR_ID_prob_proboth |RESOURCE_ROLE_ROLLUP_1_prob_proboth | RESOURCE_ROLE_ROLLUP_2_prob_proboth |RESOURCE_ROLE_DEPTNAME_prob_proboth | RESOURCE_ROLE_TITLE_prob_proboth |RESOURCE_ROLE_FAMILY_DESC_prob_proboth---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 1.000000 | 0.333333 |0.666667 | 1.000000 | 0.054545 | 0.000140 | 0.000678 | 0.013889 |0.000558 | 0.0004351 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.366667 | 0.033333 |0.033333 | 0.033333 | 0.100000 | 0.001215 | 0.002788 | 0.006289 |0.012346 | 0.0833332 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 1.000000 | 1.000000 |1.000000 | 0.500000 | 0.333333 | 0.010870 | 0.010870 | 0.003663 |0.001592 | 0.0303033 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 1.000000 | 1.000000 |1.000000 | 1.000000 | 0.016129 | 0.000047 | 0.000253 | 0.005263 |0.000215 | 0.0008044 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.250000 | 0.250000 |0.500000 | 0.375000 | 0.222222 | 0.010870 | 0.014493 | 0.044444 |0.053333 | 0.157895
5 rows × 103 columns
# RESOURCE 确定时其他两个变量同时发生的概率
resourcethree = [ 'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob','RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob', 'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob',
'RESOURCE_MGR_ID_ROLE_TITLE_prob','RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob', 'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob',
'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob', 'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob', 'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob','RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcethree)):
a = data[resourcethree[i]]/data.RESOURCE_prob
data[resourcethree[i]+"_"+"probre"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_probre |RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_probre |RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob_probre |RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_probre |RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_probre |RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob_probre |RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_probre |RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob_probre |RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_probre |RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob_probre---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 0.357389 | 0.119130 |0.238259 | 0.357389 | 0.119130 | 0.238259 | 0.357389 | 0.119130 |0.119130 | 0.2382591 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.131043 | 0.011913 |0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 | 0.011913 |0.011913 | 0.0119132 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 0.357389 | 0.357389 |0.357389 | 0.178695 | 0.357389 | 0.357389 | 0.178695 | 0.357389 |0.178695 | 0.1786953 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 0.357389 | 0.357389 |0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 | 0.357389 |0.357389 | 0.3573894 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.089347 | 0.089347 |0.134021 | 0.134021 | 0.044674 | 0.089347 | 0.089347 | 0.089347 |0.089347 | 0.134021
5 rows × 118 columns
# 其他两个变量确定时RESOURCE变量同时发生的概率
resourcethree = [ 'RESOURCE_MGR_ID_ROLE_ROLLUP_1_prob','RESOURCE_MGR_ID_ROLE_ROLLUP_2_prob', 'RESOURCE_MGR_ID_ROLE_DEPTNAME_prob',
'RESOURCE_MGR_ID_ROLE_TITLE_prob','RESOURCE_MGR_ID_ROLE_FAMILY_DESC_prob', 'RESOURCE_ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob',
'RESOURCE_ROLE_ROLLUP_1_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_TITLE_prob','RESOURCE_ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_ROLLUP_2_ROLE_DEPTNAME_prob','RESOURCE_ROLE_ROLLUP_2_ROLE_TITLE_prob', 'RESOURCE_ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob',
'RESOURCE_ROLE_DEPTNAME_ROLE_TITLE_prob', 'RESOURCE_ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob','RESOURCE_ROLE_TITLE_ROLE_FAMILY_DESC_prob']
othertwo = ['MGR_ID_ROLE_ROLLUP_1_prob','MGR_ID_ROLE_ROLLUP_2_prob','MGR_ID_ROLE_DEPTNAME_prob', 'MGR_ID_ROLE_TITLE_prob',
'MGR_ID_ROLE_FAMILY_DESC_prob', 'ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob', 'ROLE_ROLLUP_1_ROLE_DEPTNAME_prob', 'ROLE_ROLLUP_1_ROLE_TITLE_prob',
'ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob', 'ROLE_ROLLUP_2_ROLE_DEPTNAME_prob', 'ROLE_ROLLUP_2_ROLE_TITLE_prob',
'ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob', 'ROLE_DEPTNAME_ROLE_TITLE_prob','ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob',
'ROLE_TITLE_ROLE_FAMILY_DESC_prob']
for i in range(0,len(resourcethree)):
a = data[resourcethree[i]]/data[othertwo[i]]
data[othertwo[i]+"_"+"proboth"]=a
data.head()
| ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 |ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | RESOURCE_prob |MGR_ID_prob | ... | ROLE_ROLLUP_1_ROLE_ROLLUP_2_prob_proboth |ROLE_ROLLUP_1_ROLE_DEPTNAME_prob_proboth |ROLE_ROLLUP_1_ROLE_TITLE_prob_proboth |ROLE_ROLLUP_1_ROLE_FAMILY_DESC_prob_proboth |ROLE_ROLLUP_2_ROLE_DEPTNAME_prob_proboth |ROLE_ROLLUP_2_ROLE_TITLE_prob_proboth |ROLE_ROLLUP_2_ROLE_FAMILY_DESC_prob_proboth |ROLE_DEPTNAME_ROLE_TITLE_prob_proboth |ROLE_DEPTNAME_ROLE_FAMILY_DESC_prob_proboth |ROLE_TITLE_ROLE_FAMILY_DESC_prob_proboth---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 |117906 | 0.000092 | 0.001678 | ... | 0.000242 | 0.005415 |0.000245 | 0.000181 | 0.005415 | 0.001574 | 0.000985 | 0.016245 |0.006498 | 0.0002741 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 |118536 | 0.000915 | 0.000305 | ... | 0.000997 | 0.002859 |0.004412 | 0.029782 | 0.002906 | 0.018810 | 0.071478 | 0.071478 |0.071478 | 0.0297822 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 |267952 | 0.000061 | 0.000092 | ... | 0.003885 | 0.054983 |0.014015 | 0.178695 | 0.054983 | 0.014015 | 0.178695 | 0.003885 |0.178695 | 0.1786953 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 |240983 | 0.000031 | 0.001892 | ... | 0.000091 | 0.002019 |0.000087 | 0.000295 | 0.002166 | 0.000486 | 0.001402 | 0.003574 |0.006162 | 0.0006734 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 |123932 | 0.000244 | 0.000275 | ... | 0.005180 | 0.032490 |0.067010 | 0.107217 | 0.019855 | 0.089347 | 0.089347 | 0.119130 |0.119130 | 0.063069
5 rows × 133 columns
五、模型建立
# 划分测试集与训练集
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve,roc_auc_score,classification_report
y = data.ACTION
X = data
del X["ACTION"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
# 利用以上处理所得的共133个自变量建立随机森林模型
forest = RandomForestClassifier(criterion='entropy',
n_estimators=1000,
random_state=1,
n_jobs=2)
RFfit = forest.fit(X_train , y_train)
六、模型预测与评价
# 利用模型进行预测
preds = RFfit.predict(X_test)
# 得到模型的混淆矩阵如下所示
confusion_matrix(y_test,preds)
array([[ 138, 420],
[ 59, 9214]])
# 得到模型的ROC_AUC得分如下所示
pre = RFfit.predict_proba(X_test)
roc_auc_score(y_test,pre[:,1])
0.8639483844684166
# 得到摸型的ROC曲线如下所示
fpr,tpr,thresholds = roc_curve(y_test,pre[:,1])
fig,ax = plt.subplots(figsize=(8,5))
plt.plot(fpr,tpr)
ax.set_title("Roc of Logistic Randomforest")
<matplotlib.text.Text at 0x26395198>
利用Kaggle测试集得分为0.89,说明模型具有一定的效果。
网友评论