美文网首页data science
2.优秀员工离职原因分析与预测-R(一)

2.优秀员工离职原因分析与预测-R(一)

作者: 1998962ab991 | 来源:发表于2017-01-30 10:59 被阅读1111次

    一、背景

    1.公司大量优秀且有经验的员工过早的离开

    2.数据来源:kaggle

    3.变量

    satisfaction: Employee satisfaction level
    evaluation: Last evaluation
    project: Number of projects
    hours: Average monthly hours
    years: Time spent at the company
    accident: Whether they have had a work accident
    promotion: Whether they have had a promotion in the last 5 years
    sales: Department
    salary: Salary
    left: Whether the employee has left

    4.分析目的与衡量标准:

    (1)分析并得出优秀员工离职的主要可能的原因
    (2)构建预测模型,预测下一位将会离开的优秀员工是谁

    二、数据分析

    所需包导入

    library(readr)
    library(dplyr)
    library(ggplot2)
    library(gmodels)

    (一)导入数据并查看

    ## 1.1 数据导入

    library(readr)
    hr <- read_csv("HR_comma_sep.csv")
    hr <- tbl_df(hr)
    View(hr)
    str(hr)

    Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 14999 obs. of  10 variables:

    $ satisfaction_level  : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
    $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
    $ number_project      : int  2 5 7 5 2 2 6 5 5 2 ...
    $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
    $ time_spend_company  : int  3 6 4 5 3 3 4 5 5 3 ...
    $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
    $ left                : int  1 1 1 1 1 1 1 1 1 1 ...
    $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
    $ sales                : chr  "sales" "sales" "sales" "sales" ...
    $ salary              : chr  "low" "medium" "medium" "low" ...

    ## 1.2 变量重命名

    hr_good <- filter(hr, evaluation>=0.75 & year>=4 & project>= 4)

    colnames(hr)  <- c("satisfaction","evaluation","project","hours","years","accident","left","promotion","sales","salary")

    ## 1.3 因子化

    hr$sales <- factor(hr$sales)
    hr$salary <- factor(hr$salary, levels=c("low","medium","high"))

    ## 1.4 查看数据

    sum(is.na(hr))

    # [1] 0

    summary(hr)

    satisfaction_level last_evaluation  number_project  average_montly_hours
    Min.  :0.0900    Min.  :0.3600  Min.  :2.000  Min.  : 96.0
    1st Qu.:0.4400    1st Qu.:0.5600  1st Qu.:3.000  1st Qu.:156.0
    Median :0.6400    Median :0.7200  Median :4.000  Median :200.0
    Mean  :0.6128    Mean  :0.7161  Mean  :3.803  Mean  :201.1
    3rd Qu.:0.8200    3rd Qu.:0.8700  3rd Qu.:5.000  3rd Qu.:245.0
    Max.  :1.0000    Max.  :1.0000  Max.  :7.000  Max.  :310.0

    time_spend_company Work_accident        left        promotion_last_5years
    Min.  : 2.000    Min.  :0.0000  Min.  :0.0000  Min.  :0.00000
    1st Qu.: 3.000    1st Qu.:0.0000  1st Qu.:0.0000  1st Qu.:0.00000
    Median : 3.000    Median :0.0000  Median :0.0000  Median :0.00000
    Mean  : 3.498    Mean  :0.1446  Mean  :0.2381  Mean  :0.02127
    3rd Qu.: 4.000    3rd Qu.:0.0000  3rd Qu.:0.0000  3rd Qu.:0.00000
    ax.  :10.000    Max.  :1.0000  Max.  :1.0000  Max.  :1.00000

    sales        salary
    sales      :4140  low  :7316
    technical  :2720  medium:6446
    support    :2229  high  :1237
    IT        :1227
    product_mng: 902
    marketing  : 858
    (Other)    :2923

    (二)根据定义选取优秀员工的子集,并做初步分析

    优秀员工的定义:
    (1)评价(evaluation)>=0.75
    (2)项目数量(project)>=4
    (3)有经验的(year)>=4

    ## 2.1 根据定义选取子集

    hr_good <- filter(hr, evaluation>=0.75 & years>=4 & project>= 4)

    ## 2.2 对比总体与选取子集中离职员工的占比情况


    【结论】:优秀员工离职情况非常严重
    1.在总离职员工中,优秀员工的数量占了
    (1778/3571=) 50%;
    2.在优秀员工子集中,离职的数量高达
    (1778/2753=) 64%;


    CrossTable(hr$left)

    Total Observations in Table:  14999

    |      stay |      left |
    |-----------|-----------|
    |    11428 |      3571 |
    |    0.762 |    0.238 |
    |-----------|-----------|

    CrossTable(hr_good$left)

    Total Observations in Table:  2763
    |      stay |      left |
    |-----------|-----------|
    |      985 |      1778 |
    |    0.356 |    0.644 |
    |-----------|-----------|

    ## 2.3 了解优秀员工子集的统计量

    summary(hr_good)

    ## 2.4 了解优秀员工子集中各变量之间的相关性


    【结论】:离职与满意度呈负相关,且相关度最高


    hr_good_corr <- select(hr, -sales,-salary) %>% cor()
    corrplot(hr_good_corr, method="circle", tl.col="black",title="离职与满意度呈负相关,且相关度最高",mar=c(1,1,3,1))

    (三)逐个变量分析员工离职、满意度与其他变量之间的关系

    ## 3.1 查看满意度的分布图

    hr_good$left <- factor(hr_good$left, levels=c(0,1), labels=c("stay", "left"))

    ggplot(hr_good, aes(satisfaction, fill=left)) + geom_histogram(position="dodge") + scale_x_continuous(breaks=c(0.1,0.13,0.25,0.50,0.73,0.75,0.92,1.00)) + theme3 + theme(axis.text.x=element_text(angle=90)) + labs(title="满意度在[0.1,0.13]与[0.73,0.92]两个区间离职人数非常多")

    ## 3.2 收入、工作时间、满意度之间的关系


    【结论】:工作超长时间的员工满意度低,且离职率高
    1.低&中等薪资水平中较高工作时间的员工大量离职
    2.超长工作时间的员工满意度都很低且几乎都已离职


    ggplot(hr_good, aes(salary, hours, alpha=satisfaction, color=left)) + geom_jitter() + theme3 + labs(title=paste("低&中等薪资水平中较高工作时间的员工大量离职","\n","超长工作时间的员工满意度都很低且几乎都已离职"))  

    ## 3.3 晋升、满意度与离职的关系


    【结论】:高评价的员工几乎没有人晋升,离职人员也主要集中在未晋升中


    ggplot(hr_good, aes(promotion, evaluation, color=left)) + geom_jitter() + theme3 + scale_x_discrete(limits=c(0,1)) + labs(title=paste("高评价的员工几乎没有人晋升","\n","离职人员也主要集中在未晋升中"))

    ## 3.4 工作年限、满意度与离职的关系


    【结论】:
    1.低满意度(0.1)水平下,4年司龄的员工大量离职
    2.大量高满意度员工在第5年与第6年离职
    3.7年以上的员工没有人离职


    ggplot(hr_good, aes(years,satisfaction, color=left)) + geom_jitter() + scale_x_discrete(limits=c(4,5,6,7,8,9,10)) + theme3 + labs(title=paste("低满意度(","0.1)","水平下,4年司龄的员工大量离职","\n","大量高满意度员工在第5年与第6年离职"))

    ## 3.5 部门、项目数与离职的关系


    【结论】:对于6个以上的项目无论在哪个部门离职率都非常高


    ggplot(hr_good, aes(sales,fill=left)) + geom_bar(position="fill") + facet_wrap(~factor(project),ncol=1) + theme3 + theme(axis.text.x=element_text(angle=270)) + labs(y="number projcet",title="对于6个以上的项目无论在哪个部门离职率都非常高")

    ## 3.6 部门与离职的关系


    【结论】:各部门离职人员均高于在职人员,管理部门除外


    ggplot(hr_good, aes(sales, fill=left)) + geom_bar(position="dodge") + coord_flip() + scale_x_discrete(limits=c("management","RandD","hr","accounting","marketing","product_mng","IT","support","technical","sales")) + labs(title="各部门离职人员均高于在职人员,管理部门除外") + theme3

    三、构建预测模型1:分类

    ## 数据分割

    library(caret)
    set.seed(0001)
    train <- createDataPartition(hr_good$left, p=0.75, list=FALSE)
    hr_good_train <- hr_good[train, ]
    hr_good_test <- hr_good[-train, ]

    (一)Logistic回归

    ## 1. 构建逻辑回归并验证

    ctrl <- trainControl(method="cv",number=5)
    logit <- train(left~., hr_good_train, method="LogitBoost", trControl=ctrl)
    logit.pred <- predict(logit, hr_good_test, tyep="response")
    confusionMatrix( hr_good_test$left, logit.pred)

    Confusion Matrix and Statistics

                     Reference
    Prediction stay left
            stay  213  33
            left    8  436

    Accuracy : 0.9406
    95% CI : (0.9203, 0.957)
    No Information Rate : 0.6797
    P-Value [Acc > NIR] : < 2.2e-16
    Kappa : 0.8675
    Mcnemar's Test P-Value : 0.0001781
    Sensitivity : 0.9638
    Specificity : 0.9296
    Pos Pred Value : 0.8659
    Neg Pred Value : 0.9820
    Prevalence : 0.3203
    Detection Rate : 0.3087
    Detection Prevalence : 0.3565
    Balanced Accuracy : 0.9467
    'Positive' Class : stay

    ## 2. 评价模型,绘制ROC/AUC曲线

    library(pROC)
    roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE, col="black")

    (二)决策树

    ## 1. 构建决策树

    library(rpart)
    dtree <- rpart(left~., hr_good_train, method="class",parms=list(split="information"))
    dtree$cptable

    CP nsplit  rel error    xerror      xstd

    1 0.55209743      0 1.00000000 1.00000000 0.02950911
    2 0.12855210      1 0.44790257 0.44790257 0.02256805
    3 0.08254398      3 0.19079838 0.19079838 0.01551204
    4 0.02300406      4 0.10825440 0.10825440 0.01186737
    5 0.01000000      5 0.08525034 0.08525034 0.01057607

    plotcp(dtree)

    dtree.pruned <- prune(dtree, cp=0.01)
    library(partykit)
    library(grid)
    plot(as.party(dtree.pruned),main="Decision Tree")

    dtree.pruned.pred <- predict(dtree.pruned, hr_good_test, type="class")
    confusionMatrix(hr_good_test$left, dtree.pruned.pred)

    Confusion Matrix and Statistics

                     Reference
    Prediction stay left
            stay  236  10
          left  13  431

    Accuracy : 0.9667
    95% CI : (0.9504, 0.9788)
    No Information Rate : 0.6391
    P-Value [Acc > NIR] : <2e-16
    Kappa : 0.9275
    Mcnemar's Test P-Value : 0.6767
    Sensitivity : 0.9478
    Specificity : 0.9773
    Pos Pred Value : 0.9593
    Neg Pred Value : 0.9707
    Prevalence : 0.3609
    Detection Rate : 0.3420
    Detection Prevalence : 0.3565
    Balanced Accuracy : 0.9626
    'Positive' Class : stay

    ## 2. 评价模型,绘制ROC/AUC曲线

    roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, print.thres=TRUE, print.auc=TRUE,col="blue")

    (三)随机森林

    ## 1. 构建随机森林

    library(randomForest)
    set.seed(0002)
    forest <- randomForest(left~., hr_good_train, importance=TRUE, na.action=na.roughfix)
    forest

    Call:

    randomForest(formula = left ~ ., data = hr_good_train, importance = TRUE,      na.action = na.roughfix)

    Type of random forest: classification
    Number of trees: 500
    No. of variables tried at each split: 3
    OOB estimate of  error rate: 1.35%
              Confusion matrix:
             stay left class.error
    stay  731    8  0.01082544
    left  20 1314  0.01499250

    importance(forest, type=2)

         MeanDecreaseGini

    satisfaction      315.455428
    evaluation          63.101527
    project            51.659717
    hours              307.201129
    years              168.530576
    accident            6.232136
    promotion            1.897123
    sales              20.666829
    salary              12.338717

    forest.pred <- predict(forest, hr_good_test)
    confusionMatrix(hr_good_test$left, forest.pred)

    Confusion Matrix and Statistics

                        Reference
    Prediction stay left
           stay  241    5
           left    4  440

    Accuracy : 0.987
    95% CI : (0.9754, 0.994)
    No Information Rate : 0.6449
    P-Value [Acc > NIR] : <2e-16
    Kappa : 0.9715
    Mcnemar's Test P-Value : 1
    Sensitivity : 0.9837
    Specificity : 0.9888
    Pos Pred Value : 0.9797
    Neg Pred Value : 0.9910
    Prevalence : 0.3551
    Detection Rate : 0.3493
    Detection Prevalence : 0.3565
    Balanced Accuracy : 0.9862
    'Positive' Class : stay

    ## 2. 评价模型,绘制ROC/AUC曲线

    roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, print.thres=TRUE, print.auc=T, col="green")

    (四)支持向量机SVM

    ## 1. 构建SVM

    library(e1071)
    set.seed(0003)
    svm <- svm(left~., hr_good_train)
    svm.pred <- predict(svm, na.omit(hr_good_test))
    confusionMatrix(na.omit(hr_good_test)$left, svm.pred)

    Confusion Matrix and Statistics

                     Reference

    Prediction stay left
           stay  211  35
            left  11  433

    Accuracy : 0.9333
    95% CI : (0.9121, 0.9508)
    No Information Rate : 0.6783
    P-Value [Acc > NIR] : < 2.2e-16
    Kappa : 0.8515
    Mcnemar's Test P-Value : 0.000696
    Sensitivity : 0.9505
    Specificity : 0.9252
    Pos Pred Value : 0.8577
    Neg Pred Value : 0.9752
    Prevalence : 0.3217
    Detection Rate : 0.3058
    Detection Prevalence : 0.3565
    Balanced Accuracy : 0.9378
    'Positive' Class : stay

    ## 2. 评价模型,绘制ROC/AUC曲线

    roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=T, print.thres=T, print.auc=T, col="orange")

    (五)对比模型,选择准确性最高的模型


    【结论】:随机森林的拟合度最高,选择该模型为预测模型


    roc(as.numeric(hr_good_test$left), as.numeric(logit.pred), plot=TRUE,col="black",main=paste("ROC曲线:","Logitis(black)","dtree(blue)","randomForest(green)","SVM(orange)",sep=" "))
    roc(as.numeric(hr_good_test$left),as.numeric(dtree.pruned.pred), plot=TRUE, col="blue", add=T)
    roc(as.numeric(hr_good_test$left), as.numeric(forest.pred), plot=TRUE, col="green", add=T)
    roc(as.numeric(na.omit(hr_good_test)$left), as.numeric(svm.pred), plot=T, col="orange", add=T)

    (六)模型应用

    importance(forest,type=2)

                       MeanDecreaseGini

    satisfaction      315.455428
    evaluation          63.101527
    project            51.659717
    hours              307.201129
    years              168.530576
    accident            6.232136
    promotion            1.897123
    sales              20.666829
    salary              12.338717

    ## 1. 剔除明显不重要的因子,重新构建模型

    forest2 <- randomForest(left~.-promotion-accident-salary-sales, hr_good, na.action=na.roughfix, importance=TRUE)
    importance(forest2, type=2)

                   MeanDecreaseGini

    satisfaction        462.97925
    evaluation          72.49430
    project              71.09463
    hours              417.63782
    years              231.64080

    forest2.pred <- predict(forest2, hr_good_test)
    confusionMatrix(hr_good_test$left, forest2.pred, positive="left")

    Confusion Matrix and Statistics

                     Reference

    Prediction stay left
            stay  244    2
            left    1  443

    Accuracy : 0.9957
    95% CI : (0.9873, 0.9991)
    No Information Rate : 0.6449
    P-Value [Acc > NIR] : <2e-16
    Kappa : 0.9905
    Mcnemar's Test P-Value : 1
    Sensitivity : 0.9959
    Specificity : 0.9955
    Pos Pred Value : 0.9919
    Neg Pred Value : 0.9977
    Prevalence : 0.3551
    Detection Rate : 0.3536
    Detection Prevalence : 0.3565
    Balanced Accuracy : 0.9957
    'Positive' Class : stay

    ## 2. 评价模型,绘制ROC/AUC曲线

    roc(as.numeric(hr_good_test$left), as.numeric(forest2.pred), plot=TRUE, print.thres=T, print.auc=T, main="Random Forest", col="green")

    (七)结论

    1.调整后的随机森林预测模型员工离职的准确性达99.5%;其中离职的员工被正确预测的概率为99.5%,被预测离职的员工中,实际离职的概率为99.8%;

    2.剔除不重要的变量(promotion,accident,sales,salary)并不会对模型造成影响;

    3.满意度(satisfaction)、月平均工作时间(hours)、工作年限(years)是影响优秀员工离职的主要三个变量



    相关文章

      网友评论

      本文标题:2.优秀员工离职原因分析与预测-R(一)

      本文链接:https://www.haomeiwen.com/subject/nwqcittx.html