美文网首页职业力提升
【Kaggle】泰坦尼克号生存训练

【Kaggle】泰坦尼克号生存训练

作者: 奔跑的蜈蚣 | 来源:发表于2017-11-08 16:51 被阅读69次

    就是这篇文章,知乎称“您的帐号由于存在异常行为暂时被知乎反作弊系统限制使用”,然后任凭我申诉多久,都恢复不了了!!!最可恶的是,在你发布文章的时候一点儿提示都没有,显示发布成功,但就是擅自删除,辛苦写的文章就找不到了!!!

    1.引言

    本文数据来自Kaggle中知名的数据集Titanic Machine Learning from Disaster,是利用训练集训练模型,来预测测试集中的乘客能否在沉船事件存活。

    说明:本文借鉴了 Megan L. Risdal 的文章,在细节处略有改动。

    2.数据导入与观察

    加载需要用到的包:

    > pkgs <- c("dplyr","ggplot2","ggthemes","scales","mice","randomForest")
    > install.packages(pkgs,dependencies = TRUE)
    > library(dplyr) # 操作数据的包
    > library(ggplot2) # 绘图包
    > library(ggthemes)  # ggplot2的主题修改包
    > library(scales)  # 可视化的包
    > library(VIM)  # 查看缺失数据的包
    > library(mice) # 插补数据的包
    > library(randomForest)  # 随机森林
    

    导入并初步观察数据:

    > train <- read.csv(file.choose(),stringsAsFactors = FALSE)
    > test <- read.csv(file.choose(),stringsAsFactors = FALSE)
    > str(train)
    略
    > str(test)
    略
    

    两个数据集除Survived字段不同外,其他字段均相同。为了方便数据清洗,我们合并训练集与测试集。

    > full  <- bind_rows(train, test) 
    > str(full)
    'data.frame':   1309 obs. of  12 variables:
     $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
     $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
     $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
     $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
     $ Sex        : chr  "male" "female" "female" "female" ...
     $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
     $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
     $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
     $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
     $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
     $ Cabin      : chr  "" "C85" "" "C123" ...
     $ Embarked   : chr  "S" "C" "S" "S" ...
    > summary(full)
      PassengerId      Survived          Pclass          Name               Sex                 Age       
     Min.   :   1   Min.   :0.0000   Min.   :1.000   Length:1309        Length:1309        Min.   : 0.17  
     1st Qu.: 328   1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character   1st Qu.:21.00  
     Median : 655   Median :0.0000   Median :3.000   Mode  :character   Mode  :character   Median :28.00  
     Mean   : 655   Mean   :0.3838   Mean   :2.295                                         Mean   :29.88  
     3rd Qu.: 982   3rd Qu.:1.0000   3rd Qu.:3.000                                         3rd Qu.:39.00  
     Max.   :1309   Max.   :1.0000   Max.   :3.000                                         Max.   :80.00  
                    NA's   :418                                                            NA's   :263    
         SibSp            Parch          Ticket               Fare            Cabin             Embarked        
     Min.   :0.0000   Min.   :0.000   Length:1309        Min.   :  0.000   Length:1309        Length:1309       
     1st Qu.:0.0000   1st Qu.:0.000   Class :character   1st Qu.:  7.896   Class :character   Class :character  
     Median :0.0000   Median :0.000   Mode  :character   Median : 14.454   Mode  :character   Mode  :character  
     Mean   :0.4989   Mean   :0.385                      Mean   : 33.295                                        
     3rd Qu.:1.0000   3rd Qu.:0.000                      3rd Qu.: 31.275                                        
     Max.   :8.0000   Max.   :9.000                      Max.   :512.329                                        
                                                         NA's   :1     
    

    合并后的数据中,共有1309个观测,其中训练集891个,测试集418个,生存情况(Survived)中缺失值418个(正常,需要预测的部分),年龄(Age)中缺失值263个,船票费用(Fare)中缺失值1个。

    解释一下各个变量对应的含义:

    3.数据清洗

    3.1 观察姓名变量

    我们注意到乘客名字(Name)有一个非常显著的特点:每个名字当中都包含了具体的称谓或者头衔。我们可以将这部分信息提取出来,或许可以作为非常有用的新变量。

    > # 乘客姓名提取头衔
    > full$Title <- gsub("(.*, )|(\\..*)", "", full$Name)
    > # 按照性别划分头衔数量
    > table(full$Sex, full$Title)
    
    Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs  Ms Rev Sir the Countess
      female    0   0   0    1   1        0    1     0      0  260    2   1   0 197   2   0   0            1
      male      1   4   1    0   7        1    0     2     61    0    0   0 757   0   0   8   1            0
    > # 对于那些出现频率较低的头衔合并为一类  
    > rare_title <- c("Capt","Col","Don","Dona","Dr", "Jonkheer", "Lady","Major", "Rev", "Sir","the Countess")
    > # 数据量不多,我想试试手工调整头衔
    > filter(full,full$Title %in% rare_title)%>%
    +   select(Name,Sex,Age,SibSp,Parch,Title)%>%
    +   arrange(Title)
                                                                    Name    Sex Age SibSp Parch        Title
    1                                       Crosby, Capt. Edward Gifford   male  70     1     1         Capt
    2                                Simonius-Blumer, Col. Oberst Alfons   male  56     0     0          Col
    3                                                    Weir, Col. John   male  60     0     0          Col
    4                                          Gracie, Col. Archibald IV   male  53     0     0          Col
    5                                             Astor, Col. John Jacob   male  47     1     0          Col
    6                                           Uruchurtu, Don. Manuel E   male  40     0     0          Don
    7                                       Oliva y Ocana, Dona. Fermina female  39     0     0         Dona
    8                                        Minahan, Dr. William Edward   male  44     2     0           Dr
    9                                               Moraweck, Dr. Ernest   male  54     0     0           Dr
    10                                                  Pain, Dr. Alfred   male  23     0     0           Dr
    11                                         Stahelin-Maeglin, Dr. Max   male  32     0     0           Dr
    12                                     Frauenthal, Dr. Henry William   male  50     2     0           Dr
    13                                         Brewe, Dr. Arthur Jackson   male  NA     0     0           Dr
    14                                       Leader, Dr. Alice (Farnham) female  49     0     0           Dr
    15                                             Dodge, Dr. Washington   male  53     1     1           Dr
    16                                   Reuchlin, Jonkheer. John George   male  38     0     0     Jonkheer
    17 Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan") female  48     1     0         Lady
    18                                    Peuchen, Major. Arthur Godfrey   male  52     0     0        Major
    19                                 Butt, Major. Archibald Willingham   male  45     0     0        Major
    20                                 Byles, Rev. Thomas Roussel Davids   male  42     0     0          Rev
    21                                        Bateman, Rev. Robert James   male  51     0     0          Rev
    22                                     Carter, Rev. Ernest Courtenay   male  54     1     0          Rev
    23                                    Kirkland, Rev. Charles Leonard   male  57     0     0          Rev
    24                                                 Harper, Rev. John   male  28     0     1          Rev
    25                                             Montvila, Rev. Juozas   male  27     0     0          Rev
    26                                            Lahtinen, Rev. William   male  30     1     1          Rev
    27                                     Peruschitz, Rev. Joseph Maria   male  41     0     0          Rev
    28                      Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")   male  49     1     0          Sir
    29          Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female  33     0     0 the Countess
    > # 我将Lady调整为Mrs,Sir调整为Mr
    > rare_title2 <- c("Capt","Col","Don","Dona","Dr", "Jonkheer","Major", "Rev","the Countess")
    > # 对于一些称呼进行重新指定(按含义) 如mlle, ms指小姐, mme 指女士
    > full$Title[full$Title %in% c("Mlle","Ms")]<- "Miss" 
    > full$Title[full$Title %in% c("Mme","Lady")]<- "Mrs"
    > full$Title[full$Title =="Sir"]<- "Mr"
    > full$Title[full$Title %in% rare_title2]  <- "Rare Title"
    > # 查看重新调整后的情况
    > table(full$Sex, full$Title)
            
             Master Miss  Mr Mrs Rare Title
      female      0  264   0 199          3
      male       61    0 758   0         24
    > # 从乘客姓名中,提取姓氏,作为家庭变量
    > full$Surname <- sapply(full$Name, function(x) strsplit(x, split = '[,.]')[[1]][1])
    > length(unique(full$Surname))
    [1] 875
    

    Megan L. Risdal 的文章中,将乘客姓氏也提取出来,提示可以发掘乘客姓氏之间的联系,但没有进行进一步操,我们这里也就不探讨了。

    3.2 观察家庭情况

    处理完乘客姓名这一变量,我们再考虑衍生一些家庭相关的变量。基于已有变量SubSp和Parch生成家庭人数family size 这一变量。

    > # 生成家庭人数变量,包括自己在内
    > full$Fsize <- full$SibSp + full$Parch + 1
    > # 生成一个家庭变量:格式为姓_家庭人数
    > full$Family <- paste(full$Surname, full$Fsize, sep="_")
    > # 使用 ggplot2 绘制家庭人数与生存情况之间的关系
    > ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
    +   geom_bar(stat="count", position="dodge") +
    +   scale_x_continuous(breaks=c(1:11)) +
    +   labs(x = "Family Size") +
    +   theme_few()
    

    通过图形我们可以明显发现以下特点:

    • 个人和家庭人数>4的家庭,存活人数小于死亡人数
    • 家庭人数在[2:4]的家庭,存活人数大于死亡人数

    因此,我们可以将家庭人数变量进行分段合并,明显的可以分为3段:个人,小家庭,大家庭,由此生成新变量。

    > # 离散化
    > full$FsizeD[full$Fsize == 1] <- "single"
    > full$FsizeD[full$Fsize < 5 & full$Fsize > 1]<- "small"
    > full$FsizeD[full$Fsize > 4] <- "large"
    > # 通过马赛克图查看家庭规模与生存情况之间的关系
    > mosaicplot(table(full$FsizeD,full$Survived), main="Family Size by Survival", shade=TRUE)
    

    显而易见,个人与大家庭不利于在沉船事故中生存,而小家庭当中生存率相对较高。

    3.3 尝试生成更多变量

    在乘客客舱变量Cabin中,也存在一些有价值的信息,如客舱层数deck。

    > # 可以看出这一变量有很多缺失值,有单个客户多个客舱,格式基本为字母+数字
    > head(full$Cabin,30)
     [1] ""            "C85"         ""            "C123"        ""            ""            "E46"        
     [8] ""            ""            ""            "G6"          "C103"        ""            ""           
    [15] ""            ""            ""            ""            ""            ""            ""           
    [22] "D56"         ""            "A6"          ""            ""            ""            "C23 C25 C27"
    [29] ""            ""           
    > # 假设第一个字母即为客舱层数,建立一个层数变量(Deck),取值范围A - z:
    > full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
    > summary(full$Deck)
       A    B    C    D    E    F    G    T NA's 
      22   65   94   46   41   21    5    1 1014 
    

    上面只是对变量进行了基本处理,还有很多可以进一步操作的地方,如有些乘客名下包含很多间房 (如28行, "C23 C25 C27"), 但是这一变量有1014 个缺失值,占比太高了。 后面就不再进一步考虑。

    4.处理缺失值

    我们先找到所有的缺失数据看看情况:

    • 数值型数据:age缺失263个,fare缺失1个
    • 字符型数据:cabin缺失1014个,embarked缺失2个
    • 因子型数据:deck缺失1014个
    > sapply(full,function(x) {sum(is.na(x))})
    PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket 
              0         418           0           0           0         263           0           0           0 
           Fare       Cabin    Embarked       Title     Surname       Fsize      Family      FsizeD        Deck 
              1           0           0           0           0           0           0           0        1014 
    > sapply(full,function(x) {sum(x == "",na.rm=TRUE)})
    PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket 
              0           0           0           0           0           0           0           0           0 
           Fare       Cabin    Embarked       Title     Surname       Fsize      Family      FsizeD        Deck 
              0        1014           2           0           0           0           0           0           0 
    

    处理缺失值的方法有很多种,考虑到数据集本身较小,样本数也不多,因而不能直接整行或者整列删除缺失值样本。我们考虑对于缺失值较少的,用均值或中位数填补,缺失值较多的通过现有数据和变量进行预估填补。

    4.1 登船港口缺失

    找到缺失数据在哪一行,观察情况。

    > filter(full,full$Embarked=="")
      PassengerId Survived Pclass                                      Name    Sex Age SibSp Parch Ticket Fare
    1          62        1      1                       Icard, Miss. Amelie female  38     0     0 113572   80
    2         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn) female  62     0     0 113572   80
      Cabin Embarked Title Surname Fsize  Family FsizeD Deck
    1   B28           Miss   Icard     1 Icard_1 single    B
    2   B28            Mrs   Stone     1 Stone_1 single    B
    

    我们可以看到他们支付的票价都是$ 80,舱位等级都是1,我们可以大胆推论有相同舱位等级(passenger class)和票价(Fare)的乘客,也许有着相同的登船港口?

    > # 去除缺失值乘客的ID
    > embark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)
    > # 用 ggplot2 绘制embarkment, passenger class, & median fare 三者关系图
    > ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
    +   geom_boxplot() +
    +   geom_hline(aes(yintercept=80), 
    +              colour="red", linetype="dashed", lwd=2) +
    +   scale_y_continuous(labels=dollar_format()) +
    +   theme_few()
    

    很明显,港口C出发的头等舱支付的票价中位数为80,因此我们可以把两个缺失值替换为C。

    > full$Embarked[c(62, 830)] <- "C"
    

    4.1 票价缺失

    找到缺失数据在哪一行,观察情况。

    > filter(full,is.na(full$Fare))
      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch Ticket Fare Cabin Embarked Title Surname
    1        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0   3701   NA              S    Mr  Storey
      Fsize   Family FsizeD Deck
    1     1 Storey_1 single <NA>
    

    这是从港口S出发的三等舱乘客,根据上面登船港口缺失值填补的推论,我们可以作图看看:

    > ggplot(full[full$Pclass == "3" & full$Embarked == "S", ], 
    +        aes(x = Fare)) +
    +   geom_density(fill = "#99d6ff", alpha=0.4) +  
    +   geom_vline(aes(xintercept=median(Fare, na.rm=T)),colour="red", linetype="dashed", lwd=1) +
    +   scale_x_continuous(labels=dollar_format()) + 
    +   theme_few()
    

    从得到的图形上看,将缺失值用中位数进行替换是合理的。替换数值为$8.05。

    > full$Fare[1044] <- median(full[full$Pclass==3&full$Embarked=="S",]$Fare,na.rm=TRUE)
    > full$Fare[1044]
    [1] 8.05
    

    4.3 年龄缺失

    用户年龄(Age) 中有大量缺失存在,简单用中位数或均值肯定不妥,这里我们用mice包的随机插补,将基于其他变量构建一个预测模型对年龄缺失值进行填补。

    > # 使变量因子化
    > factor_vars <- c("PassengerId","Pclass","Sex","Embarked",
    +                  "Title","Surname","Family","FsizeD")
    > full[factor_vars] <- lapply(full[factor_vars],function(x) as.factor(x))
    > # 设置随机种子
    > set.seed(123)
    > # 执行多重插补法,剔除一些没什么用的变量
    > mice_mod <- mice(full[, !names(full) %in% 
    +                         c("PassengerId","Name","Ticket","Cabin",
    +                           "Family","Surname","Survived")], 
    +                  method="rf") 
    
     iter imp variable
      1   1  Age  Deck
      1   2  Age  Deck
      1   3  Age  Deck
      1   4  Age  Deck
      1   5  Age  Deck
      2   1  Age  Deck
      2   2  Age  Deck
      2   3  Age  Deck
      2   4  Age  Deck
      2   5  Age  Deck
      3   1  Age  Deck
      3   2  Age  Deck
      3   3  Age  Deck
      3   4  Age  Deck
      3   5  Age  Deck
      4   1  Age  Deck
      4   2  Age  Deck
      4   3  Age  Deck
      4   4  Age  Deck
      4   5  Age  Deck
      5   1  Age  Deck
      5   2  Age  Deck
      5   3  Age  Deck
      5   4  Age  Deck
      5   5  Age  Deck
    > # 保存完整输出 
    > mice_output <- complete(mice_mod)
    > # 绘制年龄分布图
    > par(mfrow=c(1,2))
    > hist(full$Age, freq=F, main="Age: Original Data", 
    +      col="darkgreen", ylim=c(0,0.04))
    > hist(mice_output$Age, freq=F, main="Age: MICE Output", 
    +      col="lightgreen", ylim=c(0,0.04))
    

    从上面的图来看,数据填补前后,并没有发生明显的偏移,随机插补应该是有效的,那么下面可以用mice模型的结果对原年龄数据进行替换。

    > full$Age <- mice_output$Age
    > # 检查缺失值是否被完全替换了
    > sum(is.na(full$Age))
    [1] 0
    

    5.特征工程

    现在我们知道每一位乘客的年龄,那么基于“妇女与儿童优先”的原则,我们可以生成一些变量,如儿童(Child)和 母亲(Mother)。
    划分标准:

    • 儿童 : 年龄Age < 18
    • 母亲 : 女性+年龄 > 18+拥有超过1个子女+头衔不是'Miss'
    > # 首先我们来看年龄与生存情况之间的关系
    > ggplot(full[1:891,], aes(Age, fill = factor(Survived))) + 
    +   geom_histogram() + 
    +   # 分性别来看,因为我们知道性别对于生存情况有重要影响
    +   facet_grid(.~Sex) + 
    +   theme_few()
    

    生成child变量, 并且基于此划分儿童child与成人adult。

    > full$child[full$Age < 18] <- "Child"
    > full$child[full$Age >= 18] <- "Adult"
    > # 展示对应人数
    > table(full$child, full$Survived)
           
              0   1
      Adult 479 270
      Child  70  72
    

    下面来生成母亲这个变量。

    > #增加mother变量
    > full$Mother <- "NotMother"
    > full$Mother[full$Sex == "female" & full$Parch > 0 & full$Age > 18 & full$Title != "Miss"] <- "Mother"
    > table(full$Mother, full$Survived)
               
                  0   1
      Mother     15  39
      NotMother 534 303
    

    将child和mother变量转化成因子类型

    > full$child <- factor(full$child)
    > full$Mother <- factor(full$Mother)
    

    至此,所有我们需要的变量都已经生成,并且其中没有缺失值。 为了保险起见,我们进行二次确认。

    > sapply(full,function(x){sum(is.na(x))})
    PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket 
              0         418           0           0           0           0           0           0           0 
           Fare       Cabin    Embarked       Title     Surname       Fsize      Family      FsizeD        Deck 
              0           0           0           0           0           0           0           0        1014 
          child      Mother 
              0           0 
    > sapply(full,function(x){sum(x=="",na.rm=T)})
    PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket 
              0           0           0           0           0           0           0           0           0 
           Fare       Cabin    Embarked       Title     Surname       Fsize      Family      FsizeD        Deck 
              0        1014           0           0           0           0           0           0           0 
          child      Mother 
              0           0
    

    6.模型设定与预测

    在完成上面的工作之后,我们进入到最后一步:预测泰坦尼克号上乘客的生存状况。 在这里我们使用随机森林分类算法(The RandomForest Classification Algorithm) 。
    第一步,拆分训练集与测试集

    > train <- full[1:891,]
    > test <- full[892:1309,]
    

    第二步, 建立模型

    > # 设置随机种子
    > set.seed(1234)
    > # 建立模型 (注意: 不是所有可用变量全部加入)
    > rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp +Parch + 
    +                            Fare + Embarked + Title + FsizeD + child + Mother, 
    +                          data = train)
    > # 显示模型误差
    > plot(rf_model, ylim=c(0,0.36))
    > legend("topright", colnames(rf_model$err.rate), col=1:3, fill=1:3)
    

    黑色那条线表示:整体误差率(the overall error rate)保持在20% 左右
    红色和绿色线分别表示:遇难与生还的误差率,红线保持在10%,绿线保持在30%左右。
    我们的模型,在预测死亡情况时更准确一些。

    第三步, 查看变量重要性

    > importance <- importance(rf_model)
    > varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,"MeanDecreaseGini"],2))
    > rankImportance <- varImportance %>% 
    +   mutate(Rank = paste0("#",dense_rank(desc(Importance))))
    > # 作图
    > ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +             
    +   geom_bar(stat="identity") +  
    +   geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = "red") + labs(x = "Variables") +  
    +   coord_flip()
    

    我们从图上可以看出,头衔和票价对于生存情况影响最大,其次是性别和年龄,而乘客舱位排第五。 而母亲和孩子对于生存与否的影响最小,真是有点出乎意料。

    第四步, 预测

    > # 将模型带入测试集
    > prediction <- predict(rf_model, test)
    > # 保存结果
    > solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
    > # 输出结果到CSV文件格式
    > write.csv(solution, file = "rf_mod_Solution.csv", row.names = F)
    

    7.总结

    本次数据分析是基于Megan L. Risdal 的文章,算是一次深度模仿,而且对于其中的一些操作还有不太理解的部分,还应该多看多学,争取早日独立完成一项数据分析工作。

    相关文章

      网友评论

      本文标题:【Kaggle】泰坦尼克号生存训练

      本文链接:https://www.haomeiwen.com/subject/lahymxtx.html