《Machine Learning with R Cookboo

Author: zd200572 | Published: 2021-08-21 21:25

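Exploring the RMS Titanic dataset

Six stages

Asking a question, collecting data, cleaning data, basic exploratory analysis, advanced analysis, and model assessment.
Straight to the code!
Download the data first; reaching the download address may require a proxy.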

# Read the data (empty strings are treated as NA)
train.data <- read.csv("train.csv", na.strings = c("NA", ""))
# Type conversion: turn categorical variables into factors
train.data$Survived <- factor(train.data$Survived)
train.data$Pclass <- factor(train.data$Pclass)
# Detect missing values: proportion of NA in Age, then in every column
mean(is.na(train.data$Age))
sapply(train.data, function(col) mean(is.na(col)))
# Visualize the missing data
# install.packages("Amelia")
library(Amelia)
missmap(train.data, main = "Missing data map")
# Launch Amelia's interactive GUI
AmeliaView()

AmeliaView comes with an interactive GUI, which is a nice touch!
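The per-column missing rate can also be computed in one vectorized call. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the Titanic data), so it runs without `train.csv`:

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  Age      = c(22, NA, 26, NA, 35),
  Embarked = c("S", "C", NA, "S", "Q"),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

# is.na() on a data frame returns a logical matrix;
# colMeans() then gives the proportion of missing values per column
missing_rate <- colMeans(is.na(toy))
print(missing_rate)
```

On the real data, `colMeans(is.na(train.data))` should give the same proportions as the `sapply()` call above.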


table(train.data$Embarked, useNA = "always")
#    C    Q    S <NA>
#  168   77  644    2
# Impute the missing values with the most likely value, "S" (the most frequent port)
train.data$Embarked[is.na(train.data$Embarked)] <- "S"
table(train.data$Embarked, useNA = "always")
#    C    Q    S <NA>
#  168   77  646    0
# Collect the different titles that appear in the names
train.data$Name <- as.character(train.data$Name)
# First split the names on whitespace
table_words <- table(unlist(strsplit(train.data$Name, "\\s+")))
str(table_words)
# 'table' int [1:1673(1d)] 1 1 1 1 1 1 1 1 1 1 ...
# - attr(*, "dimnames")=List of 1
# ..$ : chr [1:1673] "\"Andy\"" "\"Annie" "\"Annie\"" "\"Archie\"" ...
# Keep the tokens that contain a period and sort them by frequency
sort(table_words[grep("\\.", names(table_words))],
     decreasing = TRUE)
#       Mr.     Miss.      Mrs.   Master.       Dr.      Rev.
#       517       182       125        40         7         6
#      Col.    Major.     Mlle.     Capt. Countess.      Don.
#         2         2         2         1         1         1
# Jonkheer.        L.     Lady.      Mme.       Ms.      Sir.
#         1         1         1         1         1         1
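Counting tokens shows which titles exist, but it does not attach a title to each passenger. A minimal sketch of one way to do that with a regular expression follows. It assumes the names use the "Surname, Title. Given names" layout; `sample_names` is a hand-typed illustration and `titles` is a name introduced here, neither comes from the post.

```r
# Illustrative names in the assumed "Surname, Title. Given names" layout
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard"
)

# Capture the text between the comma and the first period after it
titles <- sub("^[^,]*,\\s*([^.]*)\\..*$", "\\1", sample_names)
print(table(titles))
```

Because only the first period after the comma is used, an initial such as "L." later in a name should not be mistaken for a title, unlike in the token count above.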

Exploring and visualizing the data

Building the decision tree

     Conditional inference tree with 6 terminal nodes

Response:  Survived 
Inputs:  Pclass, Sex, Age, SibSp, Fare, Parch, Embarked 
Number of observations:  623 

1) Sex == {female}; criterion = 1, statistic = 180.207
  2) Pclass == {3}; criterion = 1, statistic = 59.081
    3)*  weights = 92 
  2) Pclass == {1, 2}
    4)*  weights = 111 
1) Sex == {male}
  5) Pclass == {1}; criterion = 1, statistic = 21.732
    6) Age <= 38; criterion = 0.99, statistic = 10.105
      7)*  weights = 57 
    6) Age > 38
      8)*  weights = 39 
  5) Pclass == {2, 3}
    9) Age <= 3; criterion = 1, statistic = 22.354
      10)*  weights = 12 
    9) Age > 3
      11)*  weights = 312

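Compared with rpart, the conditional inference tree from the party package avoids rpart's variable selection bias: rpart tends to favor variables that offer many possible splits or contain many missing values.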
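The post does not show the code that produced the tree, so here is a rough sketch of how such a fit might look. Everything in it is an assumption for illustration: the synthetic `toy` data, the 70/30 split, the seed, and the reduced set of predictors. The `ctree()` call only runs if the party package is installed.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the cleaned data (hypothetical values)
n <- 200
toy <- data.frame(
  Survived = factor(sample(0:1, n, replace = TRUE)),
  Pclass   = factor(sample(1:3, n, replace = TRUE)),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 70))
)

# Random 70/30 split into training and test rows (assumed proportion)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
trainset  <- toy[train_idx, ]
testset   <- toy[-train_idx, ]

# Fit a conditional inference tree and predict on the held-out rows,
# but only when the party package is available
if (requireNamespace("party", quietly = TRUE)) {
  fit  <- party::ctree(Survived ~ Pclass + Sex + Age, data = trainset)
  pred <- predict(fit, newdata = testset)
  print(table(Prediction = pred, Reference = testset$Survived))
}
```

With the real data, the formula would presumably list the inputs reported in the tree output (Pclass, Sex, Age, SibSp, Fare, Parch, Embarked), and the predictions on the test set are what a confusion matrix is built from.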

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 146  57
         1   7  57
                                          
               Accuracy : 0.7603          
                 95% CI : (0.7045, 0.8102)
    No Information Rate : 0.573           
    P-Value [Acc > NIR] : 1.279e-10       
                                          
                  Kappa : 0.4811          
                                          
 Mcnemar's Test P-Value : 9.068e-10       
                                          
            Sensitivity : 0.9542          
            Specificity : 0.5000          
         Pos Pred Value : 0.7192          
         Neg Pred Value : 0.8906          
             Prevalence : 0.5730          
         Detection Rate : 0.5468          
   Detection Prevalence : 0.7603          
      Balanced Accuracy : 0.7271          
                                          
       'Positive' Class : 0 
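The headline figures in this report can be re-derived by hand from the four cell counts. The sketch below copies the counts from the matrix above and treats class 0 as the positive class, as the report does. The variable names are introduced here for illustration.

```r
# Confusion matrix copied from the report (rows = prediction, columns = reference)
cm <- matrix(c(146, 7, 57, 57), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual class-0 cases predicted as 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity: share of actual class-1 cases predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Balanced accuracy: mean of sensitivity and specificity
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

The specificity of 0.5 means half of the actual class-1 cases were predicted as 0, which is why the balanced accuracy (0.7271) is lower than the plain accuracy (0.7603).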
