Example Datasets for Machine Learning

Author: 小贝学生信 | Published: 2021-10-30 01:30

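    The four datasets below (house sale prices, employee attrition, handwritten digit recognition, and customer purchases) are commonly used as example data when demonstrating machine learning algorithms, so I am recording them here.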

    1. House price data

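    • Property attributes and sale prices of houses in Ames, Iowa
    • Suited to supervised regression algorithms, with the goal of predicting house prices
    • Size: 2,930 houses, with 80 house attributes (features) plus the sale price column (response variable)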
    ames <- AmesHousing::make_ames()
    dim(ames)
    ## [1] 2930   81
    
    set.seed(123)
    library(rsample)
    split <- initial_split(ames, prop = 0.7, 
                           strata = "Sale_Price")
    ames_train  <- training(split)
    dim(ames_train)
    ## [1] 2049   81
    ames_test   <- testing(split)
    dim(ames_test)
    ## [1] 881  81
    
    # All information for the first house. Sale_Price is the third column from the end; the other features are a mix of categorical and numeric variables
    t(ames[1,])
    #                    [,1]                                 
    # MS_SubClass        "One_Story_1946_and_Newer_All_Styles"
    # MS_Zoning          "Residential_Low_Density"            
    # Lot_Frontage       "141"                                
    # Lot_Area           "31770"                              
    # Street             "Pave"                               
    # Alley              "No_Alley_Access"                    
    # Lot_Shape          "Slightly_Irregular"                 
    # Land_Contour       "Lvl"                                
    # Utilities          "AllPub"                             
    # Lot_Config         "Corner"                             
    # Land_Slope         "Gtl"                                
    # Neighborhood       "North_Ames"                         
    # Condition_1        "Norm"                               
    # Condition_2        "Norm"                               
    # Bldg_Type          "OneFam"                             
    # House_Style        "One_Story"                          
    # Overall_Qual       "Above_Average"                      
    # Overall_Cond       "Average"                            
    # Year_Built         "1960"                               
    # Year_Remod_Add     "1960"                               
    # Roof_Style         "Hip"                                
    # Roof_Matl          "CompShg"                            
    # Exterior_1st       "BrkFace"                            
    # Exterior_2nd       "Plywood"                            
    # Mas_Vnr_Type       "Stone" 
    # Mas_Vnr_Area       "112"                                
    # Exter_Qual         "Typical"                            
    # Exter_Cond         "Typical"                            
    # Foundation         "CBlock"                             
    # Bsmt_Qual          "Typical"                            
    # Bsmt_Cond          "Good"                               
    # Bsmt_Exposure      "Gd"                                 
    # BsmtFin_Type_1     "BLQ"                                
    # BsmtFin_SF_1       "2"                                  
    # BsmtFin_Type_2     "Unf"                                
    # BsmtFin_SF_2       "0"                                  
    # Bsmt_Unf_SF        "441"                                
    # Total_Bsmt_SF      "1080"                               
    # Heating            "GasA"                               
    # Heating_QC         "Fair"                               
    # Central_Air        "Y"                                  
    # Electrical         "SBrkr"                              
    # First_Flr_SF       "1656"                               
    # Second_Flr_SF      "0"                                  
    # Low_Qual_Fin_SF    "0"                                  
    # Gr_Liv_Area        "1656"                               
    # Bsmt_Full_Bath     "1"                                  
    # Bsmt_Half_Bath     "0"                                  
    # Full_Bath          "1"  
    # Half_Bath          "0"                                  
    # Bedroom_AbvGr      "3"                                  
    # Kitchen_AbvGr      "1"                                  
    # Kitchen_Qual       "Typical"                            
    # TotRms_AbvGrd      "7"                                  
    # Functional         "Typ"                                
    # Fireplaces         "2"                                  
    # Fireplace_Qu       "Good"                               
    # Garage_Type        "Attchd"                             
    # Garage_Finish      "Fin"                                
    # Garage_Cars        "2"                                  
    # Garage_Area        "528"                                
    # Garage_Qual        "Typical"                            
    # Garage_Cond        "Typical"                            
    # Paved_Drive        "Partial_Pavement"                   
    # Wood_Deck_SF       "210"                                
    # Open_Porch_SF      "62"                                 
    # Enclosed_Porch     "0"                                  
    # Three_season_porch "0"                                  
    # Screen_Porch       "0"                                  
    # Pool_Area          "0"                                  
    # Pool_QC            "No_Pool"                            
    # Fence              "No_Fence"                           
    # Misc_Feature       "None"                               
    # Misc_Val           "0"                                  
    # Mo_Sold            "5"     
    # Year_Sold          "2010"                               
    # Sale_Type          "WD "                                
    # Sale_Condition     "Normal"                             
    # Sale_Price         "215000"                             
    # Longitude          "-93.61975"                          
    # Latitude           "42.05403"
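
    The stratified split above can be sketched in base R to show the idea. This is a rough illustration, not the rsample implementation: it assumes that a numeric response is binned into quartiles and that 70% of the rows are sampled inside each bin. The `price` vector is simulated stand-in data, not the Ames data, and all object names are made up for this sketch.

    ```r
    set.seed(123)

    # Simulated, right-skewed stand-in for a price column (NOT the Ames data)
    price <- round(rlnorm(1000, meanlog = 12, sdlog = 0.4))

    # Assign each row to a quartile bin of the response (integer codes 1 to 4)
    bins <- cut(price,
                breaks = quantile(price, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE, labels = FALSE)

    # Draw 70% of the row indices inside every bin
    train_idx <- unlist(lapply(split(seq_along(price), bins), function(idx) {
      idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
    }), use.names = FALSE)

    # Everything that was not drawn goes to the test set
    test_idx <- setdiff(seq_along(price), train_idx)

    # Compare the response distribution in the two parts
    summary(price[train_idx])
    summary(price[test_idx])
    ```

    If the stratification does its job, the two summaries should look similar, which is the reason for stratifying on a skewed response such as a sale price.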
    

    2. Employee attrition data

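    • Records each employee's basic information, job-related information, and whether they left the company
    • Suited to supervised binary classification algorithms, to predict whether an employee will leave
    • Size: 1,470 employees, with 30 personal and job-related attributes (features) plus whether the employee ultimately left (response variable)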
    library(modeldata)
    data(attrition)
    # initial dimension
    dim(attrition)
    ## [1] 1470   31
    
    # All information for the first employee; whether the employee left is in the second column (Attrition)
    t(attrition[1,])
    #                           1                
    # Age                      "41"             
    # Attrition                "Yes"            
    # BusinessTravel           "Travel_Rarely"  
    # DailyRate                "1102"           
    # Department               "Sales"          
    # DistanceFromHome         "1"              
    # Education                "College"        
    # EducationField           "Life_Sciences"  
    # EnvironmentSatisfaction  "Medium"         
    # Gender                   "Female"         
    # HourlyRate               "94"             
    # JobInvolvement           "High"           
    # JobLevel                 "2"              
    # JobRole                  "Sales_Executive"
    # JobSatisfaction          "Very_High"      
    # MaritalStatus            "Single"         
    # MonthlyIncome            "5993"           
    # MonthlyRate              "19479"          
    # NumCompaniesWorked       "8"              
    # OverTime                 "Yes"            
    # PercentSalaryHike        "11"             
    # PerformanceRating        "Excellent"      
    # RelationshipSatisfaction "Low"            
    # StockOptionLevel         "0"              
    # TotalWorkingYears        "8"              
    # TrainingTimesLastYear    "0"              
    # WorkLifeBalance          "Bad"            
    # YearsAtCompany           "6"              
    # YearsInCurrentRole       "4"              
    # YearsSinceLastPromotion  "0"              
    # YearsWithCurrManager     "5"
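
    Since the response here is a two-level factor, it may be worth checking the class proportions before fitting a classifier. The following is a minimal sketch, assuming the modeldata and rsample packages are installed; the names `churn_split`, `churn_train` and `churn_test`, and the 70/30 proportion, are choices made for this illustration only.

    ```r
    library(modeldata)
    library(rsample)

    data(attrition)

    # Share of each class (No / Yes) in the full data
    prop.table(table(attrition$Attrition))

    # Stratify on the response so both parts keep a similar class balance
    set.seed(123)
    churn_split <- initial_split(attrition, prop = 0.7, strata = "Attrition")
    churn_train <- training(churn_split)
    churn_test  <- testing(churn_split)

    # Class shares after splitting
    prop.table(table(churn_train$Attrition))
    prop.table(table(churn_test$Attrition))
    ```

    When one class is much rarer than the other, a split that ignores the response can leave the test set with too few positive cases, so stratifying is a cheap safeguard.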
    

    3. Handwritten digit data

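    • Pixel values of handwritten digit images (digits 0 to 9), provided by AT&T Bell Laboratories
    • Suited to supervised multiclass classification algorithms, to predict the digit shown in an image
    • Size: the dataset comes already split into 60,000 training samples and 10,000 test samples; each image has 784 features (28 * 28 pixels)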
    mnist <- dslabs::read_mnist()
    names(mnist)
    ## [1] "train" "test"
    
    dim(mnist$train$images)
    ## [1] 60000   784
    
    # Values of the first 10 features for the first 6 samples in the training set
    mnist$train$images[1:6,1:10]
    #       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    # [1,]    0    0    0    0    0    0    0    0    0     0
    # [2,]    0    0    0    0    0    0    0    0    0     0
    # [3,]    0    0    0    0    0    0    0    0    0     0
    # [4,]    0    0    0    0    0    0    0    0    0     0
    # [5,]    0    0    0    0    0    0    0    0    0     0
    # [6,]    0    0    0    0    0    0    0    0    0     0
    # Labels of the first 6 samples in the training set (the true digit for each image)
    head(mnist$train$labels, 6)
    ## [1] 5 0 4 1 9 2
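
    Each row of the image matrix is a flattened 28 * 28 picture, so it has to be reshaped before it can be looked at. Below is a small sketch of that reshaping on a synthetic pixel vector, so it runs without downloading anything. It assumes the pixels are stored image row by image row with values from 0 to 255; the helper name `pixels_to_matrix` is hypothetical.

    ```r
    # Hypothetical helper: turn one 784-long pixel vector into a 28 x 28 matrix
    pixels_to_matrix <- function(pixels, side = 28) {
      stopifnot(length(pixels) == side * side)
      # matrix() fills column by column, so column j holds image row j
      matrix(pixels, nrow = side, ncol = side)
    }

    # Synthetic stand-in for one image: a short horizontal stroke
    # in image row 5, image columns 10 to 18
    fake_pixels <- integer(784)
    fake_pixels[(5 - 1) * 28 + 10:18] <- 255L

    m <- pixels_to_matrix(fake_pixels)

    # image() draws y from bottom to top, so reverse the columns to show
    # the picture the right way up
    image(1:28, 1:28, m[, 28:1], col = gray(seq(0, 1, 0.05)),
          xlab = "", ylab = "")
    ```

    With the real data, `mnist$train$images[1, ]` would take the place of `fake_pixels`, and dividing by 255 would rescale the features to the range 0 to 1 if a model needs that.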
    

    4. Supermarket customer basket data

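    • Each row records the contents of one supermarket customer's shopping basket
    • Suited to unsupervised clustering algorithms, to group customers into a number of meaningful segments
    • Size: purchases of 42 items by 2,000 customers (an item that was not bought is recorded as 0)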
    url <- "https://koalaverse.github.io/homlr/data/my_basket.csv"
    my_basket <- readr::read_csv(url)
    
    dim(my_basket)
    ## [1] 2000   42
    
    # Purchases of the first customer
    t(my_basket[1,])
    #                 [,1]
    # 7up               0
    # lasagna           0
    # pepsi             0
    # yop               0
    # red.wine          0
    # cheese            0
    # bbq               0
    # bulmers           0
    # mayonnaise        0
    # horlics           0
    # chicken.tikka     0
    # milk              0
    # mars              2
    # coke              0
    # lottery           0
    # bread             1
    # pizza             0
    # sunny.delight     0
    # ham               1
    # lettuce           0
    # kronenbourg       0
    # leeks             0
    # fanta             0
    # tea               0
    # whiskey           0
    # peas              0
    # newspaper         2
    # muesli            0
    # white.wine        0
    # carrots           0
    # spinach           0
    # pate              0
    # instant.coffee    0
    # twix              0
    # potatoes          0
    # fosters           0
    # soup              0
    # toad.in.hole      0
    # coco.pops         0
    # kitkat            1
    # broccoli          0
    # cigarettes        0
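
    To give a feel for the clustering use case, here is a minimal k-means sketch on a toy basket matrix. Everything in it is an assumption made for illustration: the data are simulated (60 customers, 4 items, two purchasing patterns), k-means is only one possible clustering method, and the choice of 2 clusters is fixed by how the toy data were generated.

    ```r
    set.seed(123)

    # Toy stand-in for the basket data (NOT my_basket):
    # the first 30 customers mostly buy bread and milk,
    # the last 30 mostly buy coke and pizza
    toy_basket <- rbind(
      matrix(rpois(30 * 4, lambda = rep(c(3, 3, 0.2, 0.2), each = 30)), nrow = 30),
      matrix(rpois(30 * 4, lambda = rep(c(0.2, 0.2, 3, 3), each = 30)), nrow = 30)
    )
    colnames(toy_basket) <- c("bread", "milk", "coke", "pizza")

    # k-means with several random starts to avoid a poor local solution
    km <- kmeans(toy_basket, centers = 2, nstart = 10)

    # Cluster sizes and the average basket in each cluster
    table(km$cluster)
    round(km$centers, 2)
    ```

    On the real data the same call would take `my_basket` in place of `toy_basket`, and the number of clusters would need to be chosen rather than known in advance.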
    
