Example datasets for machine learning

Author: 小贝学生信 | Published 2021-10-30 01:30


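The four datasets below (house sale prices, employee attrition, handwritten digit recognition, and customer purchases) are commonly used to demonstrate machine learning algorithms, so they are recorded here for later reference.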

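1. Housing price data

Property attributes and sale prices for houses in Ames, Iowa.
Suited to supervised regression algorithms that predict the sale price.
Size: 2,930 houses, with 80 feature columns plus the sale price column (the response variable).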

ames <- AmesHousing::make_ames()
dim(ames)
## [1] 2930   81

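library(rsample)
set.seed(123)  # make the random split reproducible
# 70/30 split, stratified on the response so both sets get a similar price distribution
split <- initial_split(ames, prop = 0.7, 
                       strata = "Sale_Price")
ames_train  <- training(split)
dim(ames_train)
## [1] 2049   81
ames_test   <- testing(split)
dim(ames_test)
## [1] 881  81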

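# All fields for the first house. Sale_Price is the third-to-last column;
# the other features are a mix of categorical and numeric variables.
# t() coerces the mixed-type row to a character matrix, so numbers print in quotes.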
t(ames[1,])
#                    [,1]                                 
# MS_SubClass        "One_Story_1946_and_Newer_All_Styles"
# MS_Zoning          "Residential_Low_Density"            
# Lot_Frontage       "141"                                
# Lot_Area           "31770"                              
# Street             "Pave"                               
# Alley              "No_Alley_Access"                    
# Lot_Shape          "Slightly_Irregular"                 
# Land_Contour       "Lvl"                                
# Utilities          "AllPub"                             
# Lot_Config         "Corner"                             
# Land_Slope         "Gtl"                                
# Neighborhood       "North_Ames"                         
# Condition_1        "Norm"                               
# Condition_2        "Norm"                               
# Bldg_Type          "OneFam"                             
# House_Style        "One_Story"                          
# Overall_Qual       "Above_Average"                      
# Overall_Cond       "Average"                            
# Year_Built         "1960"                               
# Year_Remod_Add     "1960"                               
# Roof_Style         "Hip"                                
# Roof_Matl          "CompShg"                            
# Exterior_1st       "BrkFace"                            
# Exterior_2nd       "Plywood"                            
# Mas_Vnr_Type       "Stone" 
# Mas_Vnr_Area       "112"                                
# Exter_Qual         "Typical"                            
# Exter_Cond         "Typical"                            
# Foundation         "CBlock"                             
# Bsmt_Qual          "Typical"                            
# Bsmt_Cond          "Good"                               
# Bsmt_Exposure      "Gd"                                 
# BsmtFin_Type_1     "BLQ"                                
# BsmtFin_SF_1       "2"                                  
# BsmtFin_Type_2     "Unf"                                
# BsmtFin_SF_2       "0"                                  
# Bsmt_Unf_SF        "441"                                
# Total_Bsmt_SF      "1080"                               
# Heating            "GasA"                               
# Heating_QC         "Fair"                               
# Central_Air        "Y"                                  
# Electrical         "SBrkr"                              
# First_Flr_SF       "1656"                               
# Second_Flr_SF      "0"                                  
# Low_Qual_Fin_SF    "0"                                  
# Gr_Liv_Area        "1656"                               
# Bsmt_Full_Bath     "1"                                  
# Bsmt_Half_Bath     "0"                                  
# Full_Bath          "1"  
# Half_Bath          "0"                                  
# Bedroom_AbvGr      "3"                                  
# Kitchen_AbvGr      "1"                                  
# Kitchen_Qual       "Typical"                            
# TotRms_AbvGrd      "7"                                  
# Functional         "Typ"                                
# Fireplaces         "2"                                  
# Fireplace_Qu       "Good"                               
# Garage_Type        "Attchd"                             
# Garage_Finish      "Fin"                                
# Garage_Cars        "2"                                  
# Garage_Area        "528"                                
# Garage_Qual        "Typical"                            
# Garage_Cond        "Typical"                            
# Paved_Drive        "Partial_Pavement"                   
# Wood_Deck_SF       "210"                                
# Open_Porch_SF      "62"                                 
# Enclosed_Porch     "0"                                  
# Three_season_porch "0"                                  
# Screen_Porch       "0"                                  
# Pool_Area          "0"                                  
# Pool_QC            "No_Pool"                            
# Fence              "No_Fence"                           
# Misc_Feature       "None"                               
# Misc_Val           "0"                                  
# Mo_Sold            "5"     
# Year_Sold          "2010"                               
# Sale_Type          "WD "                                
# Sale_Condition     "Normal"                             
# Sale_Price         "215000"                             
# Longitude          "-93.61975"                          
# Latitude           "42.05403"
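
As a quick check on the split above, the following sketch compares the quartiles of Sale_Price in the training and test sets, and counts how many columns are categorical versus numeric. It assumes the AmesHousing and rsample packages are installed. The names price_quartiles and column_types are illustrative and do not appear in the original post, and exact values may vary with the rsample version.

```r
library(rsample)

ames <- AmesHousing::make_ames()

set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")

# Quartiles of the response in each partition. Similar rows suggest the
# stratified split kept the Sale_Price distribution comparable.
price_quartiles <- rbind(
  train = quantile(training(split)$Sale_Price),
  test  = quantile(testing(split)$Sale_Price)
)
print(price_quartiles)

# Count columns by their primary class (factor, integer, numeric).
column_types <- table(vapply(ames, function(col) class(col)[1], character(1)))
print(column_types)
```

If the two rows of price_quartiles differ a lot, the split is worth redoing with a different seed or a different stratification setting before fitting any model.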

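2. Employee attrition data

Records each employee's basic information, job-related information, and whether they left the company.
Suited to supervised binary classification algorithms that predict whether an employee will leave.
Size: 1,470 employees, with 30 personal and job-related features plus the attrition outcome (the response variable).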

library(modeldata)
data(attrition)
# initial dimension
dim(attrition)
## [1] 1470   31

# All fields for the first employee; the response is the second column (Attrition)
t(attrition[1,])
#                           1                
# Age                      "41"             
# Attrition                "Yes"            
# BusinessTravel           "Travel_Rarely"  
# DailyRate                "1102"           
# Department               "Sales"          
# DistanceFromHome         "1"              
# Education                "College"        
# EducationField           "Life_Sciences"  
# EnvironmentSatisfaction  "Medium"         
# Gender                   "Female"         
# HourlyRate               "94"             
# JobInvolvement           "High"           
# JobLevel                 "2"              
# JobRole                  "Sales_Executive"
# JobSatisfaction          "Very_High"      
# MaritalStatus            "Single"         
# MonthlyIncome            "5993"           
# MonthlyRate              "19479"          
# NumCompaniesWorked       "8"              
# OverTime                 "Yes"            
# PercentSalaryHike        "11"             
# PerformanceRating        "Excellent"      
# RelationshipSatisfaction "Low"            
# StockOptionLevel         "0"              
# TotalWorkingYears        "8"              
# TrainingTimesLastYear    "0"              
# WorkLifeBalance          "Bad"            
# YearsAtCompany           "6"              
# YearsInCurrentRole       "4"              
# YearsSinceLastPromotion  "0"              
# YearsWithCurrManager     "5"
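
Because this is a binary classification problem, it helps to look at the class balance before modeling. The sketch below tabulates the response and then makes a split stratified on Attrition, reusing the rsample approach from the housing example. The 70/30 proportion and the name attrition_split are assumptions for illustration, not part of the original post.

```r
library(modeldata)
library(rsample)

data(attrition)

# Share of employees who left ("Yes") versus stayed ("No").
overall_balance <- prop.table(table(attrition$Attrition))
print(overall_balance)

# Stratify on the response so both partitions keep a similar Yes/No ratio.
set.seed(123)
attrition_split <- initial_split(attrition, prop = 0.7, strata = "Attrition")
train_balance <- prop.table(table(training(attrition_split)$Attrition))
print(train_balance)
```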

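3. Handwritten digit data

Pixel values of handwritten digit images (0 to 9), provided by AT&T Bell Laboratories.
Suited to supervised multiclass classification algorithms that predict which digit an image shows.
Size: the dataset comes pre-split into 60,000 training images and 10,000 test images; each image has 784 features (28 × 28 pixels).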

mnist <- dslabs::read_mnist()
names(mnist)
## [1] "train" "test"

dim(mnist$train$images)
## [1] 60000   784

# Values of the first 10 features for the first 6 training samples
mnist$train$images[1:6,1:10]
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    0    0    0    0    0    0    0    0    0     0
# [2,]    0    0    0    0    0    0    0    0    0     0
# [3,]    0    0    0    0    0    0    0    0    0     0
# [4,]    0    0    0    0    0    0    0    0    0     0
# [5,]    0    0    0    0    0    0    0    0    0     0
# [6,]    0    0    0    0    0    0    0    0    0     0
# Labels (the true digits) for the first 6 training samples
head(mnist$train$labels, 6)
## [1] 5 0 4 1 9 2
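
Each row of 784 values is a flattened 28 × 28 image, so one way to sanity-check the data is to reshape a row and draw it. This is a minimal sketch using base R graphics. The name digit is illustrative, and the grayscale palette is just one reasonable choice.

```r
mnist <- dslabs::read_mnist()

# Reshape the first training image from a length-784 vector to 28 x 28.
digit <- matrix(mnist$train$images[1, ], nrow = 28, ncol = 28)

# image() draws the y axis bottom-up, so reverse the columns
# to show the digit upright.
image(1:28, 1:28, digit[, 28:1],
      col = gray(seq(1, 0, length.out = 256)),
      axes = FALSE, xlab = "", ylab = "",
      main = paste("Label:", mnist$train$labels[1]))
```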

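4. Supermarket customer purchases

Each row records the contents of one customer's shopping basket.
Suited to unsupervised clustering algorithms that group customers into meaningful segments.
Size: 2,000 customers and 42 items; each value is the quantity purchased (0 means the item was not bought).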

url <- "https://koalaverse.github.io/homlr/data/my_basket.csv"
my_basket <- readr::read_csv(url)

dim(my_basket)
## [1] 2000   42

# Purchases of the first customer
t(my_basket[1,])
#                 [,1]
# 7up               0
# lasagna           0
# pepsi             0
# yop               0
# red.wine          0
# cheese            0
# bbq               0
# bulmers           0
# mayonnaise        0
# horlics           0
# chicken.tikka     0
# milk              0
# mars              2
# coke              0
# lottery           0
# bread             1
# pizza             0
# sunny.delight     0
# ham               1
# lettuce           0
# kronenbourg       0
# leeks             0
# fanta             0
# tea               0
# whiskey           0
# peas              0
# newspaper         2
# muesli            0
# white.wine        0
# carrots           0
# spinach           0
# pate              0
# instant.coffee    0
# twix              0
# potatoes          0
# fosters           0
# soup              0
# toad.in.hole      0
# coco.pops         0
# kitkat            1
# broccoli          0
# cigarettes        0
