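The following four datasets (house sale prices, employee attrition, handwritten digit recognition, and customer purchases) are commonly used as examples when demonstrating machine learning algorithms, so I am noting them down here.
1. Housing price data
- Property attributes and sale prices of houses in Ames, Iowa;
- Suited to supervised regression algorithms in machine learning, for predicting house prices
- Size: 2930 houses; 80 house features plus the sale price column (response variable)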
ames <- AmesHousing::make_ames()
dim(ames)
## [1] 2930 81
set.seed(123)
library(rsample)
split <- initial_split(ames, prop = 0.7,
                       strata = "Sale_Price")
ames_train <- training(split)
dim(ames_train)
## [1] 2049 81
ames_test <- testing(split)
dim(ames_test)
## [1] 881 81
# All information for the first house; Sale_Price is the third column from the end; the other features include both categorical and numeric variables
t(ames[1,])
# [,1]
# MS_SubClass "One_Story_1946_and_Newer_All_Styles"
# MS_Zoning "Residential_Low_Density"
# Lot_Frontage "141"
# Lot_Area "31770"
# Street "Pave"
# Alley "No_Alley_Access"
# Lot_Shape "Slightly_Irregular"
# Land_Contour "Lvl"
# Utilities "AllPub"
# Lot_Config "Corner"
# Land_Slope "Gtl"
# Neighborhood "North_Ames"
# Condition_1 "Norm"
# Condition_2 "Norm"
# Bldg_Type "OneFam"
# House_Style "One_Story"
# Overall_Qual "Above_Average"
# Overall_Cond "Average"
# Year_Built "1960"
# Year_Remod_Add "1960"
# Roof_Style "Hip"
# Roof_Matl "CompShg"
# Exterior_1st "BrkFace"
# Exterior_2nd "Plywood"
# Mas_Vnr_Type "Stone"
# Mas_Vnr_Area "112"
# Exter_Qual "Typical"
# Exter_Cond "Typical"
# Foundation "CBlock"
# Bsmt_Qual "Typical"
# Bsmt_Cond "Good"
# Bsmt_Exposure "Gd"
# BsmtFin_Type_1 "BLQ"
# BsmtFin_SF_1 "2"
# BsmtFin_Type_2 "Unf"
# BsmtFin_SF_2 "0"
# Bsmt_Unf_SF "441"
# Total_Bsmt_SF "1080"
# Heating "GasA"
# Heating_QC "Fair"
# Central_Air "Y"
# Electrical "SBrkr"
# First_Flr_SF "1656"
# Second_Flr_SF "0"
# Low_Qual_Fin_SF "0"
# Gr_Liv_Area "1656"
# Bsmt_Full_Bath "1"
# Bsmt_Half_Bath "0"
# Full_Bath "1"
# Half_Bath "0"
# Bedroom_AbvGr "3"
# Kitchen_AbvGr "1"
# Kitchen_Qual "Typical"
# TotRms_AbvGrd "7"
# Functional "Typ"
# Fireplaces "2"
# Fireplace_Qu "Good"
# Garage_Type "Attchd"
# Garage_Finish "Fin"
# Garage_Cars "2"
# Garage_Area "528"
# Garage_Qual "Typical"
# Garage_Cond "Typical"
# Paved_Drive "Partial_Pavement"
# Wood_Deck_SF "210"
# Open_Porch_SF "62"
# Enclosed_Porch "0"
# Three_season_porch "0"
# Screen_Porch "0"
# Pool_Area "0"
# Pool_QC "No_Pool"
# Fence "No_Fence"
# Misc_Feature "None"
# Misc_Val "0"
# Mo_Sold "5"
# Year_Sold "2010"
# Sale_Type "WD "
# Sale_Condition "Normal"
# Sale_Price "215000"
# Longitude "-93.61975"
# Latitude "42.05403"
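The stratified 70/30 split used above can be sketched in base R to show what stratifying on a numeric response means. This is only a rough illustration, assuming the numeric response is binned into quartiles before sampling (my understanding of how `rsample` handles a numeric `strata`); the `toy` data frame and its `price` column are made-up stand-ins for `ames` and `Sale_Price`.

```r
set.seed(123)

# Hypothetical toy data: a right-skewed response standing in for Sale_Price
price <- round(exp(rnorm(200, mean = 12, sd = 0.4)))
toy <- data.frame(id = seq_along(price), price = price)

# Bin the numeric response into quartiles (assumed stratification scheme)
bins <- cut(toy$price,
            breaks = quantile(toy$price, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Within each bin, draw 70% of the row indices for the training set
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), bins),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets should show a similar spread of the response
summary(toy_train$price)
summary(toy_test$price)
```

Sampling within bins keeps cheap and expensive houses represented in both sets, which a purely random split does not guarantee when the response is skewed.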
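2. Employee attrition data
- Records each employee's basic information, job-related information, and whether they left the company
- Suited to supervised binary classification algorithms in machine learning, for predicting whether an employee will leave
- Size: 1470 employees with 30 personal and job-related features, plus whether they eventually left (response variable)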
library(modeldata)
data(attrition)
# initial dimension
dim(attrition)
## [1] 1470 31
# All information for the first employee; the attrition label is in the second column (Attrition)
t(attrition[1,])
# 1
# Age "41"
# Attrition "Yes"
# BusinessTravel "Travel_Rarely"
# DailyRate "1102"
# Department "Sales"
# DistanceFromHome "1"
# Education "College"
# EducationField "Life_Sciences"
# EnvironmentSatisfaction "Medium"
# Gender "Female"
# HourlyRate "94"
# JobInvolvement "High"
# JobLevel "2"
# JobRole "Sales_Executive"
# JobSatisfaction "Very_High"
# MaritalStatus "Single"
# MonthlyIncome "5993"
# MonthlyRate "19479"
# NumCompaniesWorked "8"
# OverTime "Yes"
# PercentSalaryHike "11"
# PerformanceRating "Excellent"
# RelationshipSatisfaction "Low"
# StockOptionLevel "0"
# TotalWorkingYears "8"
# TrainingTimesLastYear "0"
# WorkLifeBalance "Bad"
# YearsAtCompany "6"
# YearsInCurrentRole "4"
# YearsSinceLastPromotion "0"
# YearsWithCurrManager "5"
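Before fitting a binary classifier to data like this, it is worth checking how balanced the two classes are. The sketch below uses a made-up label vector, `toy_attrition`, as a stand-in for `attrition$Attrition`; the 84/16 proportions are hypothetical and are not taken from the real data.

```r
# Hypothetical toy labels standing in for attrition$Attrition
toy_attrition <- factor(c(rep("No", 84), rep("Yes", 16)),
                        levels = c("No", "Yes"))

# Class counts and proportions
counts <- table(toy_attrition)
props <- prop.table(counts)
print(counts)
print(round(props, 2))

# Accuracy of a trivial classifier that always predicts the majority class;
# a useful baseline when the classes are imbalanced
baseline_acc <- max(props)
baseline_acc
```

On the real data the same check would be `prop.table(table(attrition$Attrition))`.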
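3. Handwritten digit data
- Pixel values of handwritten digit images (0-9), provided by AT&T Bell Laboratories
- Suited to supervised multiclass classification algorithms in machine learning, for predicting the digit shown in an image
- Size: the dataset comes already split into 60000 training samples and 10000 test samples; each image has 784 features (28*28)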
mnist <- dslabs::read_mnist()
names(mnist)
## [1] "train" "test"
dim(mnist$train$images)
## [1] 60000 784
# Values of the first 10 features for the first 6 samples in the training set
mnist$train$images[1:6,1:10]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 0 0 0 0 0 0 0 0 0
# [2,] 0 0 0 0 0 0 0 0 0 0
# [3,] 0 0 0 0 0 0 0 0 0 0
# [4,] 0 0 0 0 0 0 0 0 0 0
# [5,] 0 0 0 0 0 0 0 0 0 0
# [6,] 0 0 0 0 0 0 0 0 0 0
# Labels of the first 6 samples in the training set (the true handwritten digits)
head(mnist$train$labels, 6)
## [1] 5 0 4 1 9 2
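Each row of the image matrix is a flattened 28*28 picture, so a row can be reshaped back into a square matrix for inspection. The sketch below uses a made-up `pixels` vector as a stand-in for one row such as `mnist$train$images[1, ]`, and assumes the pixels are stored row by row (row-major order); that ordering is an assumption worth checking against the real data.

```r
img_side <- 28

# Hypothetical stand-in for one row of the image matrix: 784 pixel values
pixels <- integer(img_side * img_side)

# Draw a vertical bar in column 14, rows 5 to 24, assuming row-major order:
# pixel (r, c) sits at position (r - 1) * 28 + c in the flat vector
bar_rows <- 5:24
bar_col <- 14
pixels[(bar_rows - 1) * img_side + bar_col] <- 255L

# Reshape the flat vector back into a 28 x 28 image
img <- matrix(pixels, nrow = img_side, ncol = img_side, byrow = TRUE)
dim(img)

# The bar should now occupy exactly one column of the image
which(colSums(img) > 0)
```

With the real data, the reshaped matrix can be passed to `image()`; the rows usually need to be flipped first, because `image()` draws from the bottom up.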
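4. Supermarket customer basket data
- Records the contents of one shopping basket for each supermarket customer
- Suited to unsupervised clustering algorithms in machine learning, for grouping customers into meaningful segments
- Size: purchases of 42 items by 2000 customers (an item that was not bought is recorded as 0)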
url <- "https://koalaverse.github.io/homlr/data/my_basket.csv"
my_basket <- readr::read_csv(url)
dim(my_basket)
## [1] 2000 42
# Purchases of the first customer
t(my_basket[1,])
# [,1]
# 7up 0
# lasagna 0
# pepsi 0
# yop 0
# red.wine 0
# cheese 0
# bbq 0
# bulmers 0
# mayonnaise 0
# horlics 0
# chicken.tikka 0
# milk 0
# mars 2
# coke 0
# lottery 0
# bread 1
# pizza 0
# sunny.delight 0
# ham 1
# lettuce 0
# kronenbourg 0
# leeks 0
# fanta 0
# tea 0
# whiskey 0
# peas 0
# newspaper 2
# muesli 0
# white.wine 0
# carrots 0
# spinach 0
# pate 0
# instant.coffee 0
# twix 0
# potatoes 0
# fosters 0
# soup 0
# toad.in.hole 0
# coco.pops 0
# kitkat 1
# broccoli 0
# cigarettes 0
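To give a feel for clustering a customer-by-item count matrix like this, here is a minimal sketch with base R's `kmeans()`. The `toy_basket` matrix and its `item_a` to `item_d` columns are made up for illustration; k-means and the choice of two clusters are my own assumptions here, not a recommendation for the real `my_basket` data.

```r
set.seed(123)

# Hypothetical toy baskets: 60 customers x 4 items (counts, 0 = not bought).
# The first 30 customers mostly buy items a and b, the other 30 items c and d.
n <- 30
toy_basket <- rbind(
  cbind(item_a = rpois(n, 3),   item_b = rpois(n, 2),
        item_c = rpois(n, 0.2), item_d = rpois(n, 0.2)),
  cbind(item_a = rpois(n, 0.2), item_b = rpois(n, 0.2),
        item_c = rpois(n, 3),   item_d = rpois(n, 2))
)

# k-means with 2 clusters; nstart reruns from several random starts
km <- kmeans(toy_basket, centers = 2, nstart = 10)

# Cluster sizes and the average basket in each cluster
table(km$cluster)
round(km$centers, 2)
```

The cluster centers read as an average basket per segment, which is what makes the groups interpretable.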