Official description
step_upsample creates a specification of a recipe step that will replicate rows of a data set to make the occurrence of levels in a specific factor level equal.
Key parameters
over_ratio
A numeric value for the ratio of the majority-to-minority frequencies. The default value (1) means that all other levels are sampled up to have the same frequency as the most occurring level. A value of 0.5 would mean that the minority levels will have (at most) (approximately) half as many rows than the majority level.
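This sets the ratio. If it is left at the default, every level is sampled up to the same frequency as the most common level. I find it better to set it explicitly; a typical range is 0.3-0.7.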
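To make the arithmetic behind over_ratio concrete, here is a minimal base-R sketch. The helper `upsample_targets()` and the class counts other than 2211 are hypothetical and only illustrate the description above; this is not the themis implementation, and the rounding via `floor()` is an assumption.

```r
# Hypothetical helper: approximate row count per level after upsampling.
# Assumption: the target is floor(majority count * over_ratio), and levels
# that already have at least that many rows are left unchanged.
upsample_targets <- function(counts, over_ratio = 1) {
  target <- floor(max(counts) * over_ratio)
  pmax(counts, target)
}

# Illustrative counts; only the majority count 2211 comes from the example below
counts <- c(majority = 2211, medium = 1300, small = 500, rare = 300)

upsample_targets(counts, over_ratio = 1)       # every level raised to the majority count
upsample_targets(counts, over_ratio = 0.4523)  # small levels raised to about 1000
```

Levels that already exceed the target are never reduced, which is why the description says "at most".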
skip
A logical. Should the step be skipped when the recipe is baked by [bake()](https://recipes.tidymodels.org/reference/bake.html)? While all operations are baked when [prep()](https://recipes.tidymodels.org/reference/prep.html) is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
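For step_upsample this parameter defaults to TRUE, so with the default setting the step has no effect when a new data set is baked. Upsampling is not meant to be applied to the test set, so the step is skipped by default and only affects the training (modeling) set.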
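The effect of skip on a held-out set can be sketched as follows. This is only an illustration: the split, the seed, the value over_ratio = 0.5 and the names `hpc_split`, `hpc_train`, `hpc_test` and `rec` are assumptions made for this example.

```r
library(tidymodels)
library(themis)
data(hpc_data, package = "modeldata")

# Assumed split for illustration only
set.seed(123)
hpc_split <- initial_split(select(hpc_data, -protocol, -day), strata = class)
hpc_train <- training(hpc_split)
hpc_test <- testing(hpc_split)

rec <- recipe(class ~ ., data = hpc_train) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep()

# Training data stored in the recipe: minority levels are upsampled
count(bake(rec, new_data = NULL), class)

# Test data: the step is skipped, so the class counts are unchanged
count(bake(rec, new_data = hpc_test), class)
```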
# Original official reference: https://themis.tidymodels.org/reference/step_upsample.html
# 1. Upsampling analysis ----------------------------------------------------------------------
# Load packages and data
library(tidymodels)
library(modeldata)
library(themis) # step_upsample comes from this package
# Build the original data set
data(hpc_data)
hpc_data0 <- hpc_data %>%
  select(-protocol, -day)
# Check the distribution of the outcome variable in the original data
orig <- count(hpc_data0, class, name = "orig")
orig
# Upsample with step_upsample
up_rec <- recipe(class ~ ., data = hpc_data0) %>%
  # Bring the minority levels up to about 1000 each
  # 1000/2211 is approx 0.4523
  step_upsample(class, over_ratio = 0.4523) %>%
  prep()
# The step only takes effect on the training data stored in the prepped recipe
# (new_data = NULL), because skip defaults to TRUE in step_upsample
training <- up_rec %>%
  bake(new_data = NULL) %>%
  count(class, name = "training")
training
# The step has no effect on new data, because skip defaults to TRUE in step_upsample:
# it is skipped during bake()
baked <- up_rec %>%
  bake(new_data = hpc_data0) %>%
  count(class, name = "baked")
baked
# Compare the original counts with the results of the two bake() calls
orig %>%
  left_join(training, by = "class") %>%
  left_join(baked, by = "class")
# 2. Plots ------------------------------------------------------------------
library(ggplot2)
data(circle_example)
# Check the class distribution
circle_example %>%
  count(class) %>%
  mutate(prop = n / sum(n))
# First, scatter plot of the data without upsampling
ggplot(circle_example, aes(x, y, color = class)) +
  geom_point() +
  labs(title = "Without upsample")
# Upsample to obtain a new data set
updata <- recipe(class ~ x + y, data = circle_example) %>%
  step_upsample(class) %>%
  prep() %>%
  bake(new_data = NULL)
# Check the class distribution of the new data set
updata %>%
  count(class) %>%
  mutate(prop = n / sum(n))
# Plot the data after upsampling (jittered so that replicated rows are visible)
updata %>%
  ggplot(aes(x, y, color = class)) +
  geom_jitter(width = 0.1, height = 0.1) +
  labs(title = "With upsample (with jittering)")
Sample distribution without upsampling
Sample distribution after upsampling