Upsampling with the tidymodels packages: step_upsample()

Author: 灵活胖子的进步之路 | Published: 2022-08-03 10:32

Official description

step_upsample creates a specification of a recipe step that will replicate rows of a data set to make the occurrence of levels in a specific factor level equal.
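
As a rough illustration of what "replicating rows" means, here is a minimal base R sketch that upsamples a toy data frame by drawing extra rows with replacement. The data frame `toy` and the approach of keeping the original rows and appending resampled copies are assumptions made for illustration; this is not themis's internal code.

```r
# Sketch only: naive upsampling in base R, not how themis is implemented.
set.seed(123)  # the extra rows are drawn at random, so fix the seed

# Hypothetical toy data: 8 rows of class "a", 2 rows of class "b"
toy <- data.frame(
  x = seq_len(10),
  class = factor(rep(c("a", "b"), times = c(8, 2)))
)

# Size of the most frequent class
majority_n <- max(table(toy$class))

# For each class, keep its rows and append resampled copies up to majority_n
upsampled <- do.call(rbind, lapply(split(toy, toy$class), function(d) {
  extra <- majority_n - nrow(d)
  idx <- c(seq_len(nrow(d)), sample.int(nrow(d), extra, replace = TRUE))
  d[idx, , drop = FALSE]
}))

# Both classes now have the same number of rows
table(upsampled$class)
```

Every added row is an exact copy of an existing minority row, so no new x values appear in the data.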

Key parameters

over_ratio

A numeric value for the ratio of the majority-to-minority frequencies. The default value (1) means that all other levels are sampled up to have the same frequency as the most occurring level. A value of 0.5 would mean that the minority levels will have (at most) (approximately) half as many rows than the majority level.
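This sets the target ratio. If it is left unset, every level is sampled up to the same frequency as the most common level. In my experience it is better to set it explicitly; values between 0.3 and 0.7 are typical.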
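
To make the arithmetic concrete, here is a rough base R sketch of how over_ratio translates into per-class target counts. It assumes the target is floor(majority count * over_ratio) and that levels already above the target keep their count; this follows the documented behaviour but is not themis's internal code. The helper name `upsample_targets` is hypothetical, and the class counts other than 2211 are illustrative assumptions.

```r
# Hypothetical helper (not part of themis): expected class counts after upsampling
upsample_targets <- function(counts, over_ratio = 1) {
  # Assumed rule: minority levels are raised to floor(majority * over_ratio)
  target <- floor(max(counts) * over_ratio)
  # Levels already at or above the target keep their original count
  pmax(counts, target)
}

# Illustrative class counts; only the majority count 2211 comes from the example
counts <- c(VF = 2211, F = 1347, M = 514, L = 259)

upsample_targets(counts)                       # default: every level raised to 2211
upsample_targets(counts, over_ratio = 0.4523)  # M and L raised to 1000
```

With over_ratio = 0.4523 the target is floor(2211 * 0.4523) = 1000, so only the two smallest levels change.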

skip

A logical. Should the step be skipped when the recipe is baked by [bake()](https://recipes.tidymodels.org/reference/bake.html)? While all operations are baked when [prep()](https://recipes.tidymodels.org/reference/prep.html) is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
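This parameter defaults to TRUE, so with the default setting the step has no effect when a new data set is baked. The package authors do not recommend applying upsampling to the test set, so the step is skipped by default and only affects the training (modelling) data.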

# Original reference page: https://themis.tidymodels.org/reference/step_upsample.html


# 1. Upsampling example ----------------------------------------------------------------------

# Load packages and data
library(tidymodels)
library(modeldata)
library(themis) # step_upsample() comes from this package


# Build the original data set
data(hpc_data)
hpc_data0 <- hpc_data %>%
  select(-protocol, -day)

# Check the distribution of the outcome variable in the original data
orig <- count(hpc_data0, class, name = "orig")
orig

# Upsample with step_upsample()
up_rec <- recipe(class ~ ., data = hpc_data0) %>%
  # Bring the minority levels up to about 1000 each
  # 1000/2211 is approx 0.4523
  step_upsample(class, over_ratio = 0.4523) %>%
  prep()

# Upsampling is applied only to the training data stored in up_rec (new_data = NULL),
# because skip defaults to TRUE in step_upsample()
training <- up_rec %>%
  bake(new_data = NULL) %>%
  count(class, name = "training")
training


# Upsampling is not applied to new data, because skip defaults to TRUE in step_upsample();
# the step is skipped during bake()
baked <- up_rec %>%
  bake(new_data = hpc_data0) %>%
  count(class, name = "baked")
baked

# Compare the original counts with the two baked results
orig %>%
  left_join(training, by = "class") %>%
  left_join(baked, by = "class")


# 2. Plots ------------------------------------------------------------------

library(ggplot2)
data(circle_example)

# Check the class distribution
circle_example %>%
  count(class) %>%
  mutate(prop = n / sum(n))

# First, draw a scatter plot of the data without upsampling
ggplot(circle_example, aes(x, y, color = class)) +
  geom_point() +
  labs(title = "Without upsample")

# Upsample to obtain a new data set
updata <- recipe(class ~ x + y, data = circle_example) %>%
  step_upsample(class) %>%
  prep() %>%
  bake(new_data = NULL)

# Check the class distribution of the new data set
updata %>%
  count(class) %>%
  mutate(prop = n / sum(n))

# Plot the upsampled data; jittering separates the replicated points,
# which would otherwise overlap exactly
updata %>%
  ggplot(aes(x, y, color = class)) +
  geom_jitter(width = 0.1, height = 0.1) +
  labs(title = "With upsample (with jittering)")

Figure: sample distribution without upsampling

Figure: sample distribution after upsampling

Article link: https://www.haomeiwen.com/subject/kwpswrtx.html
