Upsampling with the tidymodels package

Author: 灵活胖子的进步之路 | Published: 2022-08-03 10:32

    Official description

    step_upsample creates a specification of a recipe step that will replicate rows of a data set to make the occurrence of levels in a specific factor level equal.
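
    As a rough illustration of what "replicating rows" means, the base R sketch below resamples minority-class rows with replacement until every level matches the largest one. The helper `upsample_rows()` and the toy data frame are hypothetical and are not part of themis; the package's actual implementation may differ in detail.

```r
set.seed(1)

# Toy data (made up for illustration): 8 rows of class "a", 2 rows of class "b"
toy <- data.frame(
  x = 1:10,
  class = factor(c(rep("a", 8), rep("b", 2)))
)

# Hypothetical helper, NOT the themis implementation:
# pad every class with resampled copies of its own rows
upsample_rows <- function(df, col) {
  target <- max(table(df[[col]]))          # size of the largest class
  pieces <- lapply(split(df, df[[col]]), function(g) {
    extra <- target - nrow(g)              # rows still missing for this class
    if (nrow(g) == 0 || extra <= 0) return(g)
    dup <- g[sample(nrow(g), extra, replace = TRUE), , drop = FALSE]
    rbind(g, dup)                          # original rows plus duplicates
  })
  do.call(rbind, pieces)
}

up <- upsample_rows(toy, "class")
table(up$class)   # both classes now have 8 rows
```

    Note that no new information is created: the added rows of class "b" are exact copies of the two original ones.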

    Key parameters

    over_ratio

    A numeric value for the ratio of the minority-to-majority frequencies. The default value (1) means that all other levels are sampled up to have the same frequency as the most occurring level. A value of 0.5 would mean that the minority levels will have (at most) (approximately) half as many rows as the majority level.
    This sets the target ratio. If it is left unset, every level is sampled up to the frequency of the most common level. In my experience it is better to set it explicitly; I usually pick a value between 0.3 and 0.7.
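
    The arithmetic behind over_ratio can be sketched in base R. The majority count 2211 comes from the example script in this post; the other counts and the helper `upsample_targets()` are assumptions made for illustration (run `count(hpc_data0, class)` for the real values), and the rounding used inside themis may differ.

```r
# Illustrative class counts (assumed; only 2211 is taken from the post)
class_counts <- c(VF = 2211, F = 1347, M = 514, L = 259)

# Hypothetical helper, not part of themis:
# each class is raised to at least over_ratio * (majority count)
upsample_targets <- function(counts, over_ratio = 1) {
  target <- floor(max(counts) * over_ratio)
  pmax(counts, target)   # classes already above the target are left alone
}

# 2211 * 0.4523 is about 1000, so M and L rise to 1000; VF and F are unchanged
upsample_targets(class_counts, over_ratio = 0.4523)

# Default ratio of 1: every class is raised to the majority count
upsample_targets(class_counts)
```

    Under these assumptions, a smaller over_ratio only touches classes that fall below the target, which is why levels already above it keep their original counts.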

    skip

    A logical. Should the step be skipped when the recipe is baked by [bake()](https://recipes.tidymodels.org/reference/bake.html)? While all operations are baked when [prep()](https://recipes.tidymodels.org/reference/prep.html) is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
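    This parameter defaults to TRUE, so with the default setting the step has no effect when bake() is applied to a new data set. Upsampling is not recommended for the test set, so the step is skipped by default and applied only to the training (modeling) set.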
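
    To see what the flag changes, the sketch below prepares two recipes that differ only in skip and compares the number of rows returned by bake() on new data. It assumes the recipes, themis and modeldata packages are installed; the names `hpc_small`, `rec_default` and `rec_noskip` are made up for this example. Setting skip = FALSE is shown only to illustrate the flag, not as a recommendation.

```r
library(recipes)
library(themis)
library(modeldata)

set.seed(123)
data(hpc_data)

# Keep a few columns so the example stays small (illustrative choice)
hpc_small <- hpc_data[, c("class", "compounds", "input_fields")]

# Default behaviour: skip = TRUE, the step is ignored for new data
rec_default <- recipe(class ~ ., data = hpc_small) %>%
  step_upsample(class, over_ratio = 0.4523) %>%
  prep()

# skip = FALSE: the step is also applied when baking new data
rec_noskip <- recipe(class ~ ., data = hpc_small) %>%
  step_upsample(class, over_ratio = 0.4523, skip = FALSE) %>%
  prep()

n_default <- nrow(bake(rec_default, new_data = hpc_small))
n_noskip  <- nrow(bake(rec_noskip,  new_data = hpc_small))

c(original = nrow(hpc_small), skip_true = n_default, skip_false = n_noskip)
```

    With the default, the row count of the new data should be unchanged, while skip = FALSE should return extra, duplicated rows.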

    # Original official reference: https://themis.tidymodels.org/reference/step_upsample.html
    
    
    # 1. Upsampling analysis ----------------------------------------------------------------------
    
    # Load packages and data
    library(tidymodels)
    library(modeldata)
    library(themis)  # step_upsample() comes from this package
    
    
    # Build the working data set
    data(hpc_data)
    hpc_data0 <- hpc_data %>%
      select(-protocol, -day)
    
    # Check the distribution of the outcome variable in the original data
    orig <- count(hpc_data0, class, name = "orig")
    orig
    
    # Upsample with step_upsample()
    up_rec <- recipe(class ~ ., data = hpc_data0) %>%
      # Bring the minority levels up to about 1000 each
      # 1000/2211 is approx 0.4523
      step_upsample(class, over_ratio = 0.4523) %>%
      prep()
    
    # Upsampling takes effect only on the training data stored in up_rec
    # (new_data = NULL), because skip defaults to TRUE in step_upsample()
    training <- up_rec %>%
      bake(new_data = NULL) %>%
      count(class, name = "training")
    training
    
    
    # On new data up_rec does not upsample, because skip defaults to TRUE
    # in step_upsample(); the step is skipped during bake()
    baked <- up_rec %>%
      bake(new_data = hpc_data0) %>%
      count(class, name = "baked")
    baked
    
    # Compare the original counts with the two baked results
    orig %>%
      left_join(training, by = "class") %>%
      left_join(baked, by = "class")
    
    
    # 2. Plotting ------------------------------------------------------------------
    
    library(ggplot2)
    data(circle_example)
    
    # Check the class distribution
    circle_example %>%
      count(class) %>%
      mutate(prop = n / sum(n))
    
    # First, draw the scatter plot without upsampling
    ggplot(circle_example, aes(x, y, color = class)) +
      geom_point() +
      labs(title = "Without upsample")
    
    # Upsample to obtain a new data set
    updata <- recipe(class ~ x + y, data = circle_example) %>%
      step_upsample(class) %>%
      prep() %>%
      bake(new_data = NULL)
    
    # Check the class distribution of the new data set
    updata %>%
      count(class) %>%
      mutate(prop = n / sum(n))
    
    # Plot the data after upsampling; jitter separates the duplicated points
    updata %>%
      ggplot(aes(x, y, color = class)) +
      geom_jitter(width = 0.1, height = 0.1) +
      labs(title = "With upsample (with jittering)")
    
    [Figures: sample distribution without upsampling; sample distribution after upsampling]


          Article link: https://www.haomeiwen.com/subject/kwpswrtx.html