Tidyverse Tips || Reordering with reorder

Author: 周运来就是我 | Published: 2020-11-10 00:58


 tidyverse_logo()
* __  _    __   .    o           *  . 
 / /_(_)__/ /_ ___  _____ _______ ___ 
/ __/ / _  / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
     *  . /___/      o      .       *

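I find the following commands very useful in the EDA (exploratory data analysis) part of any data science project. We will demonstrate them with the tidyverse package (only dplyr and ggplot2 are actually needed) and the iris dataset.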

select_if | rename_if

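select_if() belongs to dplyr and is handy when you need to select columns that satisfy some condition. We can also supply a function that is applied to the names of the selected columns.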

library(tidyverse)

iris%>%select_if(is.numeric,  list(~ paste0("numeric_", .)))%>%head()

  numeric_Sepal.Length numeric_Sepal.Width numeric_Petal.Length numeric_Petal.Width
1                  5.1                 3.5                  1.4                 0.2
2                  4.9                 3.0                  1.4                 0.2
3                  4.7                 3.2                  1.3                 0.2
4                  4.6                 3.1                  1.5                 0.2
5                  5.0                 3.6                  1.4                 0.2
6                  5.4                 3.9                  1.7                 0.4

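Note that rename_if can be used in the same way. Also note that rename_if(), rename_at() and rename_all() have been superseded by rename_with(), and the matching select variants have been superseded by the combination select() + rename_with().

These functions were replaced because mutate_if() and friends were superseded by across(). select_if() and rename_if() already use tidy selection, so they cannot be replaced by across(); a new function was needed instead.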
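The select() + rename_with() replacement mentioned above can be sketched as follows. This is a minimal example, assuming dplyr 1.0.0 or later; the variable name numeric_iris is only for illustration.

```r
library(dplyr)

# Keep only the numeric columns (tidy selection via where()),
# then prefix each remaining column name with "numeric_".
numeric_iris <- iris %>%
  select(where(is.numeric)) %>%
  rename_with(~ paste0("numeric_", .x))

head(numeric_iris)
```

This should give the same column names as the select_if() call shown earlier, while using only functions that are not superseded.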

everything

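In many data science projects we want one particular column (usually the dependent variable y) to be the first or the last column of the dataset. We can achieve this with everything(), which is available through dplyr.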

iris%>%select(Species, everything()) %>%head()
  Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa          5.1         3.5          1.4         0.2
2  setosa          4.9         3.0          1.4         0.2
3  setosa          4.7         3.2          1.3         0.2
4  setosa          4.6         3.1          1.5         0.2
5  setosa          5.0         3.6          1.4         0.2
6  setosa          5.4         3.9          1.7         0.4

mydataset <- iris%>%select(Species, everything())
mydataset%>%select(-Species, everything())%>%head() 

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

relocate

relocate() is a new function in dplyr 1.0.0. It moves columns to a new position, which you specify with the .before or .after argument.

iris%>%relocate(Petal.Width, .after=Sepal.Width)%>%head()
  Sepal.Length Sepal.Width Petal.Width Petal.Length Species
1          5.1         3.5         0.2          1.4  setosa
2          4.9         3.0         0.2          1.4  setosa
3          4.7         3.2         0.2          1.3  setosa
4          4.6         3.1         0.2          1.5  setosa
5          5.0         3.6         0.2          1.4  setosa
6          5.4         3.9         0.4          1.7  setosa

iris%>%relocate(Petal.Width, .after=last_col())%>%head()
  Sepal.Length Sepal.Width Petal.Length Species Petal.Width
1          5.1         3.5          1.4  setosa         0.2
2          4.9         3.0          1.4  setosa         0.2
3          4.7         3.2          1.3  setosa         0.2
4          4.6         3.1          1.5  setosa         0.2
5          5.0         3.6          1.4  setosa         0.2
6          5.4         3.9          1.7  setosa         0.4
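The examples above use only .after. As a small complementary sketch (assuming dplyr 1.0.0 or later; the names species_first and petal_first are made up for this illustration), .before works the same way and accepts tidy-select helpers:

```r
library(dplyr)

# Move Species in front of Sepal.Length, i.e. to the first position.
species_first <- iris %>% relocate(Species, .before = Sepal.Length)
names(species_first)

# Move both Petal columns in front of Sepal.Length using a selection helper.
petal_first <- iris %>% relocate(starts_with("Petal"), .before = Sepal.Length)
names(petal_first)
```

Columns that are not mentioned keep their relative order, so only the selected columns change position.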

pull

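When working with data frames we often select a single column, and sometimes we want the result as a plain vector rather than a one-column data frame. pull(), which is part of dplyr, does exactly this.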

setosa_sepal_length<-iris%>%filter(Species=='setosa')%>%select(Sepal.Length)%>%pull()
setosa_sepal_length

 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8
[32] 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
virginica_sepal_length<-iris%>%filter(Species=='virginica')%>%select(Sepal.Length)%>%pull()

t.test(setosa_sepal_length,virginica_sepal_length)

    Welch Two Sample t-test

data:  setosa_sepal_length and virginica_sepal_length
t = -15.386, df = 76.516, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.78676 -1.37724
sample estimates:
mean of x mean of y
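As a side note on pull(): the select() step in the code above is not strictly required, because pull() also accepts the column name directly. A minimal sketch (the variable names via_select and direct are only for this illustration):

```r
library(dplyr)

# Two-step form, as used above: select one column, then pull it out.
via_select <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length) %>%
  pull()

# One-step form: name the column in pull() itself.
direct <- iris %>%
  filter(Species == "setosa") %>%
  pull(Sepal.Length)

# Both forms return the same numeric vector.
identical(via_select, direct)
```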

reorder

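When using ggplot2 it can be frustrating to have to reorder the levels of a factor by some criterion. Suppose we want to show a boxplot of Sepal.Width by Species. The first plot below uses the default level order; the second uses reorder() to sort the species by their median Sepal.Width.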

iris%>%ggplot(aes(x=Species, y=Sepal.Width))+geom_boxplot() + theme_bw()

iris%>%ggplot(aes(x=reorder(Species,Sepal.Width, FUN = median), y=Sepal.Width))+geom_boxplot()+xlab("Species") + theme_bw()
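To see what reorder() does without drawing the plot, we can inspect the factor levels directly. This is a small sketch using base R only (reorder() lives in the stats package); the name species_by_median is just for illustration.

```r
# Default level order of the factor (alphabetical for iris).
levels(iris$Species)

# Levels reordered by the median Sepal.Width of each species, lowest first.
species_by_median <- reorder(iris$Species, iris$Sepal.Width, FUN = median)
levels(species_by_median)
```

ggplot2 draws discrete axes in level order, which is why wrapping the x aesthetic in reorder() changes the order of the boxes.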

across

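The tidyverse blog post on dplyr 1.0.0 (linked at the end) explains that, surprisingly, the key idea that makes across() work came out of the team's low-level work on the vctrs package, where they learned that a column of a data frame can itself be a data frame. The example below uses the mpg dataset from ggplot2.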


mpg %>% head()
# A tibble: 6 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class  
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compact
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compact
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compact
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compact
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compact
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compact


mpg %>%
    group_by(class) %>%
    summarise(
        across(
            c(cty, hwy),
            .fns = list(
                "mean"     = ~ mean(.x),
                "range lo" = ~ (mean(.x) - 2*sd(.x)),
                "range hi" = ~ (mean(.x) + 2*sd(.x))
            ),
            .names = "{.fn} {.col}"
        ),
        .groups = "drop"
    ) %>%
    rename_with(.fn = str_to_upper)


# A tibble: 7 x 7
  CLASS      `MEAN CTY` `RANGE LO CTY` `RANGE HI CTY` `MEAN HWY` `RANGE LO HWY` `RANGE HI HWY`
  <chr>           <dbl>          <dbl>          <dbl>      <dbl>          <dbl>          <dbl>
1 2seater          15.4          14.3            16.5       24.8           22.2           27.4
2 compact          20.1          13.4            26.9       28.3           20.7           35.9
3 midsize          18.8          14.9            22.6       27.3           23.0           31.6
4 minivan          15.8          12.2            19.5       22.4           18.2           26.5
5 pickup           13             8.91           17.1       16.9           12.3           21.4
6 subcompact       20.4          11.2            29.6       28.1           17.4           38.9
7 suv              13.5           8.66           18.3       18.1           12.2           24.1
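For comparison, here is a simpler use of across() on the iris dataset used in the earlier sections. It is only a sketch, assuming dplyr 1.0.0 or later; the result name species_means and the "mean_" prefix are choices made for this illustration.

```r
library(dplyr)

# Apply mean() to every numeric column within each species.
# .names controls the output column names; {.col} is the input column name.
species_means <- iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric), mean, .names = "mean_{.col}"),
    .groups = "drop"
  )

species_means
```

This covers the typical job of the superseded summarise_if(is.numeric, mean) with a single across() call.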

https://www.r-bloggers.com/2020/11/tidyverse-tips/
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/