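This section introduces three important dplyr functions: count, summarise, and group_by.
count
count() tallies the number of observations for each value of a variable. With sort = TRUE, the most frequent values come first.
library(tidyverse)
msleep %>%
  count(order, sort = TRUE)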
## order n
## <chr> <int>
## 1 Rodentia 22
## 2 Carnivora 12
## 3 Primates 12
## 4 Artiodactyla 6
## 5 Soricomorpha 5
You can also pass several variables to a single count() call; it then counts each combination of their values.
msleep %>%
  count(order, vore, sort = TRUE)
## order vore n
## <chr> <chr> <int>
## 1 Rodentia herbi 16
## 2 Carnivora carni 12
## 3 Primates omni 10
## 4 Artiodactyla herbi 5
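To make clear what count() does, here is a minimal sketch on a small made-up tibble (the `animals` data and its `hours` column are hypothetical, not taken from msleep). It shows that count() gives the same tallies as group_by() followed by summarise(n = n()), and that the `wt` argument sums a column per group instead of counting rows.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
animals <- tibble(
  order = c("Rodentia", "Rodentia", "Primates", "Primates", "Primates"),
  vore  = c("herbi", "herbi", "omni", "omni", "herbi"),
  hours = c(10, 12, 9, 11, 8)
)

# count() with two variables tallies each order/vore combination
by_count <- animals %>%
  count(order, vore, sort = TRUE)

# The same tallies written out by hand
by_hand <- animals %>%
  group_by(order, vore) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# wt = hours sums the hours column per group instead of counting rows
hours_by_order <- animals %>%
  count(order, wt = hours)
```

The `.groups = "drop"` argument assumes dplyr 1.0.0 or later.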
summarize
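dplyr's summarise() function (summarize() is an alias) computes summary statistics with concise, readable code. Without grouping, it collapses the whole data frame into a single row.
msleep %>%
  summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))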
## # A tibble: 1 x 3
## n average maximum
## <int> <dbl> <dbl>
## 1 83 10.4 19.9
group_by() makes summarise() compute the statistics for each group separately
msleep %>%
  group_by(vore) %>%
  summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))
## # A tibble: 5 x 4
## vore n average maximum
## <chr> <int> <dbl> <dbl>
## 1 carni 19 10.4 19.4
## 2 herbi 32 9.51 16.6
## 3 insecti 5 14.9 19.9
## 4 omni 20 10.9 18.0
## 5 <NA> 7 10.2 13.7
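The last row of the output shows that group_by() treats missing values in the grouping column as a group of their own. If that group is not wanted, one option is to filter it out before grouping. A minimal sketch on a made-up tibble (the `toy` data is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with missing values in the grouping column
toy <- tibble(
  vore        = c("carni", NA, "herbi", NA),
  sleep_total = c(10, 12, 4, 14)
)

# NA forms its own group and appears as the last row
with_na <- toy %>%
  group_by(vore) %>%
  summarise(average = mean(sleep_total))

# Dropping rows with a missing vore removes that group
without_na <- toy %>%
  filter(!is.na(vore)) %>%
  group_by(vore) %>%
  summarise(average = mean(sleep_total))
```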
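summarise() works with almost any aggregate function and also allows extra arithmetic on the results:
- n() - the number of observations
- n_distinct(var) - the number of unique values of var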
- sum(var), max(var), min(var), ...
- mean(var), median(var), sd(var), IQR(var)
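The functions in this list can be combined freely inside one summarise() call. The sketch below uses a small made-up tibble (`sleep_toy` and its values are hypothetical) so that each result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
sleep_toy <- tibble(
  vore        = c("carni", "carni", "herbi", "herbi", "herbi"),
  genus       = c("Panthera", "Panthera", "Bos", "Equus", "Equus"),
  sleep_total = c(13, 11, 4, 3, 5)
)

stats <- sleep_toy %>%
  group_by(vore) %>%
  summarise(
    n            = n(),                 # rows per group
    genera       = n_distinct(genus),   # unique genus values per group
    median_sleep = median(sleep_total),
    range_sleep  = max(sleep_total) - min(sleep_total)  # arithmetic on aggregates
  )
```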
Divide the mean of sleep_total by 24 to get the fraction of a day spent asleep
msleep %>%
  group_by(vore) %>%
  summarise(avg_sleep_day = mean(sleep_total) / 24)
## # A tibble: 5 x 2
## vore avg_sleep_day
## <chr> <dbl>
## 1 carni 0.432
## 2 herbi 0.396
## 3 insecti 0.622
## 4 omni 0.455
## 5 <NA> 0.424
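summarise_all() takes a function as its argument and applies it to every non-grouping column; the example computes the mean of each column. Character columns such as name and genus come back as NA (with a warning), because mean() is not defined for text.
msleep %>%
  group_by(vore) %>%
  summarise_all(mean, na.rm = TRUE)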
## # A tibble: 5 x 11
## vore name genus order conservation sleep_total sleep_rem sleep_cycle
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carni NA NA NA NA 10.4 2.29 0.373
## 2 herbi NA NA NA NA 9.51 1.37 0.418
## 3 insecti NA NA NA NA 14.9 3.52 0.161
To add 5 to the mean of each column, pass a formula-style anonymous function
msleep %>%
  group_by(vore) %>%
  summarise_all(~ mean(., na.rm = TRUE) + 5)
## vore name genus order conservation sleep_total sleep_rem sleep_cycle
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carni NA NA NA NA 15.4 7.29 5.37
## 2 herbi NA NA NA NA 14.5 6.37 5.42
## 3 insecti NA NA NA NA 19.9 8.52 5.16
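In dplyr 1.0.0 and later, the scoped verbs such as summarise_all() are superseded by across(). The sketch below, on a made-up tibble (`toy` is hypothetical), shows one way to restrict the summary to numeric columns so the character columns do not show up as NA.

```r
library(dplyr)

# Hypothetical toy data with one character and two numeric columns
toy <- tibble(
  vore        = c("carni", "carni", "herbi"),
  name        = c("Lion", "Tiger", "Cow"),
  sleep_total = c(13.5, 15.8, 4.0),
  sleep_rem   = c(NA, 2.0, 0.7)
)

# across() + where(is.numeric) summarises only the numeric columns;
# the character column `name` is dropped instead of returning NA
means <- toy %>%
  group_by(vore) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

The scoped verbs still work, so this is an alternative spelling rather than a required change.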
summarise_if()
Compute the mean of every numeric column; columns that fail the is.numeric test are dropped
msleep %>%
  group_by(vore) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
rename_if() renames the columns that satisfy a predicate; here every numeric column gets the prefix avg_
msleep %>%
  group_by(vore) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  rename_if(is.numeric, ~ paste0("avg_", .))
## vore avg_sleep_total avg_sleep_rem avg_sleep_cycle avg_awake
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 carni 10.4 2.29 0.373 13.6
## 2 herbi 9.51 1.37 0.418 14.5
## 3 insecti 14.9 3.52 0.161 9.06
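In dplyr 1.0.0 and later, rename_if() is superseded by rename_with(), which takes the renaming function first and the column selection second. A minimal sketch on a made-up tibble (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  vore        = c("carni", "carni", "herbi"),
  sleep_total = c(12, 14, 4),
  awake       = c(12, 10, 20)
)

renamed <- toy %>%
  group_by(vore) %>%
  summarise(across(where(is.numeric), mean)) %>%
  # Prefix every numeric column with "avg_"; vore is left alone
  rename_with(~ paste0("avg_", .x), where(is.numeric))
```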
summarise_at()
The code below computes the mean of every column whose name contains "sleep" and then renames those columns to avg_<var>
msleep %>%
  group_by(vore) %>%
  summarise_at(vars(contains("sleep")), mean, na.rm = TRUE) %>%
  rename_at(vars(contains("sleep")), ~ paste0("avg_", .))
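With across(), which supersedes summarise_at() in dplyr 1.0.0 and later, the selection, the summary and the renaming can be written in one step through the `.names` argument. A sketch on a made-up tibble (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  vore        = c("carni", "carni", "herbi"),
  sleep_total = c(12, 14, 4),
  sleep_rem   = c(2, NA, 1),
  awake       = c(12, 10, 20)
)

sleep_means <- toy %>%
  group_by(vore) %>%
  summarise(across(
    contains("sleep"),            # only columns with "sleep" in the name
    ~ mean(.x, na.rm = TRUE),
    .names = "avg_{.col}"         # {.col} is replaced by the original column name
  ))
```

The `awake` column does not match the selection, so it is absent from the result.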
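top_n()
Keep the 5 rows with the highest values. When no ordering column is given, top_n() uses the last column of the data (here average).
msleep %>%
  group_by(order) %>%
  summarise(average = mean(sleep_total)) %>%
  top_n(5)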
Keep the 5 rows with the lowest values by passing a negative number
msleep %>%
  group_by(order) %>%
  summarise(average = mean(sleep_total)) %>%
  top_n(-5)
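When the data has several columns, name the ordering column as the second argument. The example keeps the 5 highest values of average_sleep.
msleep %>%
  group_by(order) %>%
  summarise(average_sleep = mean(sleep_total), max_sleep = max(sleep_total)) %>%
  top_n(5, average_sleep)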
## order average_sleep max_sleep
## <chr> <dbl> <dbl>
## 1 Afrosoricida 15.6 15.6
## 2 Chiroptera 19.8 19.9
## 3 Cingulata 17.8 18.1
## 4 Didelphimorphia 18.7 19.4
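In dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(), which name the ordering column explicitly and return the rows already sorted. A minimal sketch on a made-up tibble (the `orders` data and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
orders <- tibble(
  order         = c("A", "B", "C", "D"),
  average_sleep = c(15.6, 19.8, 17.8, 8.6)
)

# The 2 rows with the highest average_sleep, sorted in descending order
top2 <- orders %>%
  slice_max(average_sleep, n = 2)

# The single row with the lowest average_sleep
bottom1 <- orders %>%
  slice_min(average_sleep, n = 1)
```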
sample_frac() randomly selects a fraction of the rows (here 10%)
msleep %>% sample_frac(.1)
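Because the sample is random, the selected rows change on every run. Setting a seed makes the result reproducible. The sketch below uses slice_sample(), which supersedes sample_frac() in dplyr 1.0.0 and later, on a made-up tibble (`df` and the seed value are arbitrary choices for illustration):

```r
library(dplyr)

# Fix the random seed so repeated runs select the same rows
set.seed(42)

# Hypothetical toy data: 50 rows with an id column
df <- tibble(id = 1:50)

# prop = 0.1 keeps 10% of the rows, sampled without replacement
sampled <- df %>%
  slice_sample(prop = 0.1)
```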