19. summarise(): progressive roll-ups and ungrouping

Author: 心惊梦醒 | Published: 2021-08-02 21:08

[Previous: 18. summarise(), part 3]
[Next: 20. Using mutate() and filter() with group_by()]

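Earlier we saw that group_by() lets us summarise by group, and that a group can be defined by one variable or by several. We also saw that the data frame returned by group_by() carries a Groups line marking it as grouped (to remove the grouping, use ungroup(); its usage is shown at the end of this post). Have you noticed how that Groups line changes before and after summarise() is applied to a grouped data frame? Here is an example: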

> daily <- group_by(flights, year, month, day)
> daily
# A tibble: 336,776 x 19
# After group_by() alone, the data are grouped by year, month, day
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
# Count the flights on each day of the year
> (per_day   <- summarise(daily, flights = n()))
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
# A tibble: 365 x 4
# After summarise(), the grouping is reduced to year, month
# Groups:   year, month [12]
    year month   day flights
   <int> <int> <int>   <int>
 1  2013     1     1     842
 2  2013     1     2     943
 3  2013     1     3     914
 4  2013     1     4     915
 5  2013     1     5     720
 6  2013     1     6     832
 7  2013     1     7     933
 8  2013     1     8     899
 9  2013     1     9     902
10  2013     1    10     932
# ... with 355 more rows

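You can see the difference: with its default settings, each call to summarise() peels off one level of the grouping (the last one). This makes it easy to roll a dataset up progressively. Let's summarise once more: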

# Summarise per_day again
> (per_month <- summarise(per_day, flights = sum(flights)))
`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
# A tibble: 12 x 3
# After the second summarise(), only year remains in Groups
# Groups:   year [1]
    year month flights
   <int> <int>   <int>
 1  2013     1   27004
 2  2013     2   24951
 3  2013     3   28834
 4  2013     4   28330
 5  2013     5   28796
 6  2013     6   28243
 7  2013     7   29425
 8  2013     8   29327
 9  2013     9   27574
10  2013    10   28889
11  2013    11   27268
12  2013    12   28135
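
The peeling behaviour can also be checked programmatically rather than by reading the Groups line. The following is a minimal sketch on a small made-up tibble (toy and its values are hypothetical, not the flights data), assuming dplyr 1.0.0 or later, where summarise() accepts the .groups argument mentioned in the message above:

```r
library(dplyr)

# Hypothetical toy data: one row per flight, two months with two days each
toy <- tibble(
  year  = 2013L,
  month = c(1L, 1L, 1L, 2L, 2L, 2L),
  day   = c(1L, 1L, 2L, 1L, 2L, 2L)
)

daily_toy <- group_by(toy, year, month, day)
group_vars(daily_toy)       # year, month, day

# "drop_last" is the default for this kind of summary; spelling it out
# silences the informational message
per_day_toy <- summarise(daily_toy, flights = n(), .groups = "drop_last")
group_vars(per_day_toy)     # year, month

per_month_toy <- summarise(per_day_toy, flights = sum(flights),
                           .groups = "drop_last")
group_vars(per_month_toy)   # year

# .groups = "drop" removes every grouping level in one step
flat <- summarise(daily_toy, flights = n(), .groups = "drop")
group_vars(flat)            # no grouping variables left
```

group_vars() returns the current grouping variables as a character vector, so it is a convenient way to confirm what a summarise() call left behind.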

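Be careful when rolling up summaries progressively: it is fine for sums and counts, but for means and variances you need to think about weighting, and for rank-based statistics such as the median it cannot be done exactly. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.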
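
A small illustration of this warning, using made-up numbers (the delays tibble and its columns g and x are hypothetical): sums roll up cleanly, means only roll up when weighted by group size, and medians do not roll up at all.

```r
library(dplyr)

# Hypothetical data: group "a" has four small values, group "b" one large value
delays <- tibble(
  g = c("a", "a", "a", "a", "b"),
  x = c(1, 2, 3, 4, 100)
)

by_g <- delays %>%
  group_by(g) %>%
  summarise(n = n(), total = sum(x), avg = mean(x), med = median(x))

# Sums: the sum of group sums equals the overall sum
sum(by_g$total)                  # 110
sum(delays$x)                    # 110

# Means: the plain mean of group means is wrong, the weighted mean is right
mean(by_g$avg)                   # 51.25
weighted.mean(by_g$avg, by_g$n)  # 22
mean(delays$x)                   # 22

# Medians: the median of group medians is not the overall median
median(by_g$med)                 # 51.25
median(delays$x)                 # 3
```

This is why it helps to carry a count column (n here) through every roll-up step: without it, the weights needed to recombine the means are lost.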

# Remove the grouping
daily %>% 
    ungroup() %>%            
    summarise(flights = n())
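
To see what ungroup() changes, here is a hedged sketch on a tiny made-up tibble (toy is hypothetical): the grouped summary returns one row per group, while the ungrouped summary collapses everything into a single row.

```r
library(dplyr)

# Hypothetical toy data: three flights on three distinct (month, day) pairs
toy <- tibble(
  month = c(1L, 1L, 2L),
  day   = c(1L, 2L, 1L)
)

grouped <- group_by(toy, month, day)

# Grouped: one output row per (month, day) combination
per_group <- summarise(grouped, flights = n(), .groups = "drop")
nrow(per_group)   # 3

# Ungrouped: summarise() sees the whole table as one group
total <- grouped %>%
  ungroup() %>%
  summarise(flights = n())
total$flights     # 3, in a single row
```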

This post covers the last part of the book's summarise() section. It is not widely applicable; it is enough to be aware of it.
Exercises
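1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
1) A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
2) A flight is always 10 minutes late.
3) A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
4) A flight is on time 99% of the time, and 2 hours late 1% of the time.
Which is more important: arrival delay or departure delay?
2. Come up with another approach that gives the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance), without using count().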

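# Same as count(dest): one row per destination, count in column n
not_cancelled %>% group_by(dest) %>% summarise(n = n())
# Same as count(tailnum, wt = distance): naming the column n matches count()'s output
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))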
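
The equivalence can be checked on a small stand-in table. This is only a sketch: nc_toy and its values are made up for illustration and are not the real not_cancelled data.

```r
library(dplyr)

# Hypothetical stand-in for not_cancelled
nc_toy <- tibble(
  dest     = c("IAH", "IAH", "MIA"),
  tailnum  = c("N1", "N2", "N1"),
  distance = c(1400, 1416, 1089)
)

# Unweighted count versus group_by() + n()
a <- nc_toy %>% count(dest)
b <- nc_toy %>% group_by(dest) %>% summarise(n = n())
same_counts <- identical(a$dest, b$dest) && identical(a$n, b$n)

# Weighted count versus group_by() + sum()
c1 <- nc_toy %>% count(tailnum, wt = distance)
c2 <- nc_toy %>% group_by(tailnum) %>% summarise(n = sum(distance))
same_weighted <- identical(c1$tailnum, c2$tailnum) && isTRUE(all.equal(c1$n, c2$n))

same_counts
same_weighted
```

With wt supplied, count() sums the weight column within each group instead of counting rows, which is why sum(distance) is the matching summary.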

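3. Our definition of cancelled flights, (is.na(dep_delay) | is.na(arr_delay)), is slightly suboptimal. Why? Which is the most important column? (I don't know.)
4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
6. What does the sort argument to count() do? When might you use it?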
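
For exercise 6, a quick experiment on a made-up tibble (toy is hypothetical) suggests the answer: sort = TRUE orders the result by n in descending order, which is handy when you want the most frequent values at the top.

```r
library(dplyr)

# Hypothetical toy data: destinations with different frequencies
toy <- tibble(dest = c("MIA", "IAH", "IAH", "ORD", "IAH", "ORD"))

# Default: rows are ordered by the grouping variable (IAH, MIA, ORD)
unsorted <- toy %>% count(dest)

# sort = TRUE: rows are ordered by n, largest first (IAH, ORD, MIA)
sorted <- toy %>% count(dest, sort = TRUE)

unsorted
sorted
```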
