Learning tidyverse - Data Transformation (3)

Author: DumplingLucky | Published 2021-05-10 23:24

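Visualization is an important tool, but the data first has to be in the right shape for it. Often you need to create new variables or summaries, rename variables, or reorder observations to make the data easier to visualize. This part mainly uses the dplyr package and covers the summarise() function.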

Summarising data with `summarise()`

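# Packages used throughout (assumed to have been loaded in the earlier parts of this series)
library(nycflights13)
library(tidyverse)

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))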
#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6

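summarise() is not terribly useful unless it is paired with group_by(). This changes the unit of analysis from the complete dataset to individual groups. For example, applying exactly the same code to a data frame grouped by date gives the average delay per date: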

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day delay
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.5 
#> 2  2013     1     2 13.9 
#> 3  2013     1     3 11.0 
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows
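
The effect of grouping is easiest to see on a tiny table. The sketch below uses a made-up tibble (`toy`, not part of nycflights13) and also tries the `.groups` argument mentioned in the message above; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: three flights on two days (not from nycflights13)
toy <- tibble(
  day       = c(1, 1, 2),
  dep_delay = c(10, NA, 30)
)

# Ungrouped: one row for the whole table
overall <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

# Grouped: one row per day; .groups = "drop" returns an ungrouped result
by_day_toy  <- group_by(toy, day)
per_day_toy <- summarise(by_day_toy,
  delay = mean(dep_delay, na.rm = TRUE),
  .groups = "drop"
)
```

Here `overall` has a single row, while `per_day_toy` has one row for each value of `day`.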

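Together, group_by() and summarise() provide one of the tools you will use most often with dplyr: grouped summaries. The pipe operator lets you take this further.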

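Combining multiple operations with the pipe

To explore the relationship between the distance and the average delay for each destination, you could write: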

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The pipe, %>%, avoids creating the unnecessary intermediate data frames in the code above:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
#> `summarise()` ungrouping output (override with `.groups` argument)
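
As a quick check that the two styles are interchangeable, the following sketch runs both on a small made-up tibble (`toy` and its values are illustrative only) and compares the results.

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
toy <- tibble(dest = c("A", "A", "B"), arr_delay = c(5, 15, 40))

# Step-by-step version with a named intermediate object
by_dest_toy <- group_by(toy, dest)
stepwise    <- summarise(by_dest_toy, delay = mean(arr_delay))

# Piped version: x %>% f(y) is evaluated as f(x, y)
piped <- toy %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay))

# Both approaches give the same table
all.equal(stepwise, piped)
```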

Missing values:
All aggregation functions have an na.rm argument, which removes missing values before computation:

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.5 
#> 2  2013     1     2 13.9 
#> 3  2013     1     3 11.0 
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows

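In this case, missing values represent cancelled flights, so we could also tackle the problem by first removing the cancelled flights. We save this dataset so it can be reused in the next few examples.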

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.4 
#> 2  2013     1     2 13.7 
#> 3  2013     1     3 10.9 
#> 4  2013     1     4  8.97
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows
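
To see why na.rm matters, here is a minimal base R sketch on a made-up vector: a single missing input makes the whole aggregate missing unless it is removed first.

```r
x <- c(2, NA, 10)       # made-up values with one missing entry

mean(x)                 # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE)   # 6: NA is dropped before averaging
mean(x[!is.na(x)])      # 6: same result after filtering out NA by hand
```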

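Counts:
Whenever you do any aggregation, it is a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you are not drawing conclusions from very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays: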

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

A scatterplot of the number of flights against the average delay gives more insight:

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

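When looking at this sort of plot, it is often useful to filter out the groups with the fewest observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.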

delays %>% 
  filter(n > 25) %>% 
  ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)

Here I use data from the Lahman package to compute the batting average (number of hits / number of at-bats) of every Major League Baseball player.

# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)

batters <- batting %>% 
  group_by(playerID) %>% 
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

batters %>% 
  filter(ab > 100) %>% 
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() + 
    geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

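Useful summary functions

Measures of location: we have used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it and 50% is below it. It is sometimes useful to combine aggregation with logical subsetting: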

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups:   year, month [12]
#>    year month   day avg_delay1 avg_delay2
#>   <int> <int> <int>      <dbl>      <dbl>
#> 1  2013     1     1      12.7        32.5
#> 2  2013     1     2      12.7        32.0
#> 3  2013     1     3       5.73       27.7
#> 4  2013     1     4      -1.93       28.3
#> 5  2013     1     5      -1.53       22.6
#> 6  2013     1     6       4.24       24.4
#> # … with 359 more rows
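
The difference between the mean and the median, and the effect of subsetting inside an aggregate, can be sketched in base R on a made-up vector of delays:

```r
delay <- c(-5, -2, 0, 3, 120)   # made-up arrival delays in minutes

mean(delay)               # 23.2, pulled up by the single 120-minute delay
median(delay)             # 0, the middle value
mean(delay[delay > 0])    # 61.5, the average of the positive delays only
```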

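Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and the median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.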

# Why is distance to some destinations more variable than to others?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#>   dest  distance_sd
#>   <chr>       <dbl>
#> 1 EGE         10.5 
#> 2 SAN         10.4 
#> 3 SFO         10.2 
#> 4 HNL         10.0 
#> 5 SEA          9.98
#> 6 LAS          9.91
#> # … with 98 more rows
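
What "robust" means here can be illustrated with two made-up vectors that differ in a single value. This is only a sketch; the numbers are chosen for illustration.

```r
x_clean   <- c(10, 12, 11, 13, 12)
x_outlier <- c(10, 12, 11, 13, 120)   # same values, one replaced by an outlier

# The standard deviation reacts strongly to the outlier
sd(x_clean)
sd(x_outlier)

# The interquartile range stays small
IQR(x_clean)
IQR(x_outlier)

# The median absolute deviation is unchanged for these values
mad(x_clean)
mad(x_outlier)
```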

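Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalization of the median. For example, quantile(x, 0.25) finds a value of x that is greater than 25% of the values and less than the remaining 75%.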

# When do the first and last flights leave each day?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups:   year, month [12]
#>    year month   day first  last
#>   <int> <int> <int> <int> <int>
#> 1  2013     1     1   517  2356
#> 2  2013     1     2    42  2354
#> 3  2013     1     3    32  2349
#> 4  2013     1     4    25  2358
#> 5  2013     1     5    14  2357
#> 6  2013     1     6    16  2355
#> # … with 359 more rows
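
A small base R sketch of these rank-based summaries on the integers 1 to 9 (an arbitrary example vector):

```r
x <- 1:9

min(x)              # 1
quantile(x, 0.25)   # 3, returned with the name "25%"
median(x)           # 5, the same value as quantile(x, 0.5)
max(x)              # 9
```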

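Measures of position: first(x), nth(x, 2), last(x). These work like x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist (for example, when you ask for the third element of a group that has only two). For example, we can find the first and last departure of each day: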

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 5
#> # Groups:   year, month [12]
#>    year month   day first_dep last_dep
#>   <int> <int> <int>     <int>    <int>
#> 1  2013     1     1       517     2356
#> 2  2013     1     2        42     2354
#> 3  2013     1     3        32     2349
#> 4  2013     1     4        25     2358
#> 5  2013     1     5        14     2357
#> 6  2013     1     6        16     2355
#> # … with 359 more rows
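
The default-value behaviour can be sketched on a made-up group with only two elements. The fallback value -1 is an arbitrary choice for illustration.

```r
library(dplyr)

x <- c(517, 533)          # a made-up group with only two departure times

first(x)                  # 517
last(x)                   # 533
x[3]                      # NA: base indexing past the end
nth(x, 3)                 # NA as well, since no default was supplied
nth(x, 3, default = -1)   # -1: the fallback chosen by the caller
```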

These functions are complementary to filtering on ranks. Filtering gives you all the variables, with each observation in a separate row:

not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r))
#> # A tibble: 770 x 20
#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1     2356           2359        -3      425            437
#> 3  2013     1     2       42           2359        43      518            442
#> 4  2013     1     2     2354           2359        -5      413            437
#> 5  2013     1     3       32           2359        33      504            442
#> 6  2013     1     3     2349           2359       -10      434            445
#> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>

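Counts: you have already seen n(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).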

# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 104 x 2
#>   dest  carriers
#>   <chr>    <int>
#> 1 ATL          7
#> 2 BOS          7
#> 3 CLT          7
#> 4 ORD          7
#> 5 TPA          7
#> 6 AUS          6
#> # … with 98 more rows
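
The three kinds of count can differ within the same group. The sketch below uses a made-up tibble (`toy`) with one missing carrier; note that n_distinct() treats NA as a value unless told otherwise.

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
toy <- tibble(
  dest    = c("A", "A", "A", "B"),
  carrier = c("UA", "UA", "DL", NA)
)

counts <- toy %>%
  group_by(dest) %>%
  summarise(
    n          = n(),                   # rows in the group
    n_known    = sum(!is.na(carrier)),  # rows with a non-missing carrier
    n_carriers = n_distinct(carrier),   # distinct values; NA counts as a value
    .groups    = "drop"
  )
```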

Counts are so useful that dplyr provides a simple helper if all you want is a count:

not_cancelled %>% 
  count(dest)
#> # A tibble: 104 x 2
#>   dest      n
#>   <chr> <int>
#> 1 ABQ     254
#> 2 ACK     264
#> 3 ALB     418
#> 4 ANC       8
#> 5 ATL   16837
#> 6 AUS    2411
#> # … with 98 more rows

You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:

not_cancelled %>% 
  count(tailnum, wt = distance)
#> # A tibble: 4,037 x 2
#>   tailnum      n
#>   <chr>    <dbl>
#> 1 D942DN    3418
#> 2 N0EGMQ  239143
#> 3 N10156  109664
#> 4 N102UW   25722
#> 5 N103US   24619
#> 6 N104UW   24616
#> # … with 4,031 more rows
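
A weighted count is the same thing as a grouped sum. The following sketch checks this on a made-up tibble (`toy`; tail numbers and distances are invented):

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 80)
)

# Weighted count: sums distance within each tailnum
via_count <- toy %>%
  count(tailnum, wt = distance)

# Equivalent grouped summary
via_summarise <- toy %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance), .groups = "drop")
```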

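Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.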

# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(n_early = sum(dep_time < 500))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day n_early
#>   <int> <int> <int>   <int>
#> 1  2013     1     1       0
#> 2  2013     1     2       3
#> 3  2013     1     3       4
#> 4  2013     1     4       3
#> 5  2013     1     5       3
#> 6  2013     1     6       2
#> # … with 359 more rows

# What proportion of flights are delayed by more than an hour?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(hour_prop = mean(arr_delay > 60))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day hour_prop
#>   <int> <int> <int>     <dbl>
#> 1  2013     1     1    0.0722
#> 2  2013     1     2    0.0851
#> 3  2013     1     3    0.0567
#> 4  2013     1     4    0.0396
#> 5  2013     1     5    0.0349
#> 6  2013     1     6    0.0470
#> # … with 359 more rows
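
The coercion of logical values is easy to verify in base R on a made-up vector of delays:

```r
arr_delay <- c(5, 75, -3, 130)   # made-up arrival delays in minutes

arr_delay > 60          # FALSE  TRUE FALSE  TRUE
sum(arr_delay > 60)     # 2 flights delayed by more than an hour
mean(arr_delay > 60)    # 0.5, i.e. half of the flights
```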

Grouping by multiple variables

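When you group by multiple variables, each summary peels off one level of the grouping. This makes it easy to progressively roll up a dataset: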

daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))
#> `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day flights
#>   <int> <int> <int>   <int>
#> 1  2013     1     1     842
#> 2  2013     1     2     943
#> 3  2013     1     3     914
#> 4  2013     1     4     915
#> 5  2013     1     5     720
#> 6  2013     1     6     832
#> # … with 359 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> `summarise()` regrouping output by 'year' (override with `.groups` argument)
#> # A tibble: 12 x 3
#> # Groups:   year [1]
#>    year month flights
#>   <int> <int>   <int>
#> 1  2013     1   27004
#> 2  2013     2   24951
#> 3  2013     3   28834
#> 4  2013     4   28330
#> 5  2013     5   28796
#> 6  2013     6   28243
#> # … with 6 more rows
(per_year  <- summarise(per_month, flights = sum(flights)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 1 x 2
#>    year flights
#>   <int>   <int>
#> 1  2013  336776

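Be careful when progressively rolling up summaries: it is fine for sums and counts, but you need to think about weighting for means and variances, and it is not possible to do this exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.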
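
A small sketch makes the warning concrete. It uses a made-up tibble (`toy`) with two groups of very different size; the values are chosen only to make the discrepancy obvious.

```r
library(dplyr)

# Hypothetical toy data: group "a" has four rows, group "b" has one
toy <- tibble(
  g = c("a", "a", "a", "a", "b"),
  x = c(1, 2, 3, 4, 100)
)

per_group <- toy %>%
  group_by(g) %>%
  summarise(
    n     = n(),
    total = sum(x),
    avg   = mean(x),
    med   = median(x),
    .groups = "drop"
  )

sum(per_group$total) == sum(toy$x)          # TRUE: sums roll up exactly
mean(per_group$avg)                         # 51.25, not the overall mean
weighted.mean(per_group$avg, per_group$n)   # 22, equals mean(toy$x)
median(per_group$med)                       # 51.25, not the overall median
median(toy$x)                               # 3
```

Weighting the group means by group size recovers the overall mean, but no such weighting recovers the overall median from the group medians alone.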

Ungrouping

If you need to remove grouping and return to operations on ungrouped data, use ungroup().

daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights
#> # A tibble: 1 x 1
#>   flights
#>     <int>
#> 1  336776
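
To inspect what ungroup() changes, one option is dplyr's group_vars(), shown here on a made-up tibble (`toy`):

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
toy <- tibble(day = c(1, 1, 2), flight = c("x", "y", "z"))

grouped <- group_by(toy, day)

group_vars(grouped)            # "day"
group_vars(ungroup(grouped))   # character(0): no grouping left

per_day_n <- summarise(grouped, flights = n())            # one row per day
total_n   <- summarise(ungroup(grouped), flights = n())   # a single row
```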

Grouped mutates (and filters)

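Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter(). For example, to find the worst members of each group:

# flights_sml is not defined in this part; this definition follows the R4DS chapter cited below
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

flights_sml %>% 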
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 x 7
#> # Groups:   year, month, day [365]
#>    year month   day dep_delay arr_delay distance air_time
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  2013     1     1       853       851      184       41
#> 2  2013     1     1       290       338     1134      213
#> 3  2013     1     1       260       263      266       46
#> 4  2013     1     1       157       174      213       60
#> 5  2013     1     1       216       222      708      121
#> 6  2013     1     1       255       250      589      115
#> # … with 3,300 more rows

Find all groups bigger than a threshold:

popular_dests <- flights %>% 
  group_by(dest) %>% 
  filter(n() > 365)
popular_dests
#> # A tibble: 332,577 x 19
#> # Groups:   dest [77]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 332,571 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
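
Inside a grouped filter(), n() is evaluated once per group, so whole groups are kept or dropped together. A minimal sketch on a made-up tibble (`toy`; the threshold 2 is arbitrary):

```r
library(dplyr)

# Hypothetical toy data: destination "A" has two rows, "B" has three
toy <- tibble(
  dest      = c("A", "A", "B", "B", "B"),
  arr_delay = c(10, 30, 5, 15, 20)
)

# Keep only the rows that belong to groups with more than two rows
big_groups <- toy %>%
  group_by(dest) %>%
  filter(n() > 2) %>%
  ungroup()
```

All three rows for "B" survive, and both rows for "A" are dropped.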

Standardize to compute per-group metrics:

popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 x 6
#> # Groups:   dest [77]
#>    year month   day dest  arr_delay prop_delay
#>   <int> <int> <int> <chr>     <dbl>      <dbl>
#> 1  2013     1     1 IAH          11  0.000111 
#> 2  2013     1     1 IAH          20  0.000201 
#> 3  2013     1     1 MIA          33  0.000235 
#> 4  2013     1     1 ORD          12  0.0000424
#> 5  2013     1     1 FLL          19  0.0000938
#> 6  2013     1     1 ORD           8  0.0000283
#> # … with 131,100 more rows
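
In a grouped mutate(), sum() is computed within each group, so the resulting proportions add up to 1 per destination rather than over the whole table. A sketch on a made-up tibble (`toy`) that checks this:

```r
library(dplyr)

# Hypothetical toy data (not from nycflights13)
toy <- tibble(
  dest      = c("A", "A", "B", "B", "B"),
  arr_delay = c(10, 30, 5, 15, 20)
)

props <- toy %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%  # sum() is per group
  ungroup()

# Each destination's proportions sum to 1
check <- props %>%
  group_by(dest) %>%
  summarise(total = sum(prop_delay), .groups = "drop")
```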

Reference: https://r4ds.had.co.nz/transform.html
