Learning tidyverse - Data Transformation (1)

Author: DumplingLucky | Published: 2022-02-07 13:05


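Visualization is an important tool, but the data first has to be in the right shape to visualize. Often you need to create new variables or summaries, rename variables, or reorder observations to make the data easier to work with.
This post uses the dplyr package and covers three of its functions: filter(), arrange(), and select().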

1. Prerequisites

library(nycflights13)
library(tidyverse)

2. nycflights13

We use nycflights13::flights to explore the basic data manipulation verbs of dplyr. This data frame contains all 336,776 flights that departed from New York City in 2013.

flights
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3. dplyr basics

(1) Filter rows with `filter()`

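filter() subsets observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions used to filter the data. For example, we can select all flights on January 1st: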

filter(flights, month == 1, day == 1)
#> # A tibble: 842 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter() never modifies its input, so to save the result you need the assignment operator, <-:

jan1 <- filter(flights, month == 1, day == 1)

R either prints the result or saves it to a variable. To do both, wrap the assignment in parentheses:

(dec25 <- filter(flights, month == 12, day == 25))
#> # A tibble: 719 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013    12    25      456            500        -4      649            651
#> 2  2013    12    25      524            515         9      805            814
#> 3  2013    12    25      542            540         2      832            850
#> 4  2013    12    25      546            550        -4     1022           1027
#> 5  2013    12    25      556            600        -4      730            745
#> 6  2013    12    25      557            600        -3      743            752
#> # … with 713 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Comparisons
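To filter effectively, you need to know how to use the comparison operators. R provides the standard set: >, >=, <, <=, != (not equal), and == (equal). Be careful when using == with floating-point numbers: computers use finite-precision arithmetic, so comparisons that look as if they should be TRUE can come out FALSE: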

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

In these cases, use near() instead of ==, which compares the two numbers within a small tolerance:

near(sqrt(2) ^ 2,  2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
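As a minimal sketch of what near() does, the snippet below compares an exact test with a tolerance-based one. The value 0.1 + 0.2 is just an illustrative choice, and the manual check assumes near() keeps its documented default tolerance of `.Machine$double.eps^0.5`.

```r
library(dplyr)

x <- 0.1 + 0.2

# Exact comparison: 0.1 + 0.2 is not stored as exactly 0.3
exact <- x == 0.3

# Tolerance-based comparison
approx <- near(x, 0.3)

# Roughly what near() does with its default tolerance
tol <- .Machine$double.eps^0.5
manual <- abs(x - 0.3) < tol

c(exact = exact, approx = approx, manual = manual)
```

If a looser or stricter comparison is needed, near() also accepts a `tol` argument.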

Logical operators
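Multiple arguments to filter() are combined with "and": every expression must be TRUE for a row to be included in the output. For other combinations, use the Boolean operators yourself: & is "and", | is "or", and ! is "not".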

The following code finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

x %in% y selects every row where x is one of the values in y. We can use it to rewrite the code above:

nov_dec <- filter(flights, month %in% c(11, 12))
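The equivalence between the `|` form and the `%in%` form can be checked on a plain vector in base R. The `month` vector below is made up for illustration.

```r
# A small made-up vector of months
month <- c(1, 11, 12, 6)

by_or <- month == 11 | month == 12
by_in <- month %in% c(11, 12)

identical(by_or, by_in)

# The two forms differ on missing values:
# == propagates NA, while %in% returns FALSE
na_or <- NA == 11 | NA == 12
na_in <- NA %in% c(11, 12)
```

Inside filter() this difference does not change the result, because rows where the condition is NA are dropped just like rows where it is FALSE.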

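To find flights that were not delayed (on arrival or departure) by more than two hours, either of the following two filters works. By De Morgan's law, !(x | y) is equivalent to !x & !y: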

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
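To see that the two filters really select the same rows, here is a small sketch on a hypothetical `delays` tibble (the values are invented, including one missing arrival delay).

```r
library(dplyr)

# Hypothetical delays in minutes; row 4 has a missing arrival delay
delays <- tibble(
  id        = 1:5,
  arr_delay = c(10, 130, 50, NA, 120),
  dep_delay = c(5, 20, 150, 200, 120)
)

# Negated "or"
a <- filter(delays, !(arr_delay > 120 | dep_delay > 120))

# Equivalent "and" of the negated conditions
b <- filter(delays, arr_delay <= 120, dep_delay <= 120)

a$id
identical(a$id, b$id)
```

Both versions keep rows 1 and 5 only; row 4 is excluded either way because its departure delay exceeds 120 minutes.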

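Missing values
Missing values (NA) are "contagious": almost any operation involving an unknown value is itself unknown. Use is.na() to test whether a value is missing: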

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA
NA / 2
#> [1] NA
NA == NA
#> [1] NA
x <- NA
is.na(x)
#> [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. To keep missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1    NA
#> 2     3
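One way to understand why the NA row disappears is to look at the logical vector the condition produces. The sketch below reuses the same `df` and adds a few helper variables (`cond`, `kept_default`, `kept_with_na`, `no_missing`) that are names introduced here for illustration.

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

# The condition evaluates to FALSE, NA, TRUE
cond <- df$x > 1

# Only the TRUE row survives by default
kept_default <- filter(df, x > 1)

# Explicitly asking for missing values keeps the NA row too
kept_with_na <- filter(df, is.na(x) | x > 1)

# The opposite request: drop rows where x is missing
no_missing <- filter(df, !is.na(x))
```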

(2) Arrange rows with `arrange()`

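arrange() works similarly to filter(), except that instead of selecting rows it changes their order. It takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.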

arrange(flights, year, month, day)
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
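Because flights is already stored in date order, the effect of sorting by several columns is hard to see in the output above. A small sketch on a made-up `trips` tibble shows how later columns break ties in earlier ones.

```r
library(dplyr)

# Made-up data, deliberately out of order
trips <- tibble(
  month = c(2, 1, 2, 1),
  day   = c(3, 9, 1, 4),
  id    = c("a", "b", "c", "d")
)

# Sort by month first, then by day within each month
sorted <- arrange(trips, month, day)
sorted$id

# Columns can be sorted in different directions
mixed <- arrange(trips, month, desc(day))
mixed$id
```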

Use desc() to sort by a column in descending order:

arrange(flights, desc(dep_delay))
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     9      641            900      1301     1242           1530
#> 2  2013     6    15     1432           1935      1137     1607           2120
#> 3  2013     1    10     1121           1635      1126     1239           1810
#> 4  2013     9    20     1139           1845      1014     1457           2210
#> 5  2013     7    22      845           1600      1005     1044           1815
#> 6  2013     4    10     1100           1900       960     1342           2211
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values are always sorted to the end, whether the order is ascending or descending:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     2
#> 2     5
#> 3    NA
arrange(df, desc(x))
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     5
#> 2     2
#> 3    NA
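If the missing values are wanted at the top instead, one possible workaround is to sort on is.na() first. This is a sketch of that idea rather than a dedicated arrange() option.

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

# is.na(x) is TRUE for missing rows; desc() puts TRUE before FALSE,
# and the second key sorts the remaining values in ascending order
na_first <- arrange(df, desc(is.na(x)), x)
na_first$x
```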

(3) Select columns with `select()`

select() lets you quickly pull out a useful subset of columns using operations based on the variable names.

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#>   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#>      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
#> 1      517            515         2      830            819        11 UA     
#> 2      533            529         4      850            830        20 UA     
#> 3      542            540         2      923            850        33 AA     
#> 4      544            545        -1     1004           1022       -18 B6     
#> 5      554            600        -6      812            837       -25 DL     
#> 6      554            558        -4      740            728        12 UA     
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

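Several helper functions can be used within select():
starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables whose names match a regular expression. This one matches any variable whose name contains a repeated character.
num_range("x", 1:3): matches x1, x2, and x3.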
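The helpers listed above can be tried on a toy tibble. The column names below are invented so that each helper has something to match, and names() is used only to show which columns were selected.

```r
library(dplyr)

# A one-row tibble with hypothetical column names
df <- tibble(abc_id = 1, size_xyz = 2, x1 = 3, x2 = 4, x3 = 5, address = "a")

by_prefix   <- names(select(df, starts_with("abc")))
by_suffix   <- names(select(df, ends_with("xyz")))
by_contains <- names(select(df, contains("dd")))

# "address" is the only name with a repeated character ("dd", "ss")
by_regex <- names(select(df, matches("(.)\\1")))

by_range <- names(select(df, num_range("x", 1:3)))
```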

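select() can rename variables, but it drops every variable not explicitly mentioned. Use rename() instead, which changes column names and keeps all other variables: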

rename(flights, tail_num = tailnum)
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
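The difference between renaming with rename() and renaming inside select() is easiest to see on a small example. The `planes` tibble and its values below are hypothetical.

```r
library(dplyr)

# Hypothetical two-column tibble
planes <- tibble(tailnum = c("N001", "N002"), seats = c(150, 180))

# rename() keeps every column and changes only the name given
renamed <- rename(planes, tail_num = tailnum)
names(renamed)

# select() with new = old renames too, but drops unmentioned columns
picked <- select(planes, tail_num = tailnum)
names(picked)

# Neither call modifies the original object
names(planes)
```

As with filter(), the result has to be assigned to a variable if the renamed data frame is needed later.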

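Another option is to use select() together with the everything() helper. This is useful when you want to move a handful of variables to the start of the data frame: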

select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 x 19
#>   time_hour           air_time  year month   day dep_time sched_dep_time
#>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
#> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
#> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
#> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
#> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
#> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
#> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>
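For reordering columns, newer versions of dplyr also offer relocate(). The sketch below assumes dplyr 1.0.0 or later is installed and compares it with the select() plus everything() idiom on a made-up tibble.

```r
library(dplyr)

# Made-up one-row tibble
df <- tibble(a = 1, b = 2, c = 3, d = 4)

# Move c and b to the front, keep everything else after them
with_select <- select(df, c, b, everything())

# relocate() moves the named columns to the front by default
with_relocate <- relocate(df, c, b)

names(with_select)
identical(names(with_select), names(with_relocate))
```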

Reference: https://r4ds.had.co.nz/transform.html
