美文网首页
08数据整理与转换:tidyr和dplyr应用

08数据整理与转换:tidyr和dplyr应用

作者: Jachin111 | 来源:发表于2020-11-14 12:21 被阅读0次

tidy data规则
tidyverse 生态链中最常用的数据类型 tibble,tibble 默认每行为一个观测值,每列为一个变量,而 tidyverse 的操作都基于 tibble,一个整齐的数据集(tidy dataset)需要满足以下三个规则:
每个变量有单独的列。
每个观测值有单独的行。
每个值有单独的单元格。

不整齐的数据集(untidy dataset)可能遭遇了以下问题:
一个变量被分散到了多列。
一个观测值被分散到了多行。

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
√ ggplot2 3.3.1     √ purrr   0.3.4
√ tibble  3.0.1     √ dplyr   0.8.5
√ tidyr   1.1.0     √ stringr 1.4.0
√ readr   1.3.1     √ forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Warning messages:
1: 程辑包‘tibble’是用R版本3.6.3 来建造的 
2: 程辑包‘tidyr’是用R版本3.6.3 来建造的 
3: 程辑包‘purrr’是用R版本3.6.3 来建造的 
4: 程辑包‘dplyr’是用R版本3.6.3 来建造的 
5: 程辑包‘forcats’是用R版本3.6.3 来建造的 
> tidytable <- as.tibble(tidytable)
Warning messages:
1: `as.tibble()` is deprecated as of tibble 2.0.0.
Please use `as_tibble()` instead.
The signature and semantics have changed, see `?as_tibble`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
2: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
> tidytable <- readLines("http://labfile.oss.aliyuncs.com/courses/982/tidytable.csv")
> tidytable <- unlist(strsplit(tidytable, split=","))
> tidytable <- matrix(tidytable, ncol=4, byrow=T)
> tidytable <- as_tibble(tidytable)
> colnames(tidytable) <- tidytable[1,]
> tidytable <- tidytable[-1,]
> tidytable$year <- as.integer(tidytable$year)
> tidytable$EF <- as.double(tidytable$EF)
> tidytable$GDP <- as.double(tidytable$GDP)
> tidytable
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 

生态足迹(EF, Ecology Footprint)衡量一国维持当前生活水平所需要的自然资源,EF 越小说明该地区自然环境所承受的压力越小。国内生产总值(GDP, Gross Domestic Product)衡量一国的发展水平,GDP 越大说明该地区(从总量上来说)越富有。

gather函数
gather 函数用于处理一个变量被分散到多列的情形

> eg1 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg1.csv")
> eg1 <- unlist(strsplit(eg1, split=","))
> eg1 <- matrix(eg1, ncol=3, byrow=T)
> eg1 <- as_tibble(eg1)
> colnames(eg1) <- eg1[1,]
> eg1 <- eg1[-1,]
> eg1$`2009` <- as.double(eg1$`2009`)
> eg1$`2013` <- as.double(eg1$`2013`)
> eg1
# A tibble: 4 x 3
  country     `2009` `2013`
  <chr>        <dbl>  <dbl>
1 Afghanistan   1.47   1.97
2 Brazil      205    241   
3 China       551    777   
4 Denmark      31.5   32.2 
> eg1 %>% gather(key=year, value=GDP, `2009`:`2013`)
# A tibble: 8 x 3
  country     year     GDP
  <chr>       <chr>  <dbl>
1 Afghanistan 2009    1.47
2 Brazil      2009  205   
3 China       2009  551   
4 Denmark     2009   31.5 
5 Afghanistan 2013    1.97
6 Brazil      2013  241   
7 China       2013  777   
8 Denmark     2013   32.2 

> gather(eg1, key=year, value=GDP, `2009`:`2013`)
# A tibble: 8 x 3
  country     year     GDP
  <chr>       <chr>  <dbl>
1 Afghanistan 2009    1.47
2 Brazil      2009  205   
3 China       2009  551   
4 Denmark     2009   31.5 
5 Afghanistan 2013    1.97
6 Brazil      2013  241   
7 China       2013  777   
8 Denmark     2013   32.2 

spread函数
于处理一个观测值被分散到多行的情形

> eg2 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg2.csv")
> eg2 <- unlist(strsplit(eg2, split=","))
> eg2 <- matrix(eg2, ncol=4, byrow=T)
> eg2 <- as_tibble(eg2)
> colnames(eg2) <- eg2[1,]
> eg2 <- eg2[-1,]
> eg2$year <- as.integer(eg2$year)
> eg2$count <- as.double(eg2$count)
> eg2
# A tibble: 16 x 4
   country      year type    count
   <chr>       <int> <chr>   <dbl>
 1 Afghanistan  2009 GDP      1.47
 2 Afghanistan  2009 EF      20.9 
 3 Afghanistan  2013 GDP      1.97
 4 Afghanistan  2013 EF      27.3 
 5 Brazil       2009 GDP    205   
 6 Brazil       2009 EF     532   
 7 Brazil       2013 GDP    241   
 8 Brazil       2013 EF     616   
 9 China        2009 GDP    551   
10 China        2009 EF    4253   
11 China        2013 GDP    777   
12 China        2013 EF    5010   
13 Denmark      2009 GDP     31.5 
14 Denmark      2009 EF      38.1 
15 Denmark      2013 GDP     32.2 
16 Denmark      2013 EF      34.4 
> eg2 %>% spread(key=type, value=count)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 
> spread(eg2, key=type, value=count)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 
> eg2 %>% spread(type, count)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 

separate函数

> eg3 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg3.csv")
> eg3 <- unlist(strsplit(eg3, split=","))
> eg3 <- matrix(eg3, ncol=3, byrow=T)
> eg3 <- as.tibble(eg3)
> colnames(eg3) <- eg3[1,]
> eg3 <- eg3[-1,]
> eg3$year <- as.integer(eg3$year)
> eg3
# A tibble: 8 x 3
  country      year rate     
  <chr>       <int> <chr>    
1 Afghanistan  2009 20.9/1.47
2 Afghanistan  2013 27.3/1.97
3 Brazil       2009 532./205 
4 Brazil       2013 616/241  
5 China        2009 4253/551 
6 China        2013 5010/777 
7 Denmark      2009 38.1/31.5
8 Denmark      2013 34.4/32.2
> eg3 %>% separate(col=rate, into=c("EF", "GDP"), sep="/", convert=T)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 
> separate(eg3, col=rate, into=c("EF", "GDP"), sep="/", convert=T)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 
> eg3 %>% separate(rate, c("EF", "GDP"), "/", convert=T)
# A tibble: 8 x 4
  country      year     EF    GDP
  <chr>       <int>  <dbl>  <dbl>
1 Afghanistan  2009   20.9   1.47
2 Afghanistan  2013   27.3   1.97
3 Brazil       2009  532   205   
4 Brazil       2013  616   241   
5 China        2009 4253   551   
6 China        2013 5010   777   
7 Denmark      2009   38.1  31.5 
8 Denmark      2013   34.4  32.2 

unite函数
unite 函数是 separate 函数的逆过程

> tidytable %>% unite("rate", c("EF", "GDP"), sep="/")
# A tibble: 8 x 3
  country      year rate     
  <chr>       <int> <chr>    
1 Afghanistan  2009 20.9/1.47
2 Afghanistan  2013 27.3/1.97
3 Brazil       2009 532/205  
4 Brazil       2013 616/241  
5 China        2009 4253/551 
6 China        2013 5010/777 
7 Denmark      2009 38.1/31.5
8 Denmark      2013 34.4/32.2
> eg3 %>% separate(rate, c("EF", "GDP"), sep="/") %>% unite("rate", c("EF", "GDP"), sep="/")
# A tibble: 8 x 3
  country      year rate     
  <chr>       <int> <chr>    
1 Afghanistan  2009 20.9/1.47
2 Afghanistan  2013 27.3/1.97
3 Brazil       2009 532./205 
4 Brazil       2013 616/241  
5 China        2009 4253/551 
6 China        2013 5010/777 
7 Denmark      2009 38.1/31.5
8 Denmark      2013 34.4/32.2

数据转换

> da <- readLines("http://labfile.oss.aliyuncs.com/courses/982/GDP_EF.csv")
> da <- unlist(strsplit(da, split=","))
> da <- c(da[1:463], da[465:length(da)])
> da <- matrix(da, ncol=6, byrow=T)
> da <- as_tibble(da)
> colnames(da) <- da[1,]
> da <- da[-1,]
> da$EF2013 <- as.double(da$EF2013)
> da$EF2009 <- as.double(da$EF2009)
> da$GDP2013 <- as.double(da$GDP2013)
> da$GDP2009 <- as.double(da$GDP2009)
> da
# A tibble: 153 x 6
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 143 more rows

行子集
filter 函数 filter() 是最常用的行筛选函数

> da %>% filter(country=="China")
# A tibble: 1 x 6
  country    EF2013   EF2009    GDP2013    GDP2009 continent
  <chr>       <dbl>    <dbl>      <dbl>      <dbl> <chr>    
1 China      5.01e9   4.25e9    7.77e12    5.51e12 Asia     
> da %>% filter(GDP2013>GDP2009)
# A tibble: 143 x 6
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 133 more rows
> da %>% filter(continent=="Asia" | continent=="Europe")
# A tibble: 72 x 6
   country        EF2013  EF2009  GDP2013  GDP2009 continent
   <chr>           <dbl>   <dbl>    <dbl>    <dbl> <chr>    
 1 Afghanistan    2.73e7  2.09e7  1.97e10  1.47e10 Asia     
 2 Albania        6.62e6  6.82e6  1.25e10  1.15e10 Europe   
 3 Armenia        6.19e6  5.84e6  1.07e10  9.06e 9 Asia     
 4 Austria        5.19e7  4.96e7  4.05e11  3.83e11 Europe   
 5 Azerbaijan     2.19e7  1.88e7  5.72e10  5.05e10 Asia     
 6 Bangladesh     1.18e8  1.00e8  1.39e11  1.09e11 Asia     
 7 Belarus        4.39e7  4.21e7  5.99e10  5.13e10 Europe   
 8 Belgium        7.68e7  7.78e7  4.93e11  4.71e11 Europe   
 9 Bhutan         3.48e6  2.85e6  1.84e 9  1.42e 9 Asia     
10 Bosnia and ~   1.23e7  1.23e7  1.76e10  1.70e10 Europe   
# ... with 62 more rows
> da %>% filter(GDP2013>GDP2009 & EF2013<EF2009)
# A tibble: 36 x 6
   country      EF2013  EF2009  GDP2013  GDP2009 continent  
   <chr>         <dbl>   <dbl>    <dbl>    <dbl> <chr>      
 1 Albania      6.62e6  6.82e6  1.25e10  1.15e10 Europe     
 2 Australia    2.05e8  2.16e8  1.24e12  1.12e12 Australia ~
 3 Barbados     9.50e5  1.21e6  4.49e 9  4.44e 9 North Amer~
 4 Belgium      7.68e7  7.78e7  4.93e11  4.71e11 Europe     
 5 Bosnia and~  1.23e7  1.23e7  1.76e10  1.70e10 Europe     
 6 Bulgaria     2.27e7  2.61e7  5.20e10  5.06e10 Europe     
 7 Burundi      6.61e6  7.25e6  2.30e 9  1.95e 9 Africa     
 8 Cuba         2.11e7  2.20e7  7.00e10  6.28e10 North Amer~
 9 Czech Repu~  5.58e7  5.83e7  2.08e11  2.02e11 Europe     
10 Denmark      3.44e7  3.81e7  3.22e11  3.15e11 Europe     
# ... with 26 more rows

如果数据中有重复的行,可以使用 distinct 函数去掉重复的部分,筛选出独一无二的行

> dada <- bind_rows(da, da)
> dada
# A tibble: 306 x 6
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 296 more rows
> distinct(dada)
# A tibble: 153 x 6
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 143 more rows
> distinct(da, continent)
# A tibble: 6 x 1
  continent            
  <chr>                
1 Asia                 
2 Europe               
3 Africa               
4 South America        
5 Australia and Oceania
6 North America        
> distinct(da, continent,  .keep_all=T)
# A tibble: 6 x 6
  country     EF2013   EF2009  GDP2013  GDP2009 continent   
  <chr>        <dbl>    <dbl>    <dbl>    <dbl> <chr>       
1 Afghanis~   2.73e7   2.09e7  1.97e10  1.47e10 Asia        
2 Albania     6.62e6   6.82e6  1.25e10  1.15e10 Europe      
3 Algeria     9.10e7   7.62e7  1.76e11  1.56e11 Africa      
4 Argentina   1.59e8   1.19e8  4.58e11  3.86e11 South Ameri~
5 Australia   2.05e8   2.16e8  1.24e12  1.12e12 Australia a~
6 Barbados    9.50e5   1.21e6  4.49e 9  4.44e 9 North Ameri~

slice() 根据行号筛选行

> da %>% slice(3:10)
# A tibble: 8 x 6
  country    EF2013   EF2009   GDP2013  GDP2009 continent   
  <chr>       <dbl>    <dbl>     <dbl>    <dbl> <chr>       
1 Algeria    9.10e7   7.62e7   1.76e11  1.56e11 Africa      
2 Angola     2.30e7   1.76e7   9.63e10  7.98e10 Africa      
3 Argenti~   1.59e8   1.19e8   4.58e11  3.86e11 South Ameri~
4 Armenia    6.19e6   5.84e6   1.07e10  9.06e 9 Asia        
5 Austral~   2.05e8   2.16e8   1.24e12  1.12e12 Australia a~
6 Austria    5.19e7   4.96e7   4.05e11  3.83e11 Europe      
7 Azerbai~   2.19e7   1.88e7   5.72e10  5.05e10 Asia        
8 Banglad~   1.18e8   1.00e8   1.39e11  1.09e11 Asia        

列子集
select()函数筛选出有用的列

> da %>% select(country, GDP2013)
# A tibble: 153 x 2
   country           GDP2013
   <chr>               <dbl>
 1 Afghanistan   19731337261
 2 Albania       12542465022
 3 Algeria      176325000000
 4 Angola        96261432368
 5 Argentina    457639000000
 6 Armenia       10736584306
 7 Australia   1241480000000
 8 Austria      404687000000
 9 Azerbaijan    57238335961
10 Bangladesh   138597000000
# ... with 143 more rows
> da %>% select(-c(country, continent))
# A tibble: 153 x 4
       EF2013     EF2009       GDP2013       GDP2009
        <dbl>      <dbl>         <dbl>         <dbl>
 1  27270736.  20868574.   19731337261   14697331941
 2   6616186.   6815357.   12542465022   11500292411
 3  90957355.  76216524.  176325000000  155605000000
 4  22957308.  17607133.   96261432368   79753199398
 5 158699622. 119344052.  457639000000  385956000000
 6   6193203.   5844214.   10736584306    9060944166
 7 204837815. 216496002. 1241480000000 1119650000000
 8  51883278.  49630217.  404687000000  382851000000
 9  21861066.  18768195.   57238335961   50453518393
10 117772779. 100248357.  138597000000  109195000000
# ... with 143 more rows
> da %>% select(country, starts_with("GDP"))
# A tibble: 153 x 3
   country           GDP2013       GDP2009
   <chr>               <dbl>         <dbl>
 1 Afghanistan   19731337261   14697331941
 2 Albania       12542465022   11500292411
 3 Algeria      176325000000  155605000000
 4 Angola        96261432368   79753199398
 5 Argentina    457639000000  385956000000
 6 Armenia       10736584306    9060944166
 7 Australia   1241480000000 1119650000000
 8 Austria      404687000000  382851000000
 9 Azerbaijan    57238335961   50453518393
10 Bangladesh   138597000000  109195000000
# ... with 143 more rows
> da %>% select(country, ends_with("2013"))
# A tibble: 153 x 3
   country         EF2013       GDP2013
   <chr>            <dbl>         <dbl>
 1 Afghanistan  27270736.   19731337261
 2 Albania       6616186.   12542465022
 3 Algeria      90957355.  176325000000
 4 Angola       22957308.   96261432368
 5 Argentina   158699622.  457639000000
 6 Armenia       6193203.   10736584306
 7 Australia   204837815. 1241480000000
 8 Austria      51883278.  404687000000
 9 Azerbaijan   21861066.   57238335961
10 Bangladesh  117772779.  138597000000
# ... with 143 more rows

arrange() 按照某种标准对行进行排序

> da %>% arrange(EF2013)
# A tibble: 153 x 6
   country      EF2013 EF2009  GDP2013  GDP2009 continent   
   <chr>         <dbl>  <dbl>    <dbl>    <dbl> <chr>       
 1 Dominica    161736. 2.04e5   4.91e8   4.91e8 North Ameri~
 2 Saint Kitt~ 239948. 2.04e5   7.62e8   7.21e8 North Ameri~
 3 Sao Tome a~ 295127. 2.74e5   2.23e8   1.85e8 Africa      
 4 Tonga       304041. 2.29e5   3.71e8   3.57e8 Australia a~
 5 Grenada     323041. 2.94e5   7.86e8   7.75e8 North Ameri~
 6 Bermuda     368750. 3.66e5   5.15e9   5.87e9 North Ameri~
 7 Saint Lucia 429986. 4.44e5   1.23e9   1.26e9 North Ameri~
 8 Samoa       442727. 4.49e5   6.84e8   6.54e8 Australia a~
 9 Comoros     788183. 7.32e5   5.80e8   5.19e8 Africa      
10 Timor-Leste 839659. 5.16e5   1.11e9   8.54e8 Asia        
# ... with 143 more rows
> da %>% arrange(desc(EF2013))
# A tibble: 153 x 6
   country         EF2013  EF2009 GDP2013 GDP2009 continent 
   <chr>            <dbl>   <dbl>   <dbl>   <dbl> <chr>     
 1 China           5.01e9  4.25e9 7.77e12 5.51e12 Asia      
 2 United State~   2.72e9  2.71e9 1.58e13 1.46e13 North Ame~
 3 India           1.36e9  1.20e9 1.99e12 1.50e12 Asia      
 4 Russian Fede~   8.20e8  7.49e8 1.67e12 1.46e12 Europe    
 5 Japan           6.33e8  5.85e8 5.89e12 5.47e12 Asia      
 6 Brazil          6.16e8  5.32e8 2.41e12 2.05e12 South Ame~
 7 Germany         4.40e8  4.25e8 3.58e12 3.28e12 Europe    
 8 Indonesia       3.64e8  3.39e8 8.97e11 7.11e11 Asia      
 9 France          3.23e8  3.43e8 2.72e12 2.60e12 Europe    
10 United Kingd~   3.23e8  3.75e8 2.55e12 2.38e12 Europe    
# ... with 143 more rows
> da %>% arrange(EF2013/EF2009)
# A tibble: 153 x 6
   country       EF2013   EF2009 GDP2013 GDP2009 continent  
   <chr>          <dbl>    <dbl>   <dbl>   <dbl> <chr>      
 1 Cyprus        3.77e6   5.74e6 2.34e10 2.52e10 Europe     
 2 Greece        4.65e7   6.39e7 2.44e11 3.17e11 Europe     
 3 Barbados      9.50e5   1.21e6 4.49e 9 4.44e 9 North Amer~
 4 Dominica      1.62e5   2.04e5 4.91e 8 4.91e 8 North Amer~
 5 Portugal      4.05e7   4.79e7 2.22e11 2.34e11 Europe     
 6 United Kin~   3.23e8   3.75e8 2.55e12 2.38e12 Europe     
 7 Spain         1.87e8   2.17e8 1.36e12 1.43e12 Europe     
 8 Gambia        1.72e6   1.98e6 1.01e 9 8.94e 8 Africa     
 9 Bulgaria      2.27e7   2.61e7 5.20e10 5.06e10 Europe     
10 Macedonia ~   6.34e6   7.25e6 9.86e 9 9.10e 9 Europe     
# ... with 143 more rows

mutate 函数增加新列

> da %>% mutate(EF_delta = EF2013 - EF2009)
# A tibble: 153 x 7
   country  EF2013 EF2009 GDP2013 GDP2009 continent EF_delta
   <chr>     <dbl>  <dbl>   <dbl>   <dbl> <chr>        <dbl>
 1 Afghani~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia        6.40e6
 2 Albania  6.62e6 6.82e6 1.25e10 1.15e10 Europe     -1.99e5
 3 Algeria  9.10e7 7.62e7 1.76e11 1.56e11 Africa      1.47e7
 4 Angola   2.30e7 1.76e7 9.63e10 7.98e10 Africa      5.35e6
 5 Argenti~ 1.59e8 1.19e8 4.58e11 3.86e11 South Am~   3.94e7
 6 Armenia  6.19e6 5.84e6 1.07e10 9.06e 9 Asia        3.49e5
 7 Austral~ 2.05e8 2.16e8 1.24e12 1.12e12 Australi~  -1.17e7
 8 Austria  5.19e7 4.96e7 4.05e11 3.83e11 Europe      2.25e6
 9 Azerbai~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia        3.09e6
10 Banglad~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia        1.75e7
# ... with 143 more rows
> da %>% transmute(country, GDP_delta = GDP2013 - GDP2009, EF_delta = EF2013 - EF2009)
# A tibble: 153 x 3
   country        GDP_delta   EF_delta
   <chr>              <dbl>      <dbl>
 1 Afghanistan   5034005320   6402162.
 2 Albania       1042172611   -199171.
 3 Algeria      20720000000  14740831.
 4 Angola       16508232970   5350176.
 5 Argentina    71683000000  39355571.
 6 Armenia       1675640140    348989.
 7 Australia   121830000000 -11658186.
 8 Austria      21836000000   2253061.
 9 Azerbaijan    6784817568   3092871.
10 Bangladesh   29402000000  17524422.
# ... with 143 more rows

group_by 和 summarise 函数

> da %>% group_by(continent)
# A tibble: 153 x 6
# Groups:   continent [6]
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 143 more rows
> da %>% group_by(continent) %>% summarise(countries=n())
# A tibble: 6 x 2
  continent             countries
  <chr>                     <int>
1 Africa                       45
2 Asia                         35
3 Australia and Oceania         6
4 Europe                       37
5 North America                20
6 South America                10
> da %>% group_by(continent) %>% summarise(GDP2013=mean(GDP2013))
# A tibble: 6 x 2
  continent                   GDP2013
  <chr>                         <dbl>
1 Africa                 43872335750.
2 Asia                  619164666100.
3 Australia and Oceania 236036198287.
4 Europe                544046378843.
5 North America         952446625871.
6 South America         416427328953.

实战案例

数据整理

> da
# A tibble: 153 x 6
   country     EF2013  EF2009  GDP2013  GDP2009 continent   
   <chr>        <dbl>   <dbl>    <dbl>    <dbl> <chr>       
 1 Afghanis~   2.73e7  2.09e7  1.97e10  1.47e10 Asia        
 2 Albania     6.62e6  6.82e6  1.25e10  1.15e10 Europe      
 3 Algeria     9.10e7  7.62e7  1.76e11  1.56e11 Africa      
 4 Angola      2.30e7  1.76e7  9.63e10  7.98e10 Africa      
 5 Argentina   1.59e8  1.19e8  4.58e11  3.86e11 South Ameri~
 6 Armenia     6.19e6  5.84e6  1.07e10  9.06e 9 Asia        
 7 Australia   2.05e8  2.16e8  1.24e12  1.12e12 Australia a~
 8 Austria     5.19e7  4.96e7  4.05e11  3.83e11 Europe      
 9 Azerbaij~   2.19e7  1.88e7  5.72e10  5.05e10 Asia        
10 Banglade~   1.18e8  1.00e8  1.39e11  1.09e11 Asia        
# ... with 143 more rows

变量 year 分散到了不同的列,这也导致它的每一行包含多个观测值
考虑把 EF2013 和 EF2009 塞进一列,GDP2013 和 GDP2009 塞进另一列

> tida <- da %>% gather(EFyear, EF, EF2009:EF2013) %>% gather(GDPyear, GDP, GDP2009:GDP2013)
> tida
# A tibble: 612 x 6
   country   continent      EFyear      EF GDPyear       GDP
   <chr>     <chr>          <chr>    <dbl> <chr>       <dbl>
 1 Afghanis~ Asia           EF2009  2.09e7 GDP2009   1.47e10
 2 Albania   Europe         EF2009  6.82e6 GDP2009   1.15e10
 3 Algeria   Africa         EF2009  7.62e7 GDP2009   1.56e11
 4 Angola    Africa         EF2009  1.76e7 GDP2009   7.98e10
 5 Argentina South America  EF2009  1.19e8 GDP2009   3.86e11
 6 Armenia   Asia           EF2009  5.84e6 GDP2009   9.06e 9
 7 Australia Australia and~ EF2009  2.16e8 GDP2009   1.12e12
 8 Austria   Europe         EF2009  4.96e7 GDP2009   3.83e11
 9 Azerbaij~ Asia           EF2009  1.88e7 GDP2009   5.05e10
10 Banglade~ Asia           EF2009  1.00e8 GDP2009   1.09e11
# ... with 602 more rows

tida 显示有 612 行,这意味着有许多无效的交叉行(比如 EF2009 和 GDP 2013 在同一行),我们需要剔除交叉行, 用 separate() 和 filter 来完成剔除的任务

> tidda <- tida %>% separate(EFyear, c('useless1', 'year'), 2) %>% separate(GDPyear, c('useless2', 'year_star'), 3) %>% filter(year==year_star) %>% select(-c(useless1, useless2, year_star)) %>% arrange(country)
> tidda
# A tibble: 306 x 5
   country       continent year          EF           GDP
   <chr>         <chr>     <chr>      <dbl>         <dbl>
 1 "\"Korea"     Asia      2009  269778293. 1027730000000
 2 "\"Korea"     Asia      2013  291752674. 1194430000000
 3 "Afghanistan" Asia      2009   20868574.   14697331941
 4 "Afghanistan" Asia      2013   27270736.   19731337261
 5 "Albania"     Europe    2009    6815357.   11500292411
 6 "Albania"     Europe    2013    6616186.   12542465022
 7 "Algeria"     Africa    2009   76216524.  155605000000
 8 "Algeria"     Africa    2013   90957355.  176325000000
 9 "Angola"      Africa    2009   17607133.   79753199398
10 "Angola"      Africa    2013   22957308.   96261432368
# ... with 296 more rows

另外,我们还可以用 rename() 和分步 gather() 来取代 separate()

> tidyda <- da %>% rename(`2013`=EF2013, `2009`=EF2009) %>% gather(year, EF, `2009`:`2013`) %>% rename(`2013`=GDP2013, `2009`=GDP2009) %>% gather(GDPyear, GDP, `2009`:`2013`) %>% filter(GDPyear==year) %>% select(-GDPyear) %>% arrange(country)
> tidyda
# A tibble: 306 x 5
   country       continent year          EF           GDP
   <chr>         <chr>     <chr>      <dbl>         <dbl>
 1 "\"Korea"     Asia      2009  269778293. 1027730000000
 2 "\"Korea"     Asia      2013  291752674. 1194430000000
 3 "Afghanistan" Asia      2009   20868574.   14697331941
 4 "Afghanistan" Asia      2013   27270736.   19731337261
 5 "Albania"     Europe    2009    6815357.   11500292411
 6 "Albania"     Europe    2013    6616186.   12542465022
 7 "Algeria"     Africa    2009   76216524.  155605000000
 8 "Algeria"     Africa    2013   90957355.  176325000000
 9 "Angola"      Africa    2009   17607133.   79753199398
10 "Angola"      Africa    2013   22957308.   96261432368
# ... with 296 more rows

数据分析
数据是否“杂乱”是人为定义的,并不是只有 tidy dataset 才能用于数据分析,事实上,很多“杂乱”的数据是非常有用的,tidy dataset 应该是你的默认选项,而不一定是最佳选项

> whoda <- da %>%
+   transmute(
+     country,
+     EF_delta_p = 100*(EF2013/EF2009 - 1),
+     GDP_delta_p = 100*(GDP2013/GDP2009 - 1),
+     DD_rate = GDP_delta_p - EF_delta_p,
+     continent,
+     GDP2013
+   )
> whoda
# A tibble: 153 x 6
   country  EF_delta_p GDP_delta_p DD_rate continent GDP2013
   <chr>         <dbl>       <dbl>   <dbl> <chr>       <dbl>
 1 Afghani~      30.7        34.3     3.57 Asia      1.97e10
 2 Albania       -2.92        9.06   12.0  Europe    1.25e10
 3 Algeria       19.3        13.3    -6.02 Africa    1.76e11
 4 Angola        30.4        20.7    -9.69 Africa    9.63e10
 5 Argenti~      33.0        18.6   -14.4  South Am~ 4.58e11
 6 Armenia        5.97       18.5    12.5  Asia      1.07e10
 7 Austral~      -5.38       10.9    16.3  Australi~ 1.24e12
 8 Austria        4.54        5.70    1.16 Europe    4.05e11
 9 Azerbai~      16.5        13.4    -3.03 Asia      5.72e10
10 Banglad~      17.5        26.9     9.45 Asia      1.39e11
# ... with 143 more rows

我们想看经济增长与自然环境的脱钩程度最大的 10 个国家

> whoda %>% arrange(desc(DD_rate)) %>% slice(1:10)
# A tibble: 10 x 6
   country  EF_delta_p GDP_delta_p DD_rate continent GDP2013
   <chr>         <dbl>       <dbl>   <dbl> <chr>       <dbl>
 1 Papua N~    -10.9          35.9    46.9 Australi~ 1.23e10
 2 Ethiopia     12.1          50.3    38.3 Africa    4.00e10
 3 Myanmar      -3.44         34.7    38.2 Asia      6.09e10
 4 Philipp~     -9.67         27.4    37.1 Asia      2.36e11
 5 Sierra ~     17.7          53.5    35.9 Africa    3.81e 9
 6 Zimbabwe      9.18         44.0    34.8 Africa    1.22e10
 7 Mongolia     22.7          56.4    33.7 Asia      1.06e10
 8 Uzbekis~      4.10         37.3    33.2 Asia      4.98e10
 9 Turkmen~     21.0          53.3    32.3 Asia      3.17e10
10 Singapo~      0.755        32.8    32.1 Asia      2.72e11

绝对脱钩的国家

> whoda %>% filter(GDP_delta_p>0, EF_delta_p<0)
# A tibble: 36 x 6
   country  EF_delta_p GDP_delta_p DD_rate continent GDP2013
   <chr>         <dbl>       <dbl>   <dbl> <chr>       <dbl>
 1 Albania      -2.92         9.06   12.0  Europe    1.25e10
 2 Austral~     -5.38        10.9    16.3  Australi~ 1.24e12
 3 Barbados    -21.3          1.24   22.6  North Am~ 4.49e 9
 4 Belgium      -1.19         4.61    5.80 Europe    4.93e11
 5 Bosnia ~     -0.156        3.15    3.30 Europe    1.76e10
 6 Bulgaria    -13.2          2.88   16.0  Europe    5.20e10
 7 Burundi      -8.78        17.7    26.4  Africa    2.30e 9
 8 Cuba         -4.20        11.3    15.5  North Am~ 7.00e10
 9 Czech R~     -4.31         3.01    7.32 Europe    2.08e11
10 Denmark      -9.74         2.47   12.2  Europe    3.22e11
# ... with 26 more rows

筛选出每个大洲 2013 年 GDP 排名前三的国家

> whoda %>%
+   group_by(continent) %>%
+   arrange(desc(GDP2013)) %>%
+   slice(1:3) %>%
+   select(continent, country, GDP2013)
# A tibble: 18 x 3
# Groups:   continent [6]
   continent             country                  GDP2013
   <chr>                 <chr>                      <dbl>
 1 Africa                Nigeria                  4.25e11
 2 Africa                South Africa             4.05e11
 3 Africa                Egypt                    2.33e11
 4 Asia                  China                    7.77e12
 5 Asia                  Japan                    5.89e12
 6 Asia                  India                    1.99e12
 7 Australia and Oceania Australia                1.24e12
 8 Australia and Oceania New Zealand              1.58e11
 9 Australia and Oceania Papua New Guinea         1.23e10
10 Europe                Germany                  3.58e12
11 Europe                France                   2.72e12
12 Europe                United Kingdom           2.55e12
13 North America         United States of America 1.58e13
14 North America         Canada                   1.73e12
15 North America         Mexico                   1.15e12
16 South America         Brazil                   2.41e12
17 South America         Argentina                4.58e11
18 South America         Venezuela                4.39e11
> ggplot(whoda, aes(GDP_delta_p, EF_delta_p)) +
+   geom_point(aes(size=GDP2013, color=continent)) +
+   geom_text(aes(label=country), size=2.5, hjust=-.2) +
+   geom_vline(xintercept=0, linetype='dashed', size=1, color='blue') +
+   geom_hline(yintercept=0, linetype='dashed', size=1, color='blue') +
+   guides(size=F) +
+   theme(legend.position=c(.1, .85), legend.background=element_blank(), legend.key=element_blank()) +
+   xlab('GDP Growth') +
+   ylab('EF Growth')
image.png

美国的 GDP 增长没有想象中那么快,因为总量已经很大;左上角代表 GDP 负增长、环境压力却加大了的国家,图中显示为百慕大群岛;欧洲大部分国家都集中在水平参考线附近及以下,说明欧洲的社会发展水平和经济结构已经不太依赖于自然资源……

> whoda %>% group_by(continent) %>% summarise(GDP=mean(GDP2013)) %>%
+   ggplot() +
+   geom_bar(aes(x=continent, y=GDP), stat='identity') +
+   coord_flip()
image.png

各大洲的国均 GDP,北美仍然是最大的,样本国中亚洲的平均 GDP 已经超过了欧洲。

相关文章

网友评论

      本文标题:08数据整理与转换:tidyr和dplyr应用

      本文链接:https://www.haomeiwen.com/subject/plfupktx.html