tidy data规则
tidyverse 生态链中最常用的数据类型 tibble,tibble 默认每行为一个观测值,每列为一个变量,而 tidyverse 的操作都基于 tibble,一个整齐的数据集(tidy dataset)需要满足以下三个规则:
每个变量有单独的列。
每个观测值有单独的行。
每个值有单独的单元格。
不整齐的数据集(untidy dataset)可能遭遇了以下问题:
一个变量被分散到了多列。
一个观测值被分散到了多行。
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
√ ggplot2 3.3.1 √ purrr 0.3.4
√ tibble 3.0.1 √ dplyr 0.8.5
√ tidyr 1.1.0 √ stringr 1.4.0
√ readr 1.3.1 √ forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Warning messages:
1: 程辑包‘tibble’是用R版本3.6.3 来建造的
2: 程辑包‘tidyr’是用R版本3.6.3 来建造的
3: 程辑包‘purrr’是用R版本3.6.3 来建造的
4: 程辑包‘dplyr’是用R版本3.6.3 来建造的
5: 程辑包‘forcats’是用R版本3.6.3 来建造的
> tidytable <- as.tibble(tidytable)
Warning messages:
1: `as.tibble()` is deprecated as of tibble 2.0.0.
Please use `as_tibble()` instead.
The signature and semantics have changed, see `?as_tibble`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
2: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
> tidytable <- readLines("http://labfile.oss.aliyuncs.com/courses/982/tidytable.csv")
> tidytable <- unlist(strsplit(tidytable, split=","))
> tidytable <- matrix(tidytable, ncol=4, byrow=T)
> tidytable <- as_tibble(tidytable)
> colnames(tidytable) <- tidytable[1,]
> tidytable <- tidytable[-1,]
> tidytable$year <- as.integer(tidytable$year)
> tidytable$EF <- as.double(tidytable$EF)
> tidytable$GDP <- as.double(tidytable$GDP)
> tidytable
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
生态足迹(EF, Ecology Footprint)衡量一国维持当前生活水平所需要的自然资源,EF 越小说明该地区自然环境所承受的压力越小。国内生产总值(GDP, Gross Domestic Product)衡量一国的发展水平,GDP 越大说明该地区(从总量上来说)越富有。
gather函数
gather 函数用于处理一个变量被分散到多列的情形
> eg1 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg1.csv")
> eg1 <- unlist(strsplit(eg1, split=","))
> eg1 <- matrix(eg1, ncol=3, byrow=T)
> eg1 <- as_tibble(eg1)
> colnames(eg1) <- eg1[1,]
> eg1 <- eg1[-1,]
> eg1$`2009` <- as.double(eg1$`2009`)
> eg1$`2013` <- as.double(eg1$`2013`)
> eg1
# A tibble: 4 x 3
country `2009` `2013`
<chr> <dbl> <dbl>
1 Afghanistan 1.47 1.97
2 Brazil 205 241
3 China 551 777
4 Denmark 31.5 32.2
> eg1 %>% gather(key=year, value=GDP, `2009`:`2013`)
# A tibble: 8 x 3
country year GDP
<chr> <chr> <dbl>
1 Afghanistan 2009 1.47
2 Brazil 2009 205
3 China 2009 551
4 Denmark 2009 31.5
5 Afghanistan 2013 1.97
6 Brazil 2013 241
7 China 2013 777
8 Denmark 2013 32.2
> gather(eg1, key=year, value=GDP, `2009`:`2013`)
# A tibble: 8 x 3
country year GDP
<chr> <chr> <dbl>
1 Afghanistan 2009 1.47
2 Brazil 2009 205
3 China 2009 551
4 Denmark 2009 31.5
5 Afghanistan 2013 1.97
6 Brazil 2013 241
7 China 2013 777
8 Denmark 2013 32.2
spread函数
于处理一个观测值被分散到多行的情形
> eg2 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg2.csv")
> eg2 <- unlist(strsplit(eg2, split=","))
> eg2 <- matrix(eg2, ncol=4, byrow=T)
> eg2 <- as_tibble(eg2)
> colnames(eg2) <- eg2[1,]
> eg2 <- eg2[-1,]
> eg2$year <- as.integer(eg2$year)
> eg2$count <- as.double(eg2$count)
> eg2
# A tibble: 16 x 4
country year type count
<chr> <int> <chr> <dbl>
1 Afghanistan 2009 GDP 1.47
2 Afghanistan 2009 EF 20.9
3 Afghanistan 2013 GDP 1.97
4 Afghanistan 2013 EF 27.3
5 Brazil 2009 GDP 205
6 Brazil 2009 EF 532
7 Brazil 2013 GDP 241
8 Brazil 2013 EF 616
9 China 2009 GDP 551
10 China 2009 EF 4253
11 China 2013 GDP 777
12 China 2013 EF 5010
13 Denmark 2009 GDP 31.5
14 Denmark 2009 EF 38.1
15 Denmark 2013 GDP 32.2
16 Denmark 2013 EF 34.4
> eg2 %>% spread(key=type, value=count)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
> spread(eg2, key=type, value=count)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
> eg2 %>% spread(type, count)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
separate函数
> eg3 <- readLines("http://labfile.oss.aliyuncs.com/courses/982/eg3.csv")
> eg3 <- unlist(strsplit(eg3, split=","))
> eg3 <- matrix(eg3, ncol=3, byrow=T)
> eg3 <- as.tibble(eg3)
> colnames(eg3) <- eg3[1,]
> eg3 <- eg3[-1,]
> eg3$year <- as.integer(eg3$year)
> eg3
# A tibble: 8 x 3
country year rate
<chr> <int> <chr>
1 Afghanistan 2009 20.9/1.47
2 Afghanistan 2013 27.3/1.97
3 Brazil 2009 532./205
4 Brazil 2013 616/241
5 China 2009 4253/551
6 China 2013 5010/777
7 Denmark 2009 38.1/31.5
8 Denmark 2013 34.4/32.2
> eg3 %>% separate(col=rate, into=c("EF", "GDP"), sep="/", convert=T)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
> separate(eg3, col=rate, into=c("EF", "GDP"), sep="/", convert=T)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
> eg3 %>% separate(rate, c("EF", "GDP"), "/", convert=T)
# A tibble: 8 x 4
country year EF GDP
<chr> <int> <dbl> <dbl>
1 Afghanistan 2009 20.9 1.47
2 Afghanistan 2013 27.3 1.97
3 Brazil 2009 532 205
4 Brazil 2013 616 241
5 China 2009 4253 551
6 China 2013 5010 777
7 Denmark 2009 38.1 31.5
8 Denmark 2013 34.4 32.2
unite函数
unite 函数是 separate 函数的逆过程
> tidytable %>% unite("rate", c("EF", "GDP"), sep="/")
# A tibble: 8 x 3
country year rate
<chr> <int> <chr>
1 Afghanistan 2009 20.9/1.47
2 Afghanistan 2013 27.3/1.97
3 Brazil 2009 532/205
4 Brazil 2013 616/241
5 China 2009 4253/551
6 China 2013 5010/777
7 Denmark 2009 38.1/31.5
8 Denmark 2013 34.4/32.2
> eg3 %>% separate(rate, c("EF", "GDP"), sep="/") %>% unite("rate", c("EF", "GDP"), sep="/")
# A tibble: 8 x 3
country year rate
<chr> <int> <chr>
1 Afghanistan 2009 20.9/1.47
2 Afghanistan 2013 27.3/1.97
3 Brazil 2009 532./205
4 Brazil 2013 616/241
5 China 2009 4253/551
6 China 2013 5010/777
7 Denmark 2009 38.1/31.5
8 Denmark 2013 34.4/32.2
数据转换
> da <- readLines("http://labfile.oss.aliyuncs.com/courses/982/GDP_EF.csv")
> da <- unlist(strsplit(da, split=","))
> da <- c(da[1:463], da[465:length(da)])
> da <- matrix(da, ncol=6, byrow=T)
> da <- as_tibble(da)
> colnames(da) <- da[1,]
> da <- da[-1,]
> da$EF2013 <- as.double(da$EF2013)
> da$EF2009 <- as.double(da$EF2009)
> da$GDP2013 <- as.double(da$GDP2013)
> da$GDP2009 <- as.double(da$GDP2009)
> da
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 143 more rows
行子集
filter 函数 filter() 是最常用的行筛选函数
> da %>% filter(country=="China")
# A tibble: 1 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 China 5.01e9 4.25e9 7.77e12 5.51e12 Asia
> da %>% filter(GDP2013>GDP2009)
# A tibble: 143 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 133 more rows
> da %>% filter(continent=="Asia" | continent=="Europe")
# A tibble: 72 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanistan 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
4 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
5 Azerbaijan 2.19e7 1.88e7 5.72e10 5.05e10 Asia
6 Bangladesh 1.18e8 1.00e8 1.39e11 1.09e11 Asia
7 Belarus 4.39e7 4.21e7 5.99e10 5.13e10 Europe
8 Belgium 7.68e7 7.78e7 4.93e11 4.71e11 Europe
9 Bhutan 3.48e6 2.85e6 1.84e 9 1.42e 9 Asia
10 Bosnia and ~ 1.23e7 1.23e7 1.76e10 1.70e10 Europe
# ... with 62 more rows
> da %>% filter(GDP2013>GDP2009 & EF2013<EF2009)
# A tibble: 36 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
2 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia ~
3 Barbados 9.50e5 1.21e6 4.49e 9 4.44e 9 North Amer~
4 Belgium 7.68e7 7.78e7 4.93e11 4.71e11 Europe
5 Bosnia and~ 1.23e7 1.23e7 1.76e10 1.70e10 Europe
6 Bulgaria 2.27e7 2.61e7 5.20e10 5.06e10 Europe
7 Burundi 6.61e6 7.25e6 2.30e 9 1.95e 9 Africa
8 Cuba 2.11e7 2.20e7 7.00e10 6.28e10 North Amer~
9 Czech Repu~ 5.58e7 5.83e7 2.08e11 2.02e11 Europe
10 Denmark 3.44e7 3.81e7 3.22e11 3.15e11 Europe
# ... with 26 more rows
如果数据中有重复的行,可以使用 distinct 函数去掉重复的部分,筛选出独一无二的行
> dada <- bind_rows(da, da)
> dada
# A tibble: 306 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 296 more rows
> distinct(dada)
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 143 more rows
> distinct(da, continent)
# A tibble: 6 x 1
continent
<chr>
1 Asia
2 Europe
3 Africa
4 South America
5 Australia and Oceania
6 North America
> distinct(da, continent, .keep_all=T)
# A tibble: 6 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
5 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
6 Barbados 9.50e5 1.21e6 4.49e 9 4.44e 9 North Ameri~
slice() 根据行号筛选行
> da %>% slice(3:10)
# A tibble: 8 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
2 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
3 Argenti~ 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
4 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
5 Austral~ 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
6 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
7 Azerbai~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
8 Banglad~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
列子集
select()函数筛选出有用的列
> da %>% select(country, GDP2013)
# A tibble: 153 x 2
country GDP2013
<chr> <dbl>
1 Afghanistan 19731337261
2 Albania 12542465022
3 Algeria 176325000000
4 Angola 96261432368
5 Argentina 457639000000
6 Armenia 10736584306
7 Australia 1241480000000
8 Austria 404687000000
9 Azerbaijan 57238335961
10 Bangladesh 138597000000
# ... with 143 more rows
> da %>% select(-c(country, continent))
# A tibble: 153 x 4
EF2013 EF2009 GDP2013 GDP2009
<dbl> <dbl> <dbl> <dbl>
1 27270736. 20868574. 19731337261 14697331941
2 6616186. 6815357. 12542465022 11500292411
3 90957355. 76216524. 176325000000 155605000000
4 22957308. 17607133. 96261432368 79753199398
5 158699622. 119344052. 457639000000 385956000000
6 6193203. 5844214. 10736584306 9060944166
7 204837815. 216496002. 1241480000000 1119650000000
8 51883278. 49630217. 404687000000 382851000000
9 21861066. 18768195. 57238335961 50453518393
10 117772779. 100248357. 138597000000 109195000000
# ... with 143 more rows
> da %>% select(country, starts_with("GDP"))
# A tibble: 153 x 3
country GDP2013 GDP2009
<chr> <dbl> <dbl>
1 Afghanistan 19731337261 14697331941
2 Albania 12542465022 11500292411
3 Algeria 176325000000 155605000000
4 Angola 96261432368 79753199398
5 Argentina 457639000000 385956000000
6 Armenia 10736584306 9060944166
7 Australia 1241480000000 1119650000000
8 Austria 404687000000 382851000000
9 Azerbaijan 57238335961 50453518393
10 Bangladesh 138597000000 109195000000
# ... with 143 more rows
> da %>% select(country, ends_with("2013"))
# A tibble: 153 x 3
country EF2013 GDP2013
<chr> <dbl> <dbl>
1 Afghanistan 27270736. 19731337261
2 Albania 6616186. 12542465022
3 Algeria 90957355. 176325000000
4 Angola 22957308. 96261432368
5 Argentina 158699622. 457639000000
6 Armenia 6193203. 10736584306
7 Australia 204837815. 1241480000000
8 Austria 51883278. 404687000000
9 Azerbaijan 21861066. 57238335961
10 Bangladesh 117772779. 138597000000
# ... with 143 more rows
arrange() 按照某种标准对行进行排序
> da %>% arrange(EF2013)
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Dominica 161736. 2.04e5 4.91e8 4.91e8 North Ameri~
2 Saint Kitt~ 239948. 2.04e5 7.62e8 7.21e8 North Ameri~
3 Sao Tome a~ 295127. 2.74e5 2.23e8 1.85e8 Africa
4 Tonga 304041. 2.29e5 3.71e8 3.57e8 Australia a~
5 Grenada 323041. 2.94e5 7.86e8 7.75e8 North Ameri~
6 Bermuda 368750. 3.66e5 5.15e9 5.87e9 North Ameri~
7 Saint Lucia 429986. 4.44e5 1.23e9 1.26e9 North Ameri~
8 Samoa 442727. 4.49e5 6.84e8 6.54e8 Australia a~
9 Comoros 788183. 7.32e5 5.80e8 5.19e8 Africa
10 Timor-Leste 839659. 5.16e5 1.11e9 8.54e8 Asia
# ... with 143 more rows
> da %>% arrange(desc(EF2013))
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 China 5.01e9 4.25e9 7.77e12 5.51e12 Asia
2 United State~ 2.72e9 2.71e9 1.58e13 1.46e13 North Ame~
3 India 1.36e9 1.20e9 1.99e12 1.50e12 Asia
4 Russian Fede~ 8.20e8 7.49e8 1.67e12 1.46e12 Europe
5 Japan 6.33e8 5.85e8 5.89e12 5.47e12 Asia
6 Brazil 6.16e8 5.32e8 2.41e12 2.05e12 South Ame~
7 Germany 4.40e8 4.25e8 3.58e12 3.28e12 Europe
8 Indonesia 3.64e8 3.39e8 8.97e11 7.11e11 Asia
9 France 3.23e8 3.43e8 2.72e12 2.60e12 Europe
10 United Kingd~ 3.23e8 3.75e8 2.55e12 2.38e12 Europe
# ... with 143 more rows
> da %>% arrange(EF2013/EF2009)
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Cyprus 3.77e6 5.74e6 2.34e10 2.52e10 Europe
2 Greece 4.65e7 6.39e7 2.44e11 3.17e11 Europe
3 Barbados 9.50e5 1.21e6 4.49e 9 4.44e 9 North Amer~
4 Dominica 1.62e5 2.04e5 4.91e 8 4.91e 8 North Amer~
5 Portugal 4.05e7 4.79e7 2.22e11 2.34e11 Europe
6 United Kin~ 3.23e8 3.75e8 2.55e12 2.38e12 Europe
7 Spain 1.87e8 2.17e8 1.36e12 1.43e12 Europe
8 Gambia 1.72e6 1.98e6 1.01e 9 8.94e 8 Africa
9 Bulgaria 2.27e7 2.61e7 5.20e10 5.06e10 Europe
10 Macedonia ~ 6.34e6 7.25e6 9.86e 9 9.10e 9 Europe
# ... with 143 more rows
mutate 函数增加新列
> da %>% mutate(EF_delta = EF2013 - EF2009)
# A tibble: 153 x 7
country EF2013 EF2009 GDP2013 GDP2009 continent EF_delta
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghani~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia 6.40e6
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe -1.99e5
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa 1.47e7
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa 5.35e6
5 Argenti~ 1.59e8 1.19e8 4.58e11 3.86e11 South Am~ 3.94e7
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia 3.49e5
7 Austral~ 2.05e8 2.16e8 1.24e12 1.12e12 Australi~ -1.17e7
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe 2.25e6
9 Azerbai~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia 3.09e6
10 Banglad~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia 1.75e7
# ... with 143 more rows
> da %>% transmute(country, GDP_delta = GDP2013 - GDP2009, EF_delta = EF2013 - EF2009)
# A tibble: 153 x 3
country GDP_delta EF_delta
<chr> <dbl> <dbl>
1 Afghanistan 5034005320 6402162.
2 Albania 1042172611 -199171.
3 Algeria 20720000000 14740831.
4 Angola 16508232970 5350176.
5 Argentina 71683000000 39355571.
6 Armenia 1675640140 348989.
7 Australia 121830000000 -11658186.
8 Austria 21836000000 2253061.
9 Azerbaijan 6784817568 3092871.
10 Bangladesh 29402000000 17524422.
# ... with 143 more rows
group_by 和 summarise 函数
> da %>% group_by(continent)
# A tibble: 153 x 6
# Groups: continent [6]
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 143 more rows
> da %>% group_by(continent) %>% summarise(countries=n())
# A tibble: 6 x 2
continent countries
<chr> <int>
1 Africa 45
2 Asia 35
3 Australia and Oceania 6
4 Europe 37
5 North America 20
6 South America 10
> da %>% group_by(continent) %>% summarise(GDP2013=mean(GDP2013))
# A tibble: 6 x 2
continent GDP2013
<chr> <dbl>
1 Africa 43872335750.
2 Asia 619164666100.
3 Australia and Oceania 236036198287.
4 Europe 544046378843.
5 North America 952446625871.
6 South America 416427328953.
实战案例
数据整理
> da
# A tibble: 153 x 6
country EF2013 EF2009 GDP2013 GDP2009 continent
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanis~ 2.73e7 2.09e7 1.97e10 1.47e10 Asia
2 Albania 6.62e6 6.82e6 1.25e10 1.15e10 Europe
3 Algeria 9.10e7 7.62e7 1.76e11 1.56e11 Africa
4 Angola 2.30e7 1.76e7 9.63e10 7.98e10 Africa
5 Argentina 1.59e8 1.19e8 4.58e11 3.86e11 South Ameri~
6 Armenia 6.19e6 5.84e6 1.07e10 9.06e 9 Asia
7 Australia 2.05e8 2.16e8 1.24e12 1.12e12 Australia a~
8 Austria 5.19e7 4.96e7 4.05e11 3.83e11 Europe
9 Azerbaij~ 2.19e7 1.88e7 5.72e10 5.05e10 Asia
10 Banglade~ 1.18e8 1.00e8 1.39e11 1.09e11 Asia
# ... with 143 more rows
变量 year 分散到了不同的列,这也导致它的每一行包含多个观测值
考虑把 EF2013 和 EF2009 塞进一列,GDP2013 和 GDP2009 塞进另一列
> tida <- da %>% gather(EFyear, EF, EF2009:EF2013) %>% gather(GDPyear, GDP, GDP2009:GDP2013)
> tida
# A tibble: 612 x 6
country continent EFyear EF GDPyear GDP
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 Afghanis~ Asia EF2009 2.09e7 GDP2009 1.47e10
2 Albania Europe EF2009 6.82e6 GDP2009 1.15e10
3 Algeria Africa EF2009 7.62e7 GDP2009 1.56e11
4 Angola Africa EF2009 1.76e7 GDP2009 7.98e10
5 Argentina South America EF2009 1.19e8 GDP2009 3.86e11
6 Armenia Asia EF2009 5.84e6 GDP2009 9.06e 9
7 Australia Australia and~ EF2009 2.16e8 GDP2009 1.12e12
8 Austria Europe EF2009 4.96e7 GDP2009 3.83e11
9 Azerbaij~ Asia EF2009 1.88e7 GDP2009 5.05e10
10 Banglade~ Asia EF2009 1.00e8 GDP2009 1.09e11
# ... with 602 more rows
tida 显示有 612 行,这意味着有许多无效的交叉行(比如 EF2009 和 GDP 2013 在同一行),我们需要剔除交叉行, 用 separate() 和 filter 来完成剔除的任务
> tidda <- tida %>% separate(EFyear, c('useless1', 'year'), 2) %>% separate(GDPyear, c('useless2', 'year_star'), 3) %>% filter(year==year_star) %>% select(-c(useless1, useless2, year_star)) %>% arrange(country)
> tidda
# A tibble: 306 x 5
country continent year EF GDP
<chr> <chr> <chr> <dbl> <dbl>
1 "\"Korea" Asia 2009 269778293. 1027730000000
2 "\"Korea" Asia 2013 291752674. 1194430000000
3 "Afghanistan" Asia 2009 20868574. 14697331941
4 "Afghanistan" Asia 2013 27270736. 19731337261
5 "Albania" Europe 2009 6815357. 11500292411
6 "Albania" Europe 2013 6616186. 12542465022
7 "Algeria" Africa 2009 76216524. 155605000000
8 "Algeria" Africa 2013 90957355. 176325000000
9 "Angola" Africa 2009 17607133. 79753199398
10 "Angola" Africa 2013 22957308. 96261432368
# ... with 296 more rows
另外,我们还可以用 rename() 和分步 gather() 来取代 separate()
> tidyda <- da %>% rename(`2013`=EF2013, `2009`=EF2009) %>% gather(year, EF, `2009`:`2013`) %>% rename(`2013`=GDP2013, `2009`=GDP2009) %>% gather(GDPyear, GDP, `2009`:`2013`) %>% filter(GDPyear==year) %>% select(-GDPyear) %>% arrange(country)
> tidyda
# A tibble: 306 x 5
country continent year EF GDP
<chr> <chr> <chr> <dbl> <dbl>
1 "\"Korea" Asia 2009 269778293. 1027730000000
2 "\"Korea" Asia 2013 291752674. 1194430000000
3 "Afghanistan" Asia 2009 20868574. 14697331941
4 "Afghanistan" Asia 2013 27270736. 19731337261
5 "Albania" Europe 2009 6815357. 11500292411
6 "Albania" Europe 2013 6616186. 12542465022
7 "Algeria" Africa 2009 76216524. 155605000000
8 "Algeria" Africa 2013 90957355. 176325000000
9 "Angola" Africa 2009 17607133. 79753199398
10 "Angola" Africa 2013 22957308. 96261432368
# ... with 296 more rows
数据分析
数据是否“杂乱”是人为定义的,并不是只有 tidy dataset 才能用于数据分析,事实上,很多“杂乱”的数据是非常有用的,tidy dataset 应该是你的默认选项,而不一定是最佳选项
> whoda <- da %>%
+ transmute(
+ country,
+ EF_delta_p = 100*(EF2013/EF2009 - 1),
+ GDP_delta_p = 100*(GDP2013/GDP2009 - 1),
+ DD_rate = GDP_delta_p - EF_delta_p,
+ continent,
+ GDP2013
+ )
> whoda
# A tibble: 153 x 6
country EF_delta_p GDP_delta_p DD_rate continent GDP2013
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghani~ 30.7 34.3 3.57 Asia 1.97e10
2 Albania -2.92 9.06 12.0 Europe 1.25e10
3 Algeria 19.3 13.3 -6.02 Africa 1.76e11
4 Angola 30.4 20.7 -9.69 Africa 9.63e10
5 Argenti~ 33.0 18.6 -14.4 South Am~ 4.58e11
6 Armenia 5.97 18.5 12.5 Asia 1.07e10
7 Austral~ -5.38 10.9 16.3 Australi~ 1.24e12
8 Austria 4.54 5.70 1.16 Europe 4.05e11
9 Azerbai~ 16.5 13.4 -3.03 Asia 5.72e10
10 Banglad~ 17.5 26.9 9.45 Asia 1.39e11
# ... with 143 more rows
我们想看经济增长与自然环境的脱钩程度最大的 10 个国家
> whoda %>% arrange(desc(DD_rate)) %>% slice(1:10)
# A tibble: 10 x 6
country EF_delta_p GDP_delta_p DD_rate continent GDP2013
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Papua N~ -10.9 35.9 46.9 Australi~ 1.23e10
2 Ethiopia 12.1 50.3 38.3 Africa 4.00e10
3 Myanmar -3.44 34.7 38.2 Asia 6.09e10
4 Philipp~ -9.67 27.4 37.1 Asia 2.36e11
5 Sierra ~ 17.7 53.5 35.9 Africa 3.81e 9
6 Zimbabwe 9.18 44.0 34.8 Africa 1.22e10
7 Mongolia 22.7 56.4 33.7 Asia 1.06e10
8 Uzbekis~ 4.10 37.3 33.2 Asia 4.98e10
9 Turkmen~ 21.0 53.3 32.3 Asia 3.17e10
10 Singapo~ 0.755 32.8 32.1 Asia 2.72e11
绝对脱钩的国家
> whoda %>% filter(GDP_delta_p>0, EF_delta_p<0)
# A tibble: 36 x 6
country EF_delta_p GDP_delta_p DD_rate continent GDP2013
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 Albania -2.92 9.06 12.0 Europe 1.25e10
2 Austral~ -5.38 10.9 16.3 Australi~ 1.24e12
3 Barbados -21.3 1.24 22.6 North Am~ 4.49e 9
4 Belgium -1.19 4.61 5.80 Europe 4.93e11
5 Bosnia ~ -0.156 3.15 3.30 Europe 1.76e10
6 Bulgaria -13.2 2.88 16.0 Europe 5.20e10
7 Burundi -8.78 17.7 26.4 Africa 2.30e 9
8 Cuba -4.20 11.3 15.5 North Am~ 7.00e10
9 Czech R~ -4.31 3.01 7.32 Europe 2.08e11
10 Denmark -9.74 2.47 12.2 Europe 3.22e11
# ... with 26 more rows
筛选出每个大洲 2013 年 GDP 排名前三的国家
> whoda %>%
+ group_by(continent) %>%
+ arrange(desc(GDP2013)) %>%
+ slice(1:3) %>%
+ select(continent, country, GDP2013)
# A tibble: 18 x 3
# Groups: continent [6]
continent country GDP2013
<chr> <chr> <dbl>
1 Africa Nigeria 4.25e11
2 Africa South Africa 4.05e11
3 Africa Egypt 2.33e11
4 Asia China 7.77e12
5 Asia Japan 5.89e12
6 Asia India 1.99e12
7 Australia and Oceania Australia 1.24e12
8 Australia and Oceania New Zealand 1.58e11
9 Australia and Oceania Papua New Guinea 1.23e10
10 Europe Germany 3.58e12
11 Europe France 2.72e12
12 Europe United Kingdom 2.55e12
13 North America United States of America 1.58e13
14 North America Canada 1.73e12
15 North America Mexico 1.15e12
16 South America Brazil 2.41e12
17 South America Argentina 4.58e11
18 South America Venezuela 4.39e11
> ggplot(whoda, aes(GDP_delta_p, EF_delta_p)) +
+ geom_point(aes(size=GDP2013, color=continent)) +
+ geom_text(aes(label=country), size=2.5, hjust=-.2) +
+ geom_vline(xintercept=0, linetype='dashed', size=1, color='blue') +
+ geom_hline(yintercept=0, linetype='dashed', size=1, color='blue') +
+ guides(size=F) +
+ theme(legend.position=c(.1, .85), legend.background=element_blank(), legend.key=element_blank()) +
+ xlab('GDP Growth') +
+ ylab('EF Growth')
image.png
美国的 GDP 增长没有想象中那么快,因为总量已经很大;左上角代表 GDP 负增长、环境压力却加大了的国家,图中显示为百慕大群岛;欧洲大部分国家都集中在水平参考线附近及以下,说明欧洲的社会发展水平和经济结构已经不太依赖于自然资源……
> whoda %>% group_by(continent) %>% summarise(GDP=mean(GDP2013)) %>%
+ ggplot() +
+ geom_bar(aes(x=continent, y=GDP), stat='identity') +
+ coord_flip()
image.png
各大洲的国均 GDP,北美仍然是最大的,样本国中亚洲的平均 GDP 已经超过了欧洲。
网友评论