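The typical data frames and tibbles used for plotting in earlier posts were all tidy data. Organizing every other kind of data into tidy data gives you one consistent format, and the tidy tools in packages such as tidyr, dplyr, and ggplot2 (all core tidyverse packages) then make analysis straightforward. Placing each variable in its own column has a further advantage: it lets R's vectorized nature do the work, because many of R's built-in functions, as well as dplyr verbs such as mutate() and summarise(), operate on whole vectors of values.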
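Tidy data follows three rules: each variable must have its own column; each observation must have its own row; each value must have its own cell.
[Figure: the rules of tidy data]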
For example, of the following datasets shipped with the tidyr package, only table1 is tidy.
library(tidyverse)
> table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
> table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
> table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
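The "each observation must have its own row" rule can be checked mechanically. The sketch below assumes that an observation here means one country in one year (that reading is an assumption, as are the names dup_table1 and dup_table2) and counts how many rows each country-year pair occupies.

```r
library(dplyr)
library(tidyr)  # provides the example datasets table1 and table2

# Assumption: one observation = one (country, year) pair.
# In tidy data every such pair should appear in exactly one row.
dup_table1 <- table1 %>% count(country, year) %>% filter(n > 1)
dup_table2 <- table2 %>% count(country, year) %>% filter(n > 1)

nrow(dup_table1)  # pairs that occupy more than one row in table1
nrow(dup_table2)  # pairs that occupy more than one row in table2
```

table1 has no repeated pairs, while every pair in table2 is spread over two rows (one for cases, one for population), which is why table2 is not tidy.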
Below, rate (cases per 10,000 people) is computed three times: from table1, from table2, and from table4a combined with table4b.
> table1 %>% mutate(rate=cases/population *10000)
# A tibble: 6 x 5
country year cases population rate
<chr> <int> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
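The table1 calculation is short because each variable is already a column, so the arithmetic is an ordinary vectorized operation. As a minimal sketch (rate_base and rate_dplyr are names introduced only for this illustration), the same numbers come out of plain base R arithmetic on the two columns:

```r
library(dplyr)
library(tidyr)  # provides the example dataset table1

# Plain vectorized arithmetic on two columns: one result per row, no loop.
rate_base <- table1$cases / table1$population * 10000

# mutate() evaluates the same expression on the same column vectors.
rate_dplyr <- table1 %>%
  mutate(rate = cases / population * 10000) %>%
  pull(rate)

# Compare the two results element by element.
all.equal(rate_base, rate_dplyr)
```

The comparison should return TRUE, since both versions divide the same vectors element by element.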
----------------------------------------------------------------------------
> (cases <- table2 %>% filter(type=="cases") %>% select(c("country","year","count")))
# A tibble: 6 x 3
country year count
<chr> <int> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
> (population <- table2 %>% filter(type=="population") %>% select(c("country","year","count")))
# A tibble: 6 x 3
country year count
<chr> <int> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
> (merge_data<-merge(cases,population,by=c("country","year")))
country year count.x count.y
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
> colnames(merge_data) <-c("country","year","cases","population")
> merge_data %>% mutate(rate=cases/population*10000)
country year cases population rate
1 Afghanistan 1999 745 19987071 0.372741
2 Afghanistan 2000 2666 20595360 1.294466
3 Brazil 1999 37737 172006362 2.193930
4 Brazil 2000 80488 174504898 4.612363
5 China 1999 212258 1272915272 1.667495
6 China 2000 213766 1280428583 1.669488
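The filter, merge, and rename steps above can probably be shortened by reshaping table2 first. This is only a sketch, not the method used in this post: it assumes tidyr 1.0.0 or later, where pivot_wider() is available, and wide2 is a name made up for the example.

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_wider()

# Spread the `type` column into separate `cases` and `population` columns,
# taking the values from `count`; the result has the same shape as table1.
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = cases / population * 10000)

wide2
```

Unlike merge(), this keeps the result as a tibble and needs no manual renaming of count.x and count.y.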
----------------------------------------------------------------------------
> (merge_data1<-merge(table4a,table4b,by=c("country")))
country 1999.x 2000.x 1999.y 2000.y
1 Afghanistan 745 2666 19987071 20595360
2 Brazil 37737 80488 172006362 174504898
3 China 212258 213766 1272915272 1280428583
> colnames(merge_data1)<-c("country","1999_cases","2000_cases","1999_population","2000_population")
> merge_data1 %>% mutate(rate_1999 = `1999_cases`/`1999_population` *10000,rate_2000 = `2000_cases`/`2000_population`*10000)
country 1999_cases 2000_cases 1999_population 2000_population rate_1999
1 Afghanistan 745 2666 19987071 20595360 0.372741
2 Brazil 37737 80488 172006362 174504898 2.193930
3 China 212258 213766 1272915272 1280428583 1.667495
rate_2000
1 1.294466
2 4.612363
3 1.669488
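The merged table4a and table4b result is still wide, with one rate column per year. A possible alternative, sketched here under the assumption that tidyr 1.0.0 or later is installed, is to stack the year columns first and join afterwards. The same block also handles table3, which the walkthrough above skips, by splitting its rate cell in two. The names tidy4a, tidy4b, tidy4, and tidy3 are hypothetical.

```r
library(dplyr)
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_longer()

# Stack the `1999` and `2000` columns of each table into a `year` column.
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Join on country and year, then compute the rate as with table1.
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year")) %>%
  mutate(year = as.integer(year),
         rate = cases / population * 10000)

# table3 stores two values in one cell ("cases/population"); split them
# into two columns and convert the pieces from text to numbers.
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
  mutate(rate = cases / population * 10000)
```

Both results have one row per country and year, so the rate is a single column instead of rate_1999 and rate_2000.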