40. An Introduction to Tidy Data

Author: 心惊梦醒 | Published: 2021-08-23 21:46

[Previous: 38. Reading Other Types of Data]
[Next: 41. Tidying Data with Pivoting]

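    The typical data frames and tibbles used for plotting in earlier posts were all tidy data. Once data from other sources is reshaped into tidy data, it shares one consistent format, and the tidy tools in packages such as tidyr, dplyr, and ggplot2 (all core tidyverse packages) make it easy to analyze. A further benefit of placing each variable in its own column is that it lets R's vectorized nature do the work: most built-in R functions operate on vectors, and dplyr verbs such as mutate() and summarise() rely on this by applying those functions to whole columns.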
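
    As a minimal sketch of the vectorization point, the base R snippet below stores two variables as columns and computes a rate for every row with a single expression. The data frame name df is hypothetical; the numbers are the two Afghanistan rows of tidyr's table1 dataset.

```r
# Hypothetical data frame: the two Afghanistan rows of tidyr's table1
df <- data.frame(
  year       = c(1999L, 2000L),
  cases      = c(745L, 2666L),
  population = c(19987071L, 20595360L)
)

# Because each variable is its own column (a vector), one arithmetic
# expression computes cases per 10,000 people for all rows at once
df$rate <- df$cases / df$population * 10000

round(df$rate, 3)
# [1] 0.373 1.294
```
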
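    Tidy data follows three rules: each variable must have its own column; each observation must have its own row; each value must have its own cell. The figure illustrates them:

[Figure: the rules of tidy data]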

    For example, of the following datasets from the tidyr package, only table1 is tidy.

> library(tidyverse)
> table1
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
> table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
> table3
# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
> table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
> table4b
# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
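
    One rough way to see the "each observation has its own row" rule in these tables is to count how many rows share the same country and year. This is only a sketch, assuming that a country-year pair identifies one observation; the names dup1 and dup2 are hypothetical.

```r
library(dplyr)
library(tidyr)  # provides the table1 and table2 datasets

# table1: every (country, year) observation occupies exactly one row,
# so no combination appears more than once
dup1 <- table1 %>% count(country, year) %>% filter(n > 1)
nrow(dup1)
# [1] 0

# table2: each observation is spread over two rows (cases and population),
# so all six combinations appear twice
dup2 <- table2 %>% count(country, year) %>% filter(n > 1)
nrow(dup2)
# [1] 6
```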

    The rate (cases per 10,000 people) can be computed from table1, from table2, and from table4a combined with table4b, as follows:

> table1 %>% mutate(rate=cases/population *10000) 
# A tibble: 6 x 5
  country      year  cases population  rate
  <chr>       <int>  <int>      <int> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 
----------------------------------------------------------------------------
> (cases <- table2 %>% filter(type=="cases") %>% select(c("country","year","count")))
# A tibble: 6 x 3
  country      year  count
  <chr>       <int>  <int>
1 Afghanistan  1999    745
2 Afghanistan  2000   2666
3 Brazil       1999  37737
4 Brazil       2000  80488
5 China        1999 212258
6 China        2000 213766
> (population <- table2 %>% filter(type=="population") %>% select(c("country","year","count")))
# A tibble: 6 x 3
  country      year      count
  <chr>       <int>      <int>
1 Afghanistan  1999   19987071
2 Afghanistan  2000   20595360
3 Brazil       1999  172006362
4 Brazil       2000  174504898
5 China        1999 1272915272
6 China        2000 1280428583
> (merge_data<-merge(cases,population,by=c("country","year")))
      country year count.x    count.y
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3      Brazil 1999   37737  172006362
4      Brazil 2000   80488  174504898
5       China 1999  212258 1272915272
6       China 2000  213766 1280428583
> colnames(merge_data) <-c("country","year","cases","population")
> merge_data %>% mutate(rate=cases/population*10000)
      country year  cases population     rate
1 Afghanistan 1999    745   19987071 0.372741
2 Afghanistan 2000   2666   20595360 1.294466
3      Brazil 1999  37737  172006362 2.193930
4      Brazil 2000  80488  174504898 4.612363
5       China 1999 212258 1272915272 1.667495
6       China 2000 213766 1280428583 1.669488
----------------------------------------------------------------------------
> (merge_data1<-merge(table4a,table4b,by=c("country")))
      country 1999.x 2000.x     1999.y     2000.y
1 Afghanistan    745   2666   19987071   20595360
2      Brazil  37737  80488  172006362  174504898
3       China 212258 213766 1272915272 1280428583
> colnames(merge_data1)<-c("country","1999_cases","2000_cases","1999_population","2000_population")
> merge_data1 %>% mutate(rate_1999 = `1999_cases`/`1999_population` *10000,rate_2000 = `2000_cases`/`2000_population`*10000)
      country 1999_cases 2000_cases 1999_population 2000_population rate_1999
1 Afghanistan        745       2666        19987071        20595360  0.372741
2      Brazil      37737      80488       172006362       174504898  2.193930
3       China     212258     213766      1272915272      1280428583  1.667495
  rate_2000
1  1.294466
2  4.612363
3  1.669488
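
    The calculations above do not cover table3, where cases and population share a single cell. One possible sketch, assuming tidyr's separate() is an acceptable way to split that column (the name table3_rate is hypothetical), is:

```r
library(dplyr)
library(tidyr)  # provides table3 and separate()

table3_rate <- table3 %>%
  # Split "745/19987071" into two columns at the slash;
  # convert = TRUE turns the resulting strings into integers
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
  # Same formula as for table1: cases per 10,000 people
  mutate(rate = cases / population * 10000)

round(table3_rate$rate, 3)
# [1] 0.373 1.294 2.194 4.612 1.667 1.669
```

    After the split, the data has the same shape as table1, so the same one-line mutate() applies.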
