44. Tidying tidyr::who with the tidy data tools learned so far

Author: 心惊梦醒 | Published: 2021-08-27 00:04

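Earlier posts used the example tables table1 to table5 in the tidyr package to show different representations of the same underlying data, and covered pivot_longer(), pivot_wider(), separate(), and unite() for reshaping data into the tidy form we want. They also covered missing values, both explicit and implicit.
The who dataset in the tidyr package is good practice material. It contains World Health Organization data on tuberculosis (TB), and this post tries to turn it into tidy data.
When you meet complicated, non-tidy data, don't be intimidated. Drop the idea that a single function call should solve the problem; chaining several functions is perfectly normal.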

If you don't know what each column of a dataset means, start with the help page: ?who
> who
# A tibble: 7,240 x 60
   country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
   <chr>       <chr> <chr> <int>       <int>        <int>        <int>
 1 Afghanistan AF    AFG    1980          NA           NA           NA
 2 Afghanistan AF    AFG    1981          NA           NA           NA
 3 Afghanistan AF    AFG    1982          NA           NA           NA
 4 Afghanistan AF    AFG    1983          NA           NA           NA
 5 Afghanistan AF    AFG    1984          NA           NA           NA
 6 Afghanistan AF    AFG    1985          NA           NA           NA
 7 Afghanistan AF    AFG    1986          NA           NA           NA
 8 Afghanistan AF    AFG    1987          NA           NA           NA
 9 Afghanistan AF    AFG    1988          NA           NA           NA
10 Afghanistan AF    AFG    1989          NA           NA           NA
# ... with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
#   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
#   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
#   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
#   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
#   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
#   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
#   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
#   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
#   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
#   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
#   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
#   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
#   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
#   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
#   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
#   newrel_f65 <int>
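
The column list above is long, so a quick summary of the naming scheme can help. The following is a minimal sketch, assuming tidyr is installed; the variable names `case_cols`, `prefix`, and `prefix_counts` are only illustrative. It strips the sex/age suffix from each case column and tabulates what is left.

```r
library(tidyr)

cols <- names(who)
case_cols <- cols[-(1:4)]   # drop country, iso2, iso3, year

# Remove the trailing "_m014", "_f65", ... to keep only the prefix
prefix <- sub("_?[mf][0-9]+$", "", case_cols)

prefix_counts <- table(prefix)
print(prefix_counts)
```

Each of the four prefixes covers 14 columns (2 sexes by 7 age groups), and the table makes it easy to spot that newrel is written without an underscore while the other three are new_sp, new_sn, and new_ep.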

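iso2 and iso3 are the two-letter and three-letter ISO country codes. They are redundant with the country column and can be dropped.
The columns new_sp_m014 through newrel_f65 hold the number of new TB cases for each country and year.
In the column names, the new prefix marks new cases. The next part is the case type: rel means relapse cases; sp means smear-positive pulmonary TB;
sn means smear-negative pulmonary TB (cases that a smear test could not diagnose);
ep means extrapulmonary TB;
f and m are the sexes, female and male;
the digits are the age group: 014 is 0-14 years, 1524 is 15-24 years, and so on, with 65 meaning 65 and older.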
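
The code book above can be written down as lookup tables. This is only a sketch: the label wording and the helper `decode()` are my own and are not part of tidyr; the age codes are taken from the column names printed earlier.

```r
# Lookup tables built from the code book (label wording is illustrative)
type_labels <- c(sp = "smear positive", sn = "smear negative",
                 ep = "extrapulmonary", rel = "relapse")
sex_labels  <- c(m = "male", f = "female")
age_labels  <- c("014" = "0-14", "1524" = "15-24", "2534" = "25-34",
                 "3544" = "35-44", "4554" = "45-54", "5564" = "55-64",
                 "65" = "65+")

# Hypothetical helper: turn the three codes into a readable description
decode <- function(type, sex, age) {
  paste(type_labels[type], sex_labels[sex], age_labels[age], sep = ", ")
}

decode("sp", "m", "014")
decode(c("rel", "ep"), c("f", "m"), c("65", "1524"))
```

Once the column names have been split into type, sex, and age columns, the same named vectors can be used to relabel those columns.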

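Tidying who, step by step:
step1: use pivot_longer() to gather the columns new_sp_m014 through newrel_f65, moving the column names into a variable key and the counts into cases
step2: the new* names do not all follow the same pattern (newrel has no underscore after new), so adjust them with str_replace()
step3: split the new* names into new columns
step4: drop the redundant and uninformative columns
step5: split sex and age group apart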
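
To see what steps 2, 3, and 5 do to the names, here is a small sketch on two made-up rows. The tibble `toy` and its case counts are invented for illustration; only the key strings come from who.

```r
library(tidyr)
library(dplyr)

# Two invented rows, as if step 1 (pivot_longer) had already run
toy <- tibble(
  key   = c("new_sp_m014", "newrel_f65"),
  cases = c(3L, 5L)
)

# step 2: give newrel the same new_<type>_<sexage> shape as the others.
# Without this, "newrel_f65" would split into only two pieces in step 3,
# and separate() would warn and fill the missing piece with NA.
step2 <- toy %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))

# step 3: by default separate() splits on non-alphanumeric characters,
# which here means the underscores
step3 <- step2 %>%
  separate(key, c("new", "var", "sexage"))

# step 5: an integer sep is a position, so sep = 1 splits after the
# first character (the sex letter)
step5 <- step3 %>%
  select(-new) %>%
  separate(sexage, c("sex", "age"), sep = 1)

step5
```

Note that age stays a character column ("014", "65"), since separate() does not convert types unless asked to.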

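library(tidyr)
library(dplyr)

who %>%
  # step1: wide to long; column names go to key, counts go to cases
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  # step2: make newrel follow the new_<type>_<sexage> pattern
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  # step3: split key on the underscores
  separate(key, c("new", "var", "sexage")) %>%
  # step4: new is constant; iso2 and iso3 duplicate country
  select(-new, -iso2, -iso3) %>%
  # step5: split after the first character into sex and age
  separate(sexage, c("sex", "age"), sep = 1)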
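
For comparison, once the naming scheme is understood, much of the splitting can be pushed into pivot_longer() itself through its `names_pattern` argument. This is a sketch of an alternative, not the approach used in this post; the name `who_tidy` is only illustrative, and the regular expression assumes every case column matches `new`, an optional underscore, the type, an underscore, one sex letter, and the age code.

```r
library(tidyr)
library(dplyr)

who_tidy <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    # one capture group per new column: type, sex, age
    names_to = c("var", "sex", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  select(-iso2, -iso3)

who_tidy
```

The multi-step version is easier to debug because each intermediate result can be inspected; the single-call version is shorter but depends on getting the pattern right in one go.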

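In the pipeline above, pivot_longer() drops the missing values (values_drop_na = TRUE). That is convenient for checking the result, but it may be a problem in a real analysis: an NA means the count is unknown, which is not the same as zero cases.
Within the rows that are present, every missing value in who is explicit (an NA). Not every country has a row for every year, though, so country-year combinations that are absent altogether count as implicit missing values.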
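
Both points about missing values can be checked directly. The following is a minimal sketch, assuming tidyr and dplyr are installed; the names `n_zero`, `n_present`, and `n_possible` are illustrative, and no counts are quoted here, so run it to see the actual numbers.

```r
library(tidyr)
library(dplyr)

# 1. Are zeros recorded explicitly? If so, NA cannot simply mean "no cases".
n_zero <- who %>%
  pivot_longer(new_sp_m014:newrel_f65, names_to = "key", values_to = "cases") %>%
  filter(cases == 0) %>%   # filter() also discards rows where cases is NA
  nrow()

# 2. Are any country-year combinations missing as whole rows?
n_present  <- nrow(who)
n_possible <- nrow(complete(who, country, year))  # every country x every year

n_zero
n_possible - n_present   # combinations with no row at all
```

A positive difference in the last line means some country-year pairs have no row in who, which is the implicit kind of missingness that complete() makes explicit.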

[Previous: 43. About missing values]
[Next: 45. An introduction to relational data]