2022-07-08 pivot_longer and pivot_wi

Author: 学习生信的小兔子 | Published: 2022-07-08 21:38

Reference: 张敬信, R语言编程 (R Programming)

Wide to long

Each row holds only one observation

# I don't fully understand what "each row holds only one observation" means here
library(tidyverse)
getwd()
[1] "E:/tidy-R/introR-master"
df=read_csv("datas/分省年度GDP.csv")

df %>% 
  pivot_longer(-地区,names_to = "年份",values_to = "GDP")
# A tibble: 12 x 3
   地区     年份      GDP
   <chr>    <chr>   <dbl>
 1 北京市   2019年 35371.
 2 北京市   2018年 33106.
 3 北京市   2017年 28015.
 4 天津市   2019年 14104.
 5 天津市   2018年 13363.
 6 天津市   2017年 18549.
 7 河北省   2019年 35105.
 8 河北省   2018年 32495.
 9 河北省   2017年 34016.
10 黑龙江省 2019年 13613.
11 黑龙江省 2018年 12846.
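
The same reshaping can be sketched on made-up data, since the CSV read above is not reproduced in this post. The tibble `gdp_wide`, its column names, and its numbers are hypothetical. One way to read "each row holds only one observation" is that every pivoted column measures the same variable (GDP), so each wide row unfolds into several long rows, one per year. The sketch also uses `names_transform` (assuming tidyr 1.1.0 or later) to turn the year labels into integers.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per region, one column per year
gdp_wide <- tibble(
  region = c("A", "B"),
  `2018` = c(10, 20),
  `2019` = c(11, 22)
)

# Every column except `region` is a year holding a GDP value,
# so the column names go to "year" and the cell values go to "gdp"
gdp_long <- pivot_longer(
  gdp_wide,
  cols = -region,
  names_to = "year",
  values_to = "gdp",
  names_transform = list(year = as.integer)  # "2018" -> 2018L
)

gdp_long
```

Each of the two wide rows becomes two long rows, one per year column.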

Each row holds multiple observations

load("datas/family.rda")
family
# A tibble: 5 x 5
  family dob_child1 dob_child2 gender_child1
   <int> <date>     <date>             <int>
1      1 1998-11-26 2000-01-29             1
2      2 1996-06-22 NA                     2
3      3 2002-07-11 2004-04-05             2
4      4 2004-10-10 2009-08-27             1
5      5 2000-12-05 2005-02-28             2
# ... with 1 more variable: gender_child2 <int>

family %>% 
  pivot_longer(-family,
               names_to = c(".value","child"),
               names_sep = "_",
               values_drop_na = TRUE)
# A tibble: 9 x 4
  family child  dob        gender
   <int> <chr>  <date>      <int>
1      1 child1 1998-11-26      1
2      1 child2 2000-01-29      2
3      2 child1 1996-06-22      2
4      3 child1 2002-07-11      2
5      3 child2 2004-04-05      2
6      4 child1 2004-10-10      1
7      4 child2 2009-08-27      1
8      5 child1 2000-12-05      2
9      5 child2 2005-02-28      1
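
How the `".value"` sentinel works may be easier to see on a two-family subset. The sketch below rebuilds the first two rows by hand (the tibble `fam` is an illustrative stand-in, not the contents of `family.rda`). Each column name is split at `_`: the part before it (`dob`, `gender`) becomes an output column because it lines up with `".value"`, and the part after it goes into `child`.

```r
library(tidyr)
library(tibble)

# Illustrative stand-in for the first two families
fam <- tibble(
  family = 1:2,
  dob_child1 = as.Date(c("1998-11-26", "1996-06-22")),
  dob_child2 = as.Date(c("2000-01-29", NA)),
  gender_child1 = c(1L, 2L),
  gender_child2 = c(2L, NA)
)

fam_long <- pivot_longer(
  fam,
  cols = -family,
  names_to = c(".value", "child"),  # "dob_child1" -> value column "dob", child "child1"
  names_sep = "_",
  values_drop_na = TRUE             # drops family 2 / child2, where every value is NA
)

fam_long
```

Family 2 has no second child, so `values_drop_na = TRUE` leaves three rows rather than four.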

Long to wide

One names column and one values column

load("datas/animals.rda")
animals
# A tibble: 228 x 3
   Type    Year  Heads
   <chr>  <dbl>  <dbl>
 1 Sheep   2015 24943.
 2 Cattle  1972  2189.
 3 Camel   1985   559 
 4 Camel   1995   368.
 5 Camel   1997   355.
 6 Goat    1977  4411.
 7 Cattle  1979  2477.
 8 Cattle  2014  3414.
 9 Cattle  1996  3476.
10 Cattle  2017  4388.
# ... with 218 more rows
animals %>% 
  pivot_wider(
    names_from = Type,
    values_from = Heads,
    values_fill = 0
  )
# A tibble: 48 x 6
    Year  Sheep Cattle Camel   Goat Horse
   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
 1  2015 24943.  3780.  368. 23593. 3295.
 2  1972 13716.  2189.  625.  4338. 2239.
 3  1985 13249.  2408.  559   4299. 1971 
 4  1995     0   3317.  368.  8521. 2684.
 5  1997 14166.  3613.  355. 10265. 2893.
 6  1977 13430.  2388.  609   4411. 2104.
 7  1979 14400.  2477.  614.  4715. 2079.
 8  2014 23215.  3414.  349. 22009.    0 
 9  1996 13561.  3476.  358.  9135. 2770.
10  2017 30110.  4388.  434. 27347. 3940.
# ... with 38 more rows
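
The zeros in the output above come from `values_fill = 0`, which fills in the year/type combinations that have no row in the long data. A minimal sketch on made-up data (the tibble `herd` and its numbers are hypothetical):

```r
library(tidyr)
library(tibble)

# Hypothetical long data: there is no Cattle row for 2016
herd <- tibble(
  Year  = c(2015, 2015, 2016),
  Type  = c("Sheep", "Cattle", "Sheep"),
  Heads = c(10, 5, 12)
)

herd_wide <- pivot_wider(
  herd,
  names_from = Type,    # one new column per distinct Type
  values_from = Heads,  # cells are filled from Heads
  values_fill = 0       # missing combinations become 0 instead of NA
)

herd_wide
```

Without `values_fill`, the missing 2016 Cattle cell would be `NA`.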

Multiple names columns or multiple values columns

us_rent_income
# the us_rent_income dataset has two value columns
# A tibble: 104 x 5
   GEOID NAME       variable estimate   moe
   <chr> <chr>      <chr>       <dbl> <dbl>
 1 01    Alabama    income      24476   136
 2 01    Alabama    rent          747     3
 3 02    Alaska     income      32940   508
 4 02    Alaska     rent         1200    13
 5 04    Arizona    income      27517   148
 6 04    Arizona    rent          972     4
 7 05    Arkansas   income      23789   165
 8 05    Arkansas   rent          709     5
 9 06    California income      29454   109
10 06    California rent         1358     3
# ... with 94 more rows


us_rent_income %>% 
  pivot_wider(
    names_from = variable,
    values_from = c(estimate,moe)
  )
# A tibble: 52 x 6
   GEOID NAME         estimate_income estimate_rent
   <chr> <chr>                  <dbl>         <dbl>
 1 01    Alabama                24476           747
 2 02    Alaska                 32940          1200
 3 04    Arizona                27517           972
 4 05    Arkansas               23789           709
 5 06    California             29454          1358
 6 08    Colorado               32401          1125
 7 09    Connecticut            35326          1123
 8 10    Delaware               31560          1076
 9 11    District of~           43198          1424
10 12    Florida                25952          1077
# ... with 42 more rows, and 2 more variables:
#   moe_income <dbl>, moe_rent <dbl>
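
With several value columns, `pivot_wider()` by default names each new column as the value column, then `_`, then the entry from the names column. The sketch below shows this on a small made-up table (`rent` and its numbers are hypothetical, shaped like `us_rent_income`).

```r
library(tidyr)
library(tibble)

# Hypothetical data with two value columns: estimate and moe
rent <- tibble(
  NAME     = c("X", "X", "Y", "Y"),
  variable = c("income", "rent", "income", "rent"),
  estimate = c(100, 10, 200, 20),
  moe      = c(1, 2, 3, 4)
)

rent_wide <- pivot_wider(
  rent,
  names_from = variable,
  values_from = c(estimate, moe)  # both columns are spread across `variable`
)

names(rent_wide)
```

The result has one row per `NAME` and four value columns, one for each pairing of a value column with a `variable` level.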

A problem you may run into when going from long to wide

df=tibble(
  x=1:6,
  y=rep(c("A","B","C"),each=2),
  z=c(2.13,3.65,1.88,2.30,6.55,4.21)
)
df
# A tibble: 6 x 3
      x y         z
  <int> <chr> <dbl>
1     1 A      2.13
2     2 A      3.65
3     3 B      1.88
4     4 B      2.3 
5     5 C      6.55
6     6 C      4.21

df %>% 
  pivot_wider(
    names_from = y,
    values_from = z
  )
# A tibble: 6 x 4
      x     A     B     C
  <int> <dbl> <dbl> <dbl>
1     1  2.13 NA    NA   
2     2  3.65 NA    NA   
3     3 NA     1.88 NA   
4     4 NA     2.3  NA   
5     5 NA    NA     6.55
6     6 NA    NA     4.21

df=df %>% 
  group_by(y) %>% 
  mutate(n=row_number()) %>% 
  select(-x)
df
# A tibble: 6 x 3
# Groups:   y [3]
  y         z     n
  <chr> <dbl> <int>
1 A      2.13     1
2 A      3.65     2
3 B      1.88     1
4 B      2.3      2
5 C      6.55     1
6 C      4.21     2
df %>% 
  pivot_wider(names_from = y,
              values_from = z)
# A tibble: 2 x 4
      n     A     B     C
  <int> <dbl> <dbl> <dbl>
1     1  2.13  1.88  6.55
2     2  3.65  2.3   4.21
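Looking back, this is what it means for values to be uniquely identified within each group. Group A, for example, has two values, 2.13 and 3.65. They are given the unique identifiers n = 1 and n = 2 (any two distinct values would do just as well as 1 and 2), so pivot_wider() knows which value becomes the first sample (row) and which becomes the second. Without such an identifier the two values cannot be told apart and can only be packed together into a list, which gives the wrong result along with a warning.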
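
The two situations can be compared side by side in a short sketch. It assumes the same `y` and `z` values as above with the `x` column left out, and the names `dup`, `bad` and `good` are made up for illustration. With nothing to tell the duplicated values apart, `pivot_wider()` warns and returns list-columns. Adding a within-group row number gives each value its own row.

```r
library(dplyr)
library(tidyr)
library(tibble)

dup <- tibble(
  y = rep(c("A", "B", "C"), each = 2),
  z = c(2.13, 3.65, 1.88, 2.30, 6.55, 4.21)
)

# No identifying column: each y level has two z values competing for one cell.
# pivot_wider() warns and packs them into list-columns
# (remove suppressWarnings() to see the warning).
bad <- suppressWarnings(
  pivot_wider(dup, names_from = y, values_from = z)
)

# A within-group row number makes every value uniquely identified
good <- dup %>%
  group_by(y) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = y, values_from = z)

bad
good
```

`bad` is a single row whose cells each hold a length-2 numeric vector, while `good` is the expected 2 x 4 table.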