美文网首页
46.关于Mutating joins类函数和merge函数

46.关于Mutating joins类函数和merge函数

作者: 心惊梦醒 | 来源:发表于2021-08-29 00:01 被阅读0次

    【上一篇:45.关于关系型数据的开篇】
    【下一篇:47.关于Filtering joins类函数】

        上篇说了两种关系型数据中的需求,处理关系型数据有三大类函数:Mutating joins、Filtering joins、Set Operations。
        Mutating joins类函数包括inner_join()、left_join()、right_join()、full_join()。inner_join()函数术语inner join,另外三个属于outer join。另外base R中的merge()函数也能通过设置不同参数分别实现这四种功能。Mutating joins类函数的返回结果是为original table增加新的列,即variables。
        Filtering joins类函数包括semi_join()、anti_join(),返回的结果是过滤后的observations。
        Set Operations类函数(集合操作类函数)包括intersect()、union()、setdiff(),用来比较两个数据框,返回两个数据框行的交集、并集和差集。


        Mutating joins类的四个函数的Usage是一样的:


    Mutating joins类函数
    x和y:数据框
    by:向量,表示根据两个表格中的哪些列match
    copy:貌似不懂的样子!
    suffix:包含两个值的向量,用于合并后区分相同的列名,下面再详细讲
    keep:逻辑值,是否来自x和y的join keys都输出,默认为FALSE
    

        inner_join()的返回结果是x和y中都有的行。
        left_join()的返回结果是x中所有的行。
        right_join()的返回结果是y中所有的行。
        full_join()的返回结果是x或y中的所有行。
        outer join类中match后缺失的值直接用NA填充。
        返回的结果用Veen图表示大概是这个样子的:

    四个函数返回结果
        by默认是NULL,表示根据x和y中的公共列进行match,这种情况下R同时会打印出公共列都是什么,便于检查;可以直接指定一个变量名,也可以指定一个向量,这两种情况的前提是x和y中都有这些变量(相同变量名)。如果x和y中你用来match的列名不同,可以设置by=c("a"="b"),即x$a=y$b列,多个不同列可以设置by=c("a1"="b1","a2"="b2","a3"="b3"),即x$a1=y$b1、x$a2=y$b2、x$a3=y$b3。
    test1<-tribble(
    ~name,~sex,~score,
    "ZhangS","male",90,
    "LiS","male",80,
    "WangJ","female",85,
    "MuQ","female",95,
    "NieP","male",100,
    "ZhaoY","female",93,
    "JiaW","female",98,
    )
    
    test2<-tribble(
    ~name,~Name,~address,~sex,
    "ZhangS","ZhangS","Beijing","male",
    "LiS","LiS","Beijing","male",
    "WangJ","WangJ","Tanjin","female",
    "MuQ","MuQ","Taijin","female",
    "NieP","NieP","Henan","male",
    "ZhaoY","ZhaoY","Henan","female",
    "Qiany","Qiany","Xinjiang","male",
    )
    
    # by=NULL,自动打印实际使用的by值,即by = c("name", "sex")
    > inner_join(test1,test2)
    Joining, by = c("name", "sex")
    # A tibble: 6 x 5
    name   sex    score Name   address
    <chr>  <chr>  <dbl> <chr>  <chr>  
    1 ZhangS male      90 ZhangS Beijing
    2 LiS    male      80 LiS    Beijing
    3 WangJ  female    85 WangJ  Tanjin 
    4 MuQ    female    95 MuQ    Taijin 
    5 NieP   male     100 NieP   Henan  
    6 ZhaoY  female    93 ZhaoY  Henan  
    
    # 直接指定两个数据框中的都有的列名
    > left_join(test1,test2,by="name")
    # A tibble: 7 x 6
    name   sex.x  score Name   address sex.y 
    <chr>  <chr>  <dbl> <chr>  <chr>   <chr> 
    1 ZhangS male      90 ZhangS Beijing male  
    2 LiS    male      80 LiS    Beijing male  
    3 WangJ  female    85 WangJ  Tanjin  female
    4 MuQ    female    95 MuQ    Taijin  female
    5 NieP   male     100 NieP   Henan   male  
    6 ZhaoY  female    93 ZhaoY  Henan   female
    7 JiaW   female    98 NA     NA      NA    
    
    # 用test1中的name和test2中的Name进行match
    > right_join(test1,test2,by=c("name"="Name"))
    # A tibble: 7 x 6
    name   sex.x  score name.y address  sex.y 
    <chr>  <chr>  <dbl> <chr>  <chr>    <chr> 
      1 ZhangS male      90 ZhangS Beijing  male  
    2 LiS    male      80 LiS    Beijing  male  
    3 WangJ  female    85 WangJ  Tanjin   female
    4 MuQ    female    95 MuQ    Taijin   female
    5 NieP   male     100 NieP   Henan    male  
    6 ZhaoY  female    93 ZhaoY  Henan    female
    7 Qiany  NA        NA Qiany  Xinjiang male 
    
    # 改变相同列名的后缀
    > full_join(test1,test2,by="name",suffix=c(".test1",".test2"))
    # A tibble: 8 x 6
    name   sex.test1 score Name   address  sex.test2
    <chr>  <chr>     <dbl> <chr>  <chr>    <chr>    
      1 ZhangS male         90 ZhangS Beijing  male     
    2 LiS    male         80 LiS    Beijing  male     
    3 WangJ  female       85 WangJ  Tanjin   female   
    4 MuQ    female       95 MuQ    Taijin   female   
    5 NieP   male        100 NieP   Henan    male     
    6 ZhaoY  female       93 ZhaoY  Henan    female   
    7 JiaW   female       98 NA     NA       NA       
    8 Qiany  NA           NA Qiany  Xinjiang male
    
    

        上面示例考虑的情况都是没有重复值的情况,实际中重复值出现很常见,那这四个函数又是怎么处理重复值的呢?
        对于重复值,四个函数返回的结果中会保留所有重复的行
        基于上面的举例进行说明:
        1) 如果test1中有两个ZhangS,test2不变,则:

    test3<-tribble(
      ~name,~sex,~score,
      "ZhangS","male",90,
      "ZhangS","male",98,
      "LiS","male",80,
      "WangJ","female",85,
      "MuQ","female",95,
      "NieP","male",100,
      "ZhaoY","female",93,
      "JiaW","female",98,
    )
    > inner_join(test3,test2)
    Joining, by = c("name", "sex")
    # A tibble: 7 x 5
    name   sex    score Name   address
    <chr>  <chr>  <dbl> <chr>  <chr>  
      1 ZhangS male      90 ZhangS Beijing
    2 ZhangS male      98 ZhangS Beijing
    3 LiS    male      80 LiS    Beijing
    4 WangJ  female    85 WangJ  Tanjin 
    5 MuQ    female    95 MuQ    Taijin 
    6 NieP   male     100 NieP   Henan  
    7 ZhaoY  female    93 ZhaoY  Henan
    > left_join(test3,test2)
    Joining, by = c("name", "sex")
    # A tibble: 8 x 5
    name   sex    score Name   address
    <chr>  <chr>  <dbl> <chr>  <chr>  
      1 ZhangS male      90 ZhangS Beijing
    2 ZhangS male      98 ZhangS Beijing
    3 LiS    male      80 LiS    Beijing
    4 WangJ  female    85 WangJ  Tanjin 
    5 MuQ    female    95 MuQ    Taijin 
    6 NieP   male     100 NieP   Henan  
    7 ZhaoY  female    93 ZhaoY  Henan  
    8 JiaW   female    98 NA     NA
    > right_join(test3,test2)
    Joining, by = c("name", "sex")
    # A tibble: 8 x 5
    name   sex    score Name   address 
    <chr>  <chr>  <dbl> <chr>  <chr>   
      1 ZhangS male      90 ZhangS Beijing 
    2 ZhangS male      98 ZhangS Beijing 
    3 LiS    male      80 LiS    Beijing 
    4 WangJ  female    85 WangJ  Tanjin  
    5 MuQ    female    95 MuQ    Taijin  
    6 NieP   male     100 NieP   Henan   
    7 ZhaoY  female    93 ZhaoY  Henan   
    8 Qiany  male      NA Qiany  Xinjiang
    > full_join(test3,test2)
    Joining, by = c("name", "sex")
    # A tibble: 9 x 5
    name   sex    score Name   address 
    <chr>  <chr>  <dbl> <chr>  <chr>   
      1 ZhangS male      90 ZhangS Beijing 
    2 ZhangS male      98 ZhangS Beijing 
    3 LiS    male      80 LiS    Beijing 
    4 WangJ  female    85 WangJ  Tanjin  
    5 MuQ    female    95 MuQ    Taijin  
    6 NieP   male     100 NieP   Henan   
    7 ZhaoY  female    93 ZhaoY  Henan   
    8 JiaW   female    98 NA     NA      
    9 Qiany  male      NA Qiany  Xinjiang
    

        2) 如果test1不变,test2中有两个LiS,则:

    test4<-tribble(
      ~name,~Name,~address,~sex,
      "ZhangS","ZhangS","Beijing","male",
      "LiS","LiS","Beijing","male",
      "LiS","LiS","Beijing_new","male",
      "WangJ","WangJ","Tanjin","female",
      "MuQ","MuQ","Taijin","female",
      "NieP","NieP","Henan","male",
      "ZhaoY","ZhaoY","Henan","female",
      "Qiany","Qiany","Xinjiang","male",
    )
    > inner_join(test1,test4)
    Joining, by = c("name", "sex")
    # A tibble: 7 x 5
      name   sex    score Name   address    
      <chr>  <chr>  <dbl> <chr>  <chr>      
    1 ZhangS male      90 ZhangS Beijing    
    2 LiS    male      80 LiS    Beijing    
    3 LiS    male      80 LiS    Beijing_new
    4 WangJ  female    85 WangJ  Tanjin     
    5 MuQ    female    95 MuQ    Taijin     
    6 NieP   male     100 NieP   Henan      
    7 ZhaoY  female    93 ZhaoY  Henan 
    其他的可以自己尝试
    

        3) 如果test1和test2中均有重复值,则:

    > inner_join(test3,test4)
    Joining, by = c("name", "sex")
    # A tibble: 8 x 5
      name   sex    score Name   address    
      <chr>  <chr>  <dbl> <chr>  <chr>      
    1 ZhangS male      90 ZhangS Beijing    
    2 ZhangS male      98 ZhangS Beijing    
    3 LiS    male      80 LiS    Beijing    
    4 LiS    male      80 LiS    Beijing_new
    5 WangJ  female    85 WangJ  Tanjin     
    6 MuQ    female    95 MuQ    Taijin     
    7 NieP   male     100 NieP   Henan      
    8 ZhaoY  female    93 ZhaoY  Henan
    
    可以看出来test1中的两行ZhangS和test2中的两行LiS分别组合了,生成了2x2=4行
    

        关于输出:这四个函数返回的都是一个与x类型相同的对象,尽可能地保留x中行和列的顺序。输出结果中包含x和y中的所有列,当然by指定的列是合并的,除by指定的列之外其他名字相同的列通过加后缀加以区分。


        dplyr包中链接函数和merge函数的比较:


    dplyr和merge函数的比较
    merge(x, y, by = intersect(names(x), names(y)),
          by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
          sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
          incomparables = NULL, ...)
    
    关于by的使用:
    如果指定by,则不用指定by.x和by.y了
    

    【上一篇:45.关于关系型数据的开篇】
    【下一篇:47.关于Filtering joins类函数】

    相关文章

      网友评论

          本文标题:46.关于Mutating joins类函数和merge函数

          本文链接:https://www.haomeiwen.com/subject/avcyiltx.html