美文网首页R for Data Science
[R语言] Join 连接《R for data science

[R语言] Join 连接《R for data science

作者: 半为花间酒 | 来源:发表于2020-04-22 09:09 被阅读0次

    《R for Data Science》第十三章 Relational data 啃书知识点积累
    参考链接:

    1. R for Data Science
    2. R语言中dplyr包join函数之目前我看到过的最形象的教程

    Introduction

    本章节的应用前提需要了解nycflights13各数据集的内容

    library(nycflights13)
    
    ?flights # dplyr用的很熟悉就不查了
    ?airlines
    ?airports
    ?planes
    ?weather
    

    画模式图

    涉及了datamodelrDiagrammeR

    library(datamodelr,DiagrammeR)
    
    dm <- dm_from_data_frames(list(
      #  指定数据集来源
      Master = Lahman::Master,
      Managers = Lahman::Managers,
      AwardsManagers = Lahman::AwardsManagers
    )) %>%
      # 设置主键
      dm_set_key("Master", "playerID") %>%
      dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
      dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
      # 设置外键
      dm_add_references(
        Managers$playerID == Master$playerID,
        AwardsManagers$playerID == Master$playerID
      )
    
    # 绘图
    # rankdir可以指定'BT', 'RL'等
    # columnArrows默认为F,只是数据集间粗略箭头
    dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
      dm_render_graph()
    

    There are three families of verbs designed to work with relational data:

    • Mutating joins, which add new variables to one data frame from matching observations in another.
    • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
    • Set operations, which treat observations as if they were set elements.

    Keys

    There are two types of keys:

    • A primary key uniquely identifies an observation in its own table.
    • A foreign key uniquely identifies an observation in another table.
    • 确认是不是主键:分组后各组仅一个观测
    planes %>% 
      count(tailnum) %>% 
      filter(n > 1) %>% 
      nrow()
    # [1] 0
    
    • 代理键的概念:

    If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

    flights %>% 
      mutate(id = row_number()) %>% 
      select(id,everything())
    

    Mutating joins

    flights %>% 
      select(year:day, hour,tailnum, carrier) %>% 
      left_join(airlines, by = "carrier")
    
    
    # 也可以利用mutate和match完成left_join相同操作,但麻烦很多
    flights %>% 
      select(year:day, hour,tailnum, carrier) %>% 
      mutate(name = airlines$name[match(carrier, airlines$carrier)])
    

    - inner_join

    The most important property of an inner join is that unmatched rows are not included in the result.

    - Outer joins

    1. A left join keeps all observations in x.
    2. A right join keeps all observations in y.
    3. A full join keeps all observations in x and y.

    The left join should be your default join

    • base::merge()有类似join的功能

    - Defining the key columns

    1. The default, by = NULL, uses all variables that appear in both tables, the so called natural join
    2. A character vector,by = "x". This is like a natural join, but uses only some of the common variables.
    flights %>% 
      left_join(planes, by = "tailnum")
    #> # A tibble: 336,776 x 16
    #>   year.x month   day  hour origin dest  tailnum carrier year.y type 
    #>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
    #> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
    #> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
    #> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
    #> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
    #> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
    #> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
    #> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
    #> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
    
    1. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
    flights %>% 
      left_join(airports, c("origin" = "faa"))
    #> # A tibble: 336,776 x 15
    #>    year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
    #>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
    #> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
    #> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
    #> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
    #> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
    #> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
    #> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
    #> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
    #> #   tzone <chr>
    

    - Q: 画出航班的大致起落

    flights %>%
      inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
                 by = "origin") %>% 
      inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
                 by = "dest") %>% 
      # 取子集,类似filter(row_number() <= 100)
      slice(1:100) %>%
      ggplot(aes(
        x = ori_lon, xend = des_lon,
        y = ori_lat, yend = des_lat
      )) +
      # 美国地图
      borders("state") +
      # 调整地图到合适比例
      coord_quickmap() +
      geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
      labs(y = "Latitude", x = "Longitude")
    

    Filtering joins

    1. semi_join(x, y) keeps all observations in x that have a match in y.
      Semi-joins are useful for matching filtered summary tables back to the original rows.
    2. anti_join(x, y) drops all observations in x that have a match in y.、
      Anti-joins are useful for diagnosing join mismatches.
    x <- tribble(
      ~key, ~val_x,
      # --/---
        1, "x1",
        2, "x2",
        3, "x3"
    )
    
    y <- tribble(
      ~key, ~val_y,
      # --/---
        1, "y1",
        2, "y2",
        4, "y3"
    )
    
    semi_join(x,y,by='key')
    # # A tibble: 2 x 2
    #     key val_x
    #   <dbl> <chr>
    # 1     1 x1   
    # 2     2 x2 
    
    semi_join(y,x,by='key')
    # # A tibble: 2 x 2
    #     key val_x
    #   <dbl> <chr>
    # 1     1 y1   
    # 2     2 y2
    
    anti_join(x,y,by='key')
    # # A tibble: 1 x 2
    #     key val_x
    #   <dbl> <chr>
    # 1     3 x3 
    
    anti_join(y,x,by='key')
    # # A tibble: 1 x 2
    #     key val_x
    #   <dbl> <chr>
    # 1     4 y3 
    
    
    # 另一个例子,获取飞往受欢迎目的地的所有航班
    top_dest <- flights %>%
      count(dest, sort = TRUE) %>%
      slice(1:10)
    
    flights %>% 
      filter(dest %in% top_dest$dest)
    
    # 用半连接
    flights %>% 
      semi_join(top_dest)
    

    Set operations

    1. intersect(x, y): return only observations in both x and y.
    2. union(x, y): return unique observations in x and y.
    3. setdiff(x, y): return observations in x, but not in y.

    These expect the x and y inputs to have the same variables

    df1 <- tribble(
      ~x, ~y,
       1,  1,
       2,  1
    )
    df2 <- tribble(
      ~x, ~y,
       1,  1,
       1,  2
    )
    
    intersect(df1, df2)
    #> # A tibble: 1 x 2
    #>       x     y
    #>   <dbl> <dbl>
    #> 1     1     1
    
    union(df1, df2)
    #> # A tibble: 3 x 2
    #>       x     y
    #>   <dbl> <dbl>
    #> 1     1     1
    #> 2     2     1
    #> 3     1     2
    
    setdiff(df1, df2)
    #> # A tibble: 1 x 2
    #>       x     y
    #>   <dbl> <dbl>
    #> 1     2     1
    
    setdiff(df2, df1)
    #> # A tibble: 1 x 2
    #>       x     y
    #>   <dbl> <dbl>
    #> 1     1     2
    

    相关文章

      网友评论

        本文标题:[R语言] Join 连接《R for data science

        本文链接:https://www.haomeiwen.com/subject/llebbhtx.html