[R语言] Join 连接《R for data science

作者: 半为花间酒 | 来源:发表于2020-04-22 09:09 被阅读0次

[R语言] Join 连接《R for data science
「r<-workshop」第二期：程序控制与数据操作流
学习小组Day4笔记--安之若素
学习小组DAY4笔记——shoan
《R数据科学》学习笔记|Note6:使用tibble实现简单数据
《R数据科学》学习笔记|Note1:绪论
【R for Data Science】(2) tibble
《R数据科学》学习笔记|Note15:使用purrr实现迭代(上
《R数据科学》学习笔记|Note16:使用purrr实现迭代(下
《R数据科学》学习笔记|Note12:使用magrittr进行管

《R for Data Science》第十三章 Relational data 啃书知识点积累
参考链接：

R for Data Science

R语言中dplyr包join函数之目前我看到过的最形象的教程

Introduction

本章节的应用前提需要了解nycflights13各数据集的内容

library(nycflights13)

?flights # dplyr用的很熟悉就不查了
?airlines
?airports
?planes
?weather

画模式图

涉及了datamodelr和DiagrammeR

library(datamodelr,DiagrammeR)

dm <- dm_from_data_frames(list(
  #  指定数据集来源
  Master = Lahman::Master,
  Managers = Lahman::Managers,
  AwardsManagers = Lahman::AwardsManagers
)) %>%
  # 设置主键
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
  dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
  # 设置外键
  dm_add_references(
    Managers$playerID == Master$playerID,
    AwardsManagers$playerID == Master$playerID
  )

# 绘图
# rankdir可以指定'BT', 'RL'等
# columnArrows默认为F，只是数据集间粗略箭头
dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
  dm_render_graph()

There are three families of verbs designed to work with relational data:

Mutating joins, which add new variables to one data frame from matching observations in another.

Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

Set operations, which treat observations as if they were set elements.

Keys

There are two types of keys:

A primary key uniquely identifies an observation in its own table.

A foreign key uniquely identifies an observation in another table.

确认是不是主键：分组后各组仅一个观测

planes %>% 
  count(tailnum) %>% 
  filter(n > 1) %>% 
  nrow()
# [1] 0

代理键的概念：

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

flights %>% 
  mutate(id = row_number()) %>% 
  select(id,everything())

Mutating joins

flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  left_join(airlines, by = "carrier")


# 也可以利用mutate和match完成left_join相同操作，但麻烦很多
flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

- inner_join

The most important property of an inner join is that unmatched rows are not included in the result.

- Outer joins

A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

The left join should be your default join

base::merge()有类似join的功能

- Defining the key columns

The default, by = NULL, uses all variables that appear in both tables, the so called natural join
A character vector,by = "x". This is like a natural join, but uses only some of the common variables.

flights %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

flights %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
#> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#> #   tzone <chr>

- Q: 画出航班的大致起落

flights %>%
  inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
             by = "origin") %>% 
  inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
             by = "dest") %>% 
  # 取子集，类似filter(row_number() <= 100)
  slice(1:100) %>%
  ggplot(aes(
    x = ori_lon, xend = des_lon,
    y = ori_lat, yend = des_lat
  )) +
  # 美国地图
  borders("state") +
  # 调整地图到合适比例
  coord_quickmap() +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  labs(y = "Latitude", x = "Longitude")

Filtering joins

semi_join(x, y) keeps all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows.
anti_join(x, y) drops all observations in x that have a match in y.、
Anti-joins are useful for diagnosing join mismatches.

x <- tribble(
  ~key, ~val_x,
  # --/---
    1, "x1",
    2, "x2",
    3, "x3"
)

y <- tribble(
  ~key, ~val_y,
  # --/---
    1, "y1",
    2, "y2",
    4, "y3"
)

semi_join(x,y,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 x1   
# 2     2 x2 

semi_join(y,x,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 y1   
# 2     2 y2

anti_join(x,y,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     3 x3 

anti_join(y,x,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     4 y3 


# 另一个例子，获取飞往受欢迎目的地的所有航班
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  slice(1:10)

flights %>% 
  filter(dest %in% top_dest$dest)

# 用半连接
flights %>% 
  semi_join(top_dest)

Set operations

intersect(x, y): return only observations in both x and y.

union(x, y): return unique observations in x and y.

setdiff(x, y): return observations in x, but not in y.

These expect the x and y inputs to have the same variables

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     1     2

setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1

setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2