Joining Data with dplyr
inner join
inner_join
,通过某个列表信息作为纽带,将两个不同的表格连接起来。
sets %>%
inner_join(theme, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
# 通过theme_id与id将sets 与theme 连接起来。
c("first_table_column" = "second_table_column").
需要注意的是,by后面内容有顺序要求。通过by 后面内容中的向量使用,
分割,可以设定列表通过多行内容连接。
-
连接多行列表
只要是相连的两列有相同内容的信息即可,即便所有列表可能不全部包含该列信息。
left join
如果相连接的列表中有某个列表没有某个内容,inner join 会自动忽略掉那一行oberservation 的信息。
使用left join 可以保留first table 中的信息,而被连接列表中没有的信息则显示为NA
。
例子
inventory_version_1 <- inventories %>%
filter(version == 1)
# Join versions to sets
sets %>%
left_join(inventory_version_1, by = "set_num") %>%
# Filter for where version is na
filter(is.na(version))
right join
与left join 相反,保留连接的second table中内容
-
三者对比
replace_na
replace_na(list(n=0))
例子
batman_parts %>%
# Combine the star_wars_parts table
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
# Replace NAs with 0s in the n_batman and n_star_wars columns
replace_na(list(n_batman = 0, n_star_wars = 0))
join table to themselves
通过自连接可以将这种具有层级关系的表格连接在一起。
子数据关于父数据的id(parent_id)直接与对于的父数据id 相等。
例子
# themes 表格
id name parent_id
<dbl> <chr> <dbl>
1 1 Technic NA
2 2 Arctic Technic 1
3 3 Competition 1
4 4 Expert Builder 1
5 5 Model 1
6 6 Airport 5
7 7 Construction 5
8 8 Farm 5
9 9 Fire 5
10 10 Harbor 5
themes %>%
# Inner join the themes table
inner_join(themes, by = c("id" = "parent_id"),suffix = c("_parent", "_child")) %>%
# Filter for the "Harry Potter" parent name
filter(name_parent == "Harry Potter")
# 自连接后输出
id name_parent parent_id id_child name_child
<dbl> <chr> <dbl> <dbl> <chr>
1 246 Harry Potter NA 247 Chamber of Secrets
2 246 Harry Potter NA 248 Goblet of Fire
3 246 Harry Potter NA 249 Order of the Phoenix
4 246 Harry Potter NA 250 Prisoner of Azkaban
5 246 Harry Potter NA 251 Sorcerer's Stone
6 246 Harry Potter NA 667 Fantastic Beasts
full join
全连接,不管连的(first table)还是被连的(second table)若数据存在缺失,均连接起来。
缺失内容用NA
表示。
batman_parts %>%
# Combine the star_wars_parts table
full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
# Replace NAs with 0s in the n_batman and n_star_wars columns
replace_na(list(n_batman = 0, n_star_wars = 0))
inner_join, left/right_join, full_join 都可以称为mutating join。
filtering join
semi join & anti join
例子
# Filter the batwing set for parts that are also in the batmobile set
batwing %>%
semi_join(batmobile, by = c("part_num"))
# Filter the batwing set for parts that aren't in the batmobile set
batwing %>%
anti_join(batmobile, by = c("part_num"))
总结
ps: 使用n() 获得总共的变量数量
questions_with_tags %>%
# Group by tag_name
group_by(tag_name) %>%
# Get mean score and num_questions
summarize(score = mean(score),
num_questions = n()) %>%
# Sort num_questions in descending order
arrange(desc(num_questions))
其他语句
bind_rows()
可以将两个表格的内容上下拼接起来(行与行)
例子
# Combine the two tables into posts_with_tags
posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
answers_with_tags %>% mutate(type = "answer"))
# Add a year column, then aggregate by type, year, and tag_name
posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)
网友评论