data-manipulation-with-dplyr
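Use glimpse() to get a quick overview of a table: its dimensions, column names, column types, and the first few values of each column.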
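A minimal sketch of glimpse(), assuming the dplyr package is installed. The `toy` table is a made-up stand-in for the counties data used in these notes.

```r
library(dplyr)

# Hypothetical toy table standing in for the counties data
toy <- tibble(
  state      = c("Texas", "Ohio", "Utah"),
  population = c(1000, 2000, 3000)
)

# Prints the row and column counts, then one line per column
# with its name, type, and first values
glimpse(toy)
```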
- select
Use select to keep only the columns you name.
# Select the columns
counties %>%
  select(state, county, population, poverty)
Review of the earlier verbs mutate, arrange, and filter:
counties %>%
  # Select the five columns
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order
  arrange(desc(proportion_men))
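select can also pick several columns at once.
select(state:drive)
selects every column from state through drive, in table order.
contains("work")
selects every column whose name contains "work".
starts_with("income")
selects every column whose name starts with "income".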
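The three selection styles can be compared on a small table. This is only a sketch: `toy` and its column names are invented for illustration and only loosely mirror the counties data.

```r
library(dplyr)

# Hypothetical toy table; the column names are assumptions
toy <- tibble(
  state      = c("Texas", "Ohio"),
  county     = c("Travis", "Franklin"),
  drive      = c(75.0, 80.1),
  work_home  = c(7.2, 4.1),
  income     = c(60000, 52000),
  income_err = c(1200, 900)
)

# A range of adjacent columns, from state through drive
range_cols <- toy %>% select(state:drive) %>% names()

# Every column whose name contains "work"
work_cols <- toy %>% select(contains("work")) %>% names()

# Every column whose name starts with "income"
income_cols <- toy %>% select(starts_with("income")) %>% names()

print(range_cols)
print(work_cols)
print(income_cols)
```

The helpers contains() and starts_with() only make sense inside a selecting verb such as select().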
- count
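count() counts the rows in each group; with no arguments it returns the total number of rows. Setting sort = TRUE orders the result from the largest count to the smallest.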
# Use count to find the number of counties in each region
counties_selected %>%
  count(region, sort = TRUE)
The wt argument weights the count: instead of counting rows, count() sums the given column within each group.
# Find number of counties per state, weighted by citizens
counties_selected %>%
  count(state, wt = citizens, sort = TRUE)
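To make the effect of wt concrete, here is a small sketch on invented data. The `toy` table and its numbers are assumptions, not values from the counties dataset.

```r
library(dplyr)

# Hypothetical toy table: two Texas counties and one Ohio county
toy <- tibble(
  state    = c("Texas", "Texas", "Ohio"),
  citizens = c(100, 300, 250)
)

# Unweighted: n is the number of rows (counties) per state
unweighted <- toy %>% count(state, sort = TRUE)

# Weighted: n is the sum of citizens per state
weighted <- toy %>% count(state, wt = citizens, sort = TRUE)

print(unweighted)
print(weighted)
```

In both results the output column is still called n; only its meaning changes.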
- top_n
top_n(2, population)
keeps the 2 rows with the largest population values. On a grouped table it keeps the top rows within each group.
Example:
# Group by region and find the greatest number of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
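A small sketch of the grouped case, on invented data. Since dplyr 1.0.0, top_n() is marked as superseded in favor of slice_max(), so the equivalent call is shown alongside it. The `toy` table is hypothetical.

```r
library(dplyr)

# Hypothetical toy table: two counties in each of two regions
toy <- tibble(
  region = c("South", "South", "West", "West"),
  county = c("A", "B", "C", "D"),
  walk   = c(1.5, 3.2, 4.8, 2.1)
)

# One row per region: the county with the highest walk value
by_top_n <- toy %>%
  group_by(region) %>%
  top_n(1, walk) %>%
  ungroup()

# The same selection written with slice_max()
by_slice_max <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 1) %>%
  ungroup()

print(by_top_n)
print(by_slice_max)
```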
- rename
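rename()
changes column names without dropping any columns.
Renaming also works directly inside select, using the same syntax:
new_name = old_name
Note that the new name goes on the left.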
# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
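The difference between renaming with rename() and renaming inside select() shows up in which columns survive. A sketch on a made-up table follows; `toy` and the `extra` column are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy table with one column we do not care about
toy <- tibble(
  state = c("Texas", "Ohio"),
  n     = c(254L, 88L),
  extra = c(TRUE, FALSE)
)

# rename() changes the name and keeps every column
renamed <- toy %>% rename(num_counties = n)

# select() renames and keeps only the listed columns
selected <- toy %>% select(state, num_counties = n)

print(names(renamed))
print(names(selected))
```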
- transmute
transmute combines select and mutate: it keeps only the columns you list, and each of them can be renamed or computed from existing columns.
counties %>%
  # Keep the state, county, and populations columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density)
Differences between select, rename, mutate, and transmute:
# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)
# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))
# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)
# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(state, county, employment_rate = employed / population)
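One way to see the four verbs side by side is to compare the column names each one returns. This is a sketch on an invented table; `toy` and the new name `residents` are hypothetical.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(
  state      = c("Texas", "Ohio"),
  population = c(1000, 2000),
  women      = c(510, 1000)
)

# rename: same columns, one of them under a new name
cols_rename <- names(rename(toy, residents = population))

# select: only the listed columns, no calculations
cols_select <- names(select(toy, state, women))

# mutate: all existing columns plus the calculated one
cols_mutate <- names(mutate(toy, fraction_women = women / population))

# transmute: only the listed columns, calculations allowed
cols_transmute <- names(transmute(toy, state, fraction_women = women / population))

print(cols_rename)
print(cols_select)
print(cols_mutate)
print(cols_transmute)
```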
- lag()
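lag()
shifts the elements of a vector back by one position, so each element is paired with the one that precedes it. The first element has no predecessor and becomes NA.
v <- c(1, 2, 4, 5)
lag(v)  # NA 1 2 4
This makes it easy to compute the difference between consecutive elements.
difference <- v - lag(v)  # NA 1 2 1
Example: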
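The shift can be checked directly on the vector from the notes. This sketch assumes dplyr is loaded, because base R also has a different stats::lag() that dplyr masks.

```r
library(dplyr)

v <- c(1, 2, 4, 5)

# Each element is replaced by the one before it;
# the first element has no predecessor, so it becomes NA
shifted <- lag(v)

# Change from the previous element to the current one
difference <- v - lag(v)

print(shifted)
print(difference)
```

Subtracting in the other order, lag(v) - v, gives the same magnitudes with the sign flipped.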
babynames_fraction %>%
  # Arrange the data in order of name, then year
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio between each year
  mutate(ratio = fraction / lag(fraction))
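Grouping matters here because lag() then restarts within each name, so the first year of one name is never compared with the last year of another. A sketch on invented numbers follows; the `toy` table is a hypothetical stand-in for babynames_fraction.

```r
library(dplyr)

# Hypothetical toy table: two names, two years each
toy <- tibble(
  name     = c("Ann", "Ann", "Bob", "Bob"),
  year     = c(2000, 2010, 2000, 2010),
  fraction = c(0.02, 0.01, 0.04, 0.06)
)

ratios <- toy %>%
  arrange(name, year) %>%
  group_by(name) %>%
  # Within each name, divide each year's fraction by the previous year's
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

print(ratios)
```

The first row of each name has a ratio of NA because there is no earlier year to divide by.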