data-manipulation-with-dplyr

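Use glimpse() to get a quick overview of a table: its dimensions, column names, column types, and the first few values of each column.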

select
Use select() to keep only the specified columns of a table.

# Select the columns 
counties %>%
  select(state, county, population, poverty)

A review of mutate(), arrange(), and filter() from earlier:

counties %>%
  # Select the five columns 
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order 
  arrange(desc(proportion_men))

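select() can also pick several columns at once.
select(state:drive) selects every column from state through drive.
contains("work") selects every column whose name contains "work".
starts_with("income") selects every column whose name starts with "income".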
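
The three selection styles above can be compared side by side. This is a minimal sketch on a hypothetical toy table (the `toy` data and its column names are made up for illustration, not taken from the counties dataset):

```r
library(dplyr)

# Hypothetical toy table; column names are illustrative only
toy <- tibble(
  state = c("Texas", "Ohio"),
  county = c("Travis", "Franklin"),
  work_at_home = c(7.1, 4.2),
  mean_commute = c(25.3, 21.9),
  drive = c(74.2, 80.1),
  income = c(61000, 54000),
  income_per_cap = c(34000, 29000)
)

# A range of adjacent columns, following their order in the table
range_cols <- toy %>% select(state:drive)

# Columns whose name contains "work"
work_cols <- toy %>% select(state, county, contains("work"))

# Columns whose name starts with "income"
income_cols <- toy %>% select(state, county, starts_with("income"))

names(range_cols)
names(work_cols)
names(income_cols)
```

Note that state:drive depends on the order of the columns in the table, while contains() and starts_with() depend only on the column names.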

count
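Counts the number of rows.
count() counts the rows in each group of the given variable; setting sort = TRUE orders the result by count, largest first.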

# Use count to find the number of counties in each region
counties_selected %>%
  count(region, sort = TRUE)

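You can also weight the count with the wt argument. Instead of counting rows, count() then sums the wt column within each group.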

# Find number of counties per state, weighted by citizens
counties_selected %>%
  count(state, wt = citizens, sort = TRUE)
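
To make the effect of wt concrete, here is a small sketch on hypothetical data (the `toy` table and its numbers are invented for illustration). It also checks that the weighted count agrees with an explicit group_by() plus summarize():

```r
library(dplyr)

# Hypothetical toy data, not the real counties table
toy <- tibble(
  state = c("Texas", "Texas", "Ohio"),
  county = c("Travis", "Harris", "Franklin"),
  citizens = c(800, 2500, 900)
)

# Unweighted: number of rows (counties) per state
by_rows <- toy %>% count(state, sort = TRUE)

# Weighted: total citizens per state
by_citizens <- toy %>% count(state, wt = citizens, sort = TRUE)

# The same totals computed by hand
by_hand <- toy %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))

by_rows
by_citizens
by_hand
```

In both cases the result column is called n; with wt it holds a sum rather than a row count.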

top_n
top_n(2, population) keeps the 2 rows with the largest values of population. On grouped data it does this within each group.
Example:

# Group by region and find the greatest number of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)
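
A minimal sketch of how the grouped top_n() behaves, using a hypothetical four-row table (`toy` and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: two counties in each region
toy <- tibble(
  region = c("South", "South", "West", "West"),
  county = c("A", "B", "C", "D"),
  walk = c(1.5, 3.2, 4.8, 2.0)
)

# Keep the single row with the highest walk value in each region
top_walk <- toy %>%
  group_by(region) %>%
  top_n(1, walk) %>%
  ungroup()

top_walk
```

One row per region survives, here counties B and C. In dplyr 1.0.0 and later, top_n() is marked as superseded in favour of slice_max(), although it still works.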

rename
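rename() changes the names of columns.
You can also rename a column directly inside a select() call,
using the same new_name = old_name syntax.
Note that the new name goes on the left.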

# Rename the n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)
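
The difference between renaming with rename() and renaming inside select() can be sketched as follows. The `toy` table is hypothetical and only serves the illustration:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  state = c("Texas", "Ohio"),
  county = c("Travis", "Franklin"),
  unemployment = c(4.1, 5.3)
)

# rename() keeps every column and changes only the name
renamed <- toy %>% rename(unemployment_rate = unemployment)

# select() renames too, but drops the columns that are not listed
selected <- toy %>% select(state, unemployment_rate = unemployment)

names(renamed)
names(selected)
```

In both calls the new name is on the left of the equals sign and the existing column is on the right.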

transmute
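transmute() can both rename columns and compute new ones, and it keeps only the columns you list. It works like a combination of select() and mutate().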

counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million 
  filter(population > 1000000) %>%
  # Sort density in ascending order 
  arrange(density)

Differences between select, rename, mutate, and transmute

# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)

# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))

# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)

# Keep only the state, county, and employment_rate columns
counties %>%
  transmute(state, county, employment_rate = employed / population)
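
One way to see the difference between the four verbs is to look only at which columns come out. The following is a sketch on a hypothetical one-row table (`toy` and the name `pop` are illustrative assumptions):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  state = "Texas",
  county = "Travis",
  population = 1000,
  women = 510
)

# select: keeps only the listed columns, computes nothing
sel <- toy %>% select(state, county)

# rename: keeps all columns, changes one name
ren <- toy %>% rename(pop = population)

# mutate: keeps all columns and adds a computed one
mut <- toy %>% mutate(fraction_women = women / population)

# transmute: keeps only the listed columns, which may be computed
tra <- toy %>% transmute(state, fraction_women = women / population)

names(sel)
names(ren)
names(mut)
names(tra)
```

In short, select() and transmute() drop unlisted columns, while rename() and mutate() keep them; mutate() and transmute() can compute values, while select() and rename() cannot.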

lag()
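lag() shifts the values of a vector back by one position, so that each element lines up with the one before it. The first element becomes NA.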

v <- c(1, 2, 4, 5)
lag(v)  # NA 1 2 4

This makes it easy to compute the difference between consecutive elements.

difference <- v - lag(v)  # NA 1 2 1

Example:

babynames_fraction %>%
  # Arrange the data in order of name, then year 
  arrange(name, year) %>%
  # Group the data by name
  group_by(name) %>%
  # Add a ratio column that contains the ratio between each year 
  mutate(ratio = fraction / lag(fraction))
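
The group_by() step matters here: inside a grouped mutate(), lag() restarts for every group, so the first year of each name gets NA instead of being compared with the last year of the previous name. A minimal sketch with hypothetical data (the names and fractions in `toy` are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data in the shape of babynames_fraction
toy <- tibble(
  name = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  year = c(2000, 2001, 2002, 2000, 2001),
  fraction = c(0.010, 0.020, 0.010, 0.004, 0.002)
)

ratios <- toy %>%
  # Order by name, then year, so lag() sees consecutive years
  arrange(name, year) %>%
  # lag() is applied separately within each name
  group_by(name) %>%
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

ratios
```

A ratio above 1 means the name became more common than in the previous year, and a ratio below 1 means it became less common.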

Summary