R for Data Science (3): dplyr

Author: 子鹿学生信 | Published 2018-11-17 10:13


CHAPTER 3 Data Transformation with dplyr

library(nycflights13)
library(tidyverse)

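The conflicts message printed on loading shows that dplyr masks some base R functions. To use the masked versions, write their full names: stats::filter() and stats::lag().
The demo data is nycflights13::flights, which contains all 336,776 flights that departed from New York City in 2013. The data come from the US Bureau of Transportation Statistics.
To inspect it: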

?flights
class(flights)
dim(flights)
head(flights)
View(flights)

The data is a tibble: it is a data frame, but one that is more convenient to work with in the tidyverse.

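Core dplyr functions:

filter(): pick rows by their values
arrange(): reorder rows
select(): pick columns by name
mutate(): create new columns from existing ones
summarize(): collapse many values into a summary
All of these can be combined with group_by(), which makes them operate group by group.
They share one template: the first argument is a data frame, the following arguments describe what to do with it, and the result is a new data frame.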
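
As a minimal sketch of that template, the toy tibble below (made-up data, not part of the flights set) is passed as the first argument to each verb. Each call returns a new data frame and leaves the input unchanged:

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:4, value = c(10, 40, 20, 30))

big     <- filter(toy, value > 15)         # keep rows where value > 15
sorted  <- arrange(toy, desc(value))       # reorder rows by value, descending
doubled <- mutate(toy, double = value * 2) # add a column computed from value

nrow(big)      # 3 rows remain
sorted$id      # ids in the new order: 2 4 3 1
ncol(toy)      # the input still has 2 columns
```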

Filter Rows with filter()

filter(flights, month == 1, day == 1)
# To save the result, assign it to a variable
jan1 <- filter(flights, month == 1, day == 1)
# To save and print at the same time, wrap the assignment in parentheses
(jan1 <- filter(flights, month == 1, day == 1))

Comparisons

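Comparison operators: >, >=, <, <=, != (not equal), and == (equal).
Common mistakes:

Equality is tested with == rather than =.
Floating-point results are approximations, so an exact == comparison can fail unexpectedly; use near() instead.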

sqrt(2) ^ 2 == 2
#> [1] FALSE
1/49 * 49 == 1
#> [1] FALSE
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
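
near() compares with a small tolerance instead of testing exact equality. The sketch below illustrates the idea; the hand-written tolerance 1e-8 is my own choice for illustration and roughly matches the default documented for dplyr::near():

```r
library(dplyr)

x <- sqrt(2)^2
x == 2               # FALSE: x differs from 2 by a tiny rounding error
abs(x - 2) < 1e-8    # TRUE: the difference is far below a small tolerance
near(x, 2)           # TRUE: near() performs this kind of tolerance check
near(1, 1.1)         # FALSE: a real difference is still detected
```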

Logical Operators

Or is |, and is &, not is !.

# Find flights that departed in November or December
filter(flights, month == 11 | month == 12)

A useful shorthand is x %in% y, which is TRUE wherever x is one of the values in y:

(nov_dec <- filter(flights, month %in% c(11,12)))

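Other simplifications follow De Morgan's laws: !(x & y) is equivalent to !x | !y, and !(x | y) is equivalent to !x & !y.

# These two calls return the same rows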
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
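
The equivalence can be checked on a small example. This sketch uses a made-up tibble with the same column names (including one NA) rather than the full flights data:

```r
library(dplyr)

# Hypothetical delays, used only to compare the two forms
toy <- tibble(
  arr_delay = c(10, 200, 50, NA),
  dep_delay = c(5, 30, 300, 0)
)

a <- filter(toy, !(arr_delay > 120 | dep_delay > 120))
b <- filter(toy, arr_delay <= 120, dep_delay <= 120)

nrow(a)                   # 1: only the first row passes
isTRUE(all.equal(a, b))   # TRUE: both forms keep the same rows
```

The row with the missing arr_delay is dropped by both calls, because the condition evaluates to NA in each case.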

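Missing Values

Missing values are represented as NA ("not available"). Almost any operation involving an unknown value is itself unknown.
NULL, by contrast, represents an empty object, that is, nothing at all.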

NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 10
#> [1] NA

NA == NA
#> [1] NA

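is.na() tests whether a value is missing.
Note that filter() keeps only rows where the condition is TRUE; rows where it is FALSE or NA are dropped.
To keep the missing values, ask for them explicitly: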

df <- tibble(x=c(1,NA,3))
filter(df,x > 1)
#> # A tibble: 1 × 1
#> x
#> <dbl>
#> 1 3
filter(df, is.na(x) | x > 1)
#> # A tibble: 2 × 1
#> x
#> <dbl>
#> 1 NA
#> 2 3

Exercises:
1. a. Had an arrival delay of two or more hours

View(flights)
filter(flights, arr_delay >= 120)

b. Flew to Houston, i.e. the destination (dest) is either "IAH" or "HOU"

filter(flights,dest=='IAH' | dest=='HOU')
# Or select with %in%
filter(flights,dest %in% c('IAH','HOU'))

c. Were operated by United, American, or Delta

filter(flights, carrier %in% c("AA", "DL", "UA"))

d. Departed in summer (July, August, and September)

filter(flights, month >= 7, month <= 9)

e. Arrived more than two hours late, but didn’t leave late

filter(flights, dep_delay <= 0, arr_delay > 120)

f. Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)

g. Departed between midnight and 6 a.m. (inclusive)

filter(flights, dep_time <= 600 | dep_time == 2400)

What between() does:
between(x, left, right) is equivalent to x >= left & x <= right

filter(flights, between(month, 7, 9))
# month >= 7 & month <= 9

Flights with a missing dep_time:

filter(flights, is.na(dep_time))

Why is NA ^ 0 not missing? Why is NA | TRUE not missing?
Why is FALSE & NA not missing? Can you figure out the general
rule? (NA * 0 is a tricky counterexample!)

NA ^ 0
#> [1] 1
NA | TRUE
#> [1] TRUE
NA & FALSE
#> [1] FALSE
NA | FALSE
#> [1] NA
NA & TRUE
#> [1] NA
NA * 0
#> [1] NA
Inf * 0
#> [1] NaN
-Inf * 0
#> [1] NaN
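
One way to read these results (my interpretation, not something stated in the notes above): an operation on NA returns a non-missing value only when the answer would be the same for every value that NA could stand for. A short base R sketch of that rule:

```r
# x ^ 0 is 1 for every number x, so the result is known
NA^0          # 1

# Anything OR TRUE is TRUE; anything AND FALSE is FALSE
NA | TRUE     # TRUE
NA & FALSE    # FALSE

# x * 0 is not 0 for every x (Inf * 0 is NaN), so the result stays unknown
NA * 0        # NA
Inf * 0       # NaN
```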

Arrange Rows with arrange()

arrange(flights, year, month, day)

desc() sorts a column in descending order:

arrange(flights, desc(arr_delay))
df <- tibble(x = c(5, 2, NA))
arrange(df, x)        # missing values are sorted to the end
arrange(df, desc(x))  # still at the end, even in descending order
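
Because missing values always land at the end, bringing them to the front needs an extra sort key. A minimal sketch, assuming the same toy tibble, that sorts on is.na(x) first:

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

arrange(df, x)$x                    # 2 5 NA
arrange(df, desc(x))$x              # 5 2 NA
# Sort by "is missing" (TRUE first), then by the value itself
arrange(df, desc(is.na(x)), x)$x    # NA 2 5
```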

Select Columns with select()

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
select(flights, ends_with("y"))

select() works with a number of helper functions, including one for regular expressions:
starts_with("abc"), ends_with("xyz"), contains("ijk"), matches("(.)\\1"), num_range("x", 1:3)
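
To see what each helper matches, here is a small sketch on a made-up tibble whose column names were chosen to trigger each pattern (the names are hypothetical and do not come from flights):

```r
library(dplyr)

# Hypothetical one-row tibble; only the column names matter here
toy <- tibble(x1 = 1, x2 = 2, x3 = 3, abc_total = 4, total_xyz = 5, aa = 6)

names(select(toy, starts_with("abc")))    # "abc_total"
names(select(toy, ends_with("xyz")))      # "total_xyz"
names(select(toy, contains("tot")))       # "abc_total" "total_xyz"
names(select(toy, matches("(.)\\1")))     # a repeated character: "aa"
names(select(toy, num_range("x", 1:2)))   # "x1" "x2"
```

Note the doubled backslash in matches(): inside an R string, the regular expression back-reference \1 has to be written as "\\1".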

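rename() renames variables and keeps all the others:

rename(flights, tail_num = tailnum)
# everything() moves the named columns to the front and keeps the rest
select(flights, time_hour, air_time, everything())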

Exercises:
Exercise 5.4.1.1 Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
# By column position: dep_time = 4, dep_delay = 6, arr_time = 7, arr_delay = 9
select(flights, 4, 6, 7, 9)
select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, one_of(variables))
select(flights, starts_with("dep_"), starts_with("arr_"))
select(flights, matches("^(dep|arr)_(time|delay)$"))
# These two do NOT give the right answer: both also pick up sched_dep_time
# and sched_arr_time; the second also adds air_time and misses dep_delay
select(flights, ends_with("arr_time"), ends_with("dep_time"))
select(flights, contains("_time"), contains("arr_"))

Exercise 5.4.1.2 What happens if you include the name of a variable multiple times in a select() call?

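# select() ignores the duplicates; each column appears once, at its first position
select(flights, year, month, day, year, year)
# This is why everything() can be used to move a column to the front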
select(flights, arr_delay, everything())

Exercise 5.4.1.3 What does the one_of() function do? Why might it be helpful in conjunction with this vector?

# one_of() selects the columns named in a character vector
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))

Exercise 5.4.1.4 Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
# contains() ignores case by default; the ignore.case argument changes that
select(flights, contains("TIME", ignore.case = FALSE))

Add New Variables with mutate()

flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

# Columns created earlier in the same call can be used right away
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
# transmute() keeps only the new variables
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

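mutate() accepts any vectorized function: it must take a vector as input and return a vector of the same length. Commonly used creation functions are:

Arithmetic operators: + - * / ^
Modular arithmetic: %/% (integer division), %% (remainder)
Logs: log(), log2(), log10()
Offsets: lead() and lag() shift a vector by one position, so each value can be compared with the next or the previous one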

(x <- 1:10)
lead(x)
lag(x)

Cumulative aggregates: cumsum(), cumprod(), cummin(), cummax(), plus cummean() from dplyr

cumsum(x)
cummean(x)

Logical comparisons: <, <=, >, >=, !=, ==
Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), and ntile()
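
A few of these creation functions on small made-up vectors, as a sketch. The dep_time values imitate the HHMM clock format used in flights, and the vector x is invented for illustration:

```r
library(dplyr)

# Clock times stored as HHMM numbers, as in flights$dep_time
dep_time <- c(517, 1304, 2359)
dep_time %/% 100    # hours:    5 13 23
dep_time %% 100     # minutes: 17  4 59

x <- c(10, 20, 20, 40)
x - lag(x)          # change from the previous value: NA 10 0 20
min_rank(x)         # ties share the smallest rank:   1 2 2 4
dense_rank(x)       # no gap after ties:              1 2 2 3
```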

Grouped Summaries with summarize()

summarize() collapses a data frame into a single row; it is most useful together with group_by().

by_day <- group_by(flights, year, month, day)  # define the groups
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))  # one summary row per group

Combining Multiple Operations with the Pipe

head(flights)
# 1. Group flights by destination
by_dest <- group_by(flights, dest)
# 2. Compute the number of flights, mean distance, and mean arrival delay
delay <- summarize(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
# 3. Filter out small groups and Honolulu
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)

# The pipe %>% passes the result of each step on to the next one
delays <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

Missing Values and the na.rm Argument

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))  # a single NA in a group makes its result NA

flights %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay, na.rm = TRUE))

# Alternatively, remove the rows with missing delays first
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

Counts

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(delay=mean(arr_delay))

ggplot(delays,aes(delay))+geom_freqpoly(binwidth=10)


delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)

delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)

library(Lahman)
batting <- as_tibble(Lahman::Batting)

batters <- batting %>% group_by(playerID) %>% 
  summarize(ba=sum(H,na.rm=TRUE)/sum(AB,na.rm = TRUE),
            ab = sum(AB,na.rm = TRUE))

batters %>% filter(ab > 100) %>% ggplot(aes(ab, ba)) + geom_point() + geom_smooth(se = FALSE)

batters %>% arrange(desc(ba))

# Measures of location: mean(x), median(x)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1 = mean(arr_delay),                 # average delay
    avg_delay2 = mean(arr_delay[arr_delay > 0])   # average positive delay
  )

# Measures of spread: sd(x), interquartile range IQR(x), median absolute deviation mad(x)
not_cancelled %>%
  group_by(dest) %>%
  summarize(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))

# Measures of rank: min(x), quantile(x, 0.25), max(x)
# When do the first and last flights leave each day?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time),
    last = max(dep_time)
  )

not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))

# Counts: non-missing values with sum(!is.na(x)), distinct values with n_distinct(x)
# Which destinations have the most carriers?
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))

# Counting rows: n() and count()
not_cancelled %>%
  count(dest)

# With wt, count() sums a weight variable instead (here: total miles per plane)
not_cancelled %>%
  count(tailnum, wt = distance)

# How many flights left before 5am? (these usually
# indicate delayed flights from the previous day)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))
# Logical values can be counted too: sum(x > 10) gives the number of TRUEs,
# and mean(x > 10) gives their proportion.
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))
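
The counting trick works because TRUE is treated as 1 and FALSE as 0 in arithmetic. A base R sketch on a made-up vector:

```r
# Hypothetical values, used only for illustration
x <- c(3, 12, 25, 8)

x > 10          # FALSE TRUE TRUE FALSE
sum(x > 10)     # 2 values exceed 10
mean(x > 10)    # 0.5, i.e. half of them do
```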

Grouping by Multiple Variables

daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
(per_month <- summarize(per_day, flights = sum(flights)))
(per_year <- summarize(per_month, flights = sum(flights)))
flights %>% group_by(day) %>% summarize(mean_dep_time = mean(dep_time, na.rm = TRUE))

# ungroup() removes the grouping
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights = n()) # all flights
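
Each summarize() call on data grouped by several variables peels off the last grouping level, which is what makes the progressive roll-up above work. A sketch on a made-up tibble, using group_vars() to inspect the remaining grouping:

```r
library(dplyr)

# Hypothetical dates: three rows in month 1, one row in month 2
toy <- tibble(
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)

per_day <- toy %>%
  group_by(month, day) %>%
  summarize(n = n())
group_vars(per_day)     # "month": the day level has been peeled off

per_month <- summarize(per_day, n = sum(n))
per_month$n             # 3 1
group_vars(per_month)   # character(0): no grouping left
```

Recent dplyr versions print a message about the resulting grouping here; it is informational only.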

3.6.7 Exercises

(2) Come up with another approach that gives the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>% count(dest)
not_cancelled %>% count(tailnum, wt = distance)
# Group first, then compute the size of each group
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = length(dest))

not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

3.7 Grouped Mutates (and Filters)

# Find the worst members of each group
flights_sml %>% group_by(year,month,day) %>% filter(rank(desc(arr_delay))<10)

# Find all groups bigger than a threshold
popular_dests <- flights %>% group_by(dest) %>% filter(n()>365)
popular_dests

# Standardize within each group to compute per-group metrics
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
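
The same two patterns on a small made-up tibble, as a sketch: inside a grouped filter() or mutate(), functions such as n() and sum() are evaluated separately for each group.

```r
library(dplyr)

# Hypothetical destinations and delays, used only for illustration
toy <- tibble(
  dest  = c("A", "A", "A", "B", "B"),
  delay = c(10, 30, 60, 5, 15)
)

# Grouped filter: keep only groups with more than 2 rows
big <- toy %>% group_by(dest) %>% filter(n() > 2)
unique(big$dest)    # "A"

# Grouped mutate: each delay as a share of its group's total
shares <- toy %>% group_by(dest) %>% mutate(prop = delay / sum(delay))
shares$prop         # 0.10 0.30 0.60 0.25 0.75
```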

head(flights)
# %in% (not the pipe %>%) tests membership; IAH is a destination, not an origin
filter(flights, dest %in% c('IAH'))

