这两节的习题好多问题啊
6.grouped summaries with summarise()
6.7 Exercises
-
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
方法一
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
>not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-15)==0.5)&(mean(arr_delay==15)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter(mean(arr_delay==10)==1)
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==-30)==0.5)&(mean(arr_delay==30)==0.5))
> not_cancelled%>%
+ group_by(flight)%>%
+ filter((mean(arr_delay==0)==0.99)&(mean(arr_delay==120)==0.01))
方法二
> percent<-not_cancelled%>%
+ group_by(flight)%>%
+ summarise(n=n(),percent_15ahead=sum(arr_delay==-15)/n,
+ percent_15late=sum(arr_delay==15)/n,
+ percent_10late=sum(arr_delay==10)/n,
+ percent_30ahead=sum(arr_delay==-30)/n,
+ percent_30late=sum(arr_delay==30)/n,
+ percent_ontime=sum(arr_delay==0)/n,
+ percent_2hlate=sum(arr_delay==120)/n)
> percent
# A tibble: 3,835 x 9
flight n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 697 0.0215 0.0100 0.00574 0.00574 0.00574
2 2 51 0.0392 0.0196 0 0 0
3 3 628 0.00955 0.00637 0.0159 0.0159 0.00318
4 4 391 0.0358 0.0102 0.00767 0.0128 0.00256
5 5 324 0.0123 0.00617 0.00926 0.0216 0
6 6 206 0.0291 0.00485 0.00485 0.0291 0
7 7 236 0.0169 0.00424 0 0.00847 0.00424
8 8 234 0.0556 0.00855 0.0214 0 0
9 9 152 0.0132 0.0132 0.0197 0 0
10 10 61 0.0164 0.0164 0.0328 0 0
# ... with 3,825 more rows, and 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_15ahead==0.5&percent_15late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
> percent%>%
+ filter(percent_10late==1)
# A tibble: 5 x 9
flight n percent_15ahead percent_15late percent_10late percent_30ahead percent_30late
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2254 1 0 0 1 0 0
2 3656 1 0 0 1 0 0
3 3785 1 0 0 1 0 0
4 3880 1 0 0 1 0 0
5 5854 1 0 0 1 0 0
# ... with 2 more variables: percent_ontime <dbl>, percent_2hlate <dbl>
> percent%>%
+ filter(percent_30ahead==0.5&percent_30late==0.5)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
> percent%>%
+ filter(percent_ontime==0.99&percent_2hlate==0.01)
# A tibble: 0 x 9
# ... with 9 variables: flight <int>, n <int>, percent_15ahead <dbl>, percent_15late <dbl>,
# percent_10late <dbl>, percent_30ahead <dbl>, percent_30late <dbl>, percent_ontime <dbl>,
# percent_2hlate <dbl>
还有个问题,如果这里的比例是算上cancel的怎么办?
Which is more important: arrival delay or departure delay?
我觉得应该是arr_delay吧,因为的确有起飞延误然后返航或者备降的航班?(dep_delay=na)
-
Come up with another approach that will give you the same output as
not_cancelled %>% count(dest)
andnot_cancelled %>% count(tailnum, wt = distance)
(without usingcount()
).> not_cancelled%>% + group_by(dest)%>% + summarise(n=n()) > not_cancelled%>% + group_by(tailnum)%>% + summarise(n=sum(distance))#被自己蠢哭了.. #wt=distance就是count=sum(n(x)*distance(x)),那直接分组后sum(distance)就好了啊
-
Our definition of cancelled flights (
is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?>summarise(flights,exist_dep_delay=sum(!is.na(dep_delay)),exist_arr_delay=sum(!is.na(arr_delay)),not_cancelled=sum(!(is.na(dep_delay)|is.na(arr_delay)))) # A tibble: 1 x 3 exist_dep_delay exist_arr_delay not_cancelled <int> <int> <int> 1 328521 327346 327346 > #也就是说,存在只有dep_delay的数据而无arr_delay的数据,那意思是返航了?所以应该扩大点范围? > #只要存在dep_delay就应该不算取消吗? >#我明白了,不可能有到达了但没有起飞的航班,so只需关注dep_delay
-
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
> flights%>%#题目里的pattern啥意思? +group_by(year,month,day)%>% +summarise(cancel=sum(is.na(dep_delay)|is.na(arr_delay)),cancel_prop=mean(is.na(dep_delay)|is.na(arr_delay)),avg_delay=mean(arr_delay,na.rm=TRUE))%>% +ggplot(mapping=aes(x=avg_delay,y=cancel_prop))+geom_point()+geom_smooth(se=FALSE) `geom_smooth()` using method = 'loess' and formula 'y ~ x'
不太明白题目里pattern啥意思。看图的话两个应该是正相关的
-
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarise(n())
)> not_cancelled%>% + group_by(carrier)%>% + summarise(mean_delay=mean(arr_delay[arr_delay>0]))%>% + ggplot(mapping=aes(x=carrier,y=mean_delay))+geom_bar(stat="identity")
怎么去定义worst delay?如果是指mean_delay最高的话,那应该是OO。
题目的后半部分不会做.我觉得没办法,因为有些地方的机场只有一个或两个carrier,或者说有的carrier只有一个航线。无法区分是bad carrier还是bad airport
-
What does the
sort
argument tocount()
do. When might you use it?sort if TRUE will sort output in descending order of n
> not_cancelled%>%
+ count(dest,sort=TRUE)#当想看哪个变量的数量最多时用很方便
# A tibble: 104 x 2
dest n
<chr> <int>
1 ATL 16837
2 ORD 16566
3 LAX 16026
4 BOS 15022
5 MCO 13967
6 CLT 13674
7 SFO 13173
8 FLL 11897
9 MIA 11593
10 DCA 9111
# ... with 94 more rows
等价于
not_cancelled%>%
group_by(dest)%>%
summarise(n=n())%>%
arrange(desc(n))
7.Grouped mutates (and filters)
Find the worst members of each group:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
#注意这里找的是每天arr_delay最高的九个航班,不是arr_delay最高的日子
#每天9个所以有3300多行哦
Find all groups bigger than a threshold:
> popular_dest<-flights%>%
+ group_by(dest)%>%
+ filter(n()>365)
> popular_dest
Standardise to compute per group metrics:
> popular_dest%>%#这个数据集已经按dest分组了
+ filter(arr_delay>0)%>%#这里的filter是对每行数据
+ mutate(prop_delay=arr_delay/sum(arr_delay))%>%#sum_delay是整组的sum(每个dest在整个数据集中的总delay)
#得到的prop是每一个单独的航程,占该dest全年所有航班总delay的比例
+ select(year:day,arr_delay,prop_delay)
Adding missing grouping variables: `dest`
Exercise
-
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
?我觉得没啥变化,只不过这些函数都是在group内进行啊 -
Which plane (tailnum) has the worst on-time record?
#还是那个问题,怎么定义worst?是发生delay的比例吗还是平均delay时间?
#如果先group再做filter/select,都是针对组内进行操作
#如果要对整组的情况做一个总体的描述,得用summarise
#这里如果先group_by(tailnum),后面得用summarise吧?
> tailnum<-not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(prop_delay=sum(arr_delay>0)/n(),mean_delay=sum(arr_delay)/n())
> arrange(tailnum,desc(prop_delay))
# A tibble: 4,037 x 3
tailnum prop_delay mean_delay
<chr> <dbl> <dbl>
1 N121DE 1 15
2 N136DL 1 146
3 N143DA 1 24
4 N17627 1 15.5
5 N240AT 1 53.6
6 N26906 1 19
7 N295AT 1 30.8
8 N302AS 1 29
9 N303AS 1 46
10 N32626 1 5
# ... with 4,027 more rows
> arrange(tailnum,desc(mean_delay))
# A tibble: 4,037 x 3
tailnum prop_delay mean_delay
<chr> <dbl> <dbl>
1 N844MH 1 320
2 N911DA 1 294
3 N922EV 1 276
4 N587NW 1 264
5 N851NW 1 219
6 N928DN 1 201
7 N7715E 1 188
8 N654UA 1 185
9 N665MQ 0.667 175.
10 N427SW 1 157
# ... with 4,027 more rows
#可是我用filter(rank(desc(prop_delay))<10)就怎么都返回空表诶
怎样精简代码呢?
-
What time of day should you fly if you want to avoid delays as much as possible?
-
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
> not_cancelled%>%
+ group_by(dest)%>%
+ summarise(total_by_dest=sum(arr_delay))
我不明白这里的flight,是指一个航班号(每天都飞,总共有很多行数据,是一个组)还是每一个单独的航程(就是单行数据)
先按航班号理解好了
思路:先按dest分组,用mutate计算每一个单独航程的比例,再按flight分组,sum
> by_flight<-not_cancelled%>%
+ group_by(flight)
> by_dest%>%
+ mutate(prop_each=arr_delay/sum(arr_delay))%>%
+ group_by(flight)%>%
+ mutate(prop_flight=sum(prop_each))%>%
+ select(year:day,flight,prop_each,prop_flight)
# A tibble: 327,346 x 6
# Groups: flight [3,835]
year month day flight prop_each prop_flight
<int> <int> <int> <int> <dbl> <dbl>
1 2013 1 1 1545 0.000366 0.0116
2 2013 1 1 1714 0.000666 0.0647
3 2013 1 1 1141 0.00952 -0.119
4 2013 1 1 725 -0.00246 0.0240
5 2013 1 1 461 -0.000131 -0.0243
6 2013 1 1 1696 0.000123 0.00471
7 2013 1 1 507 0.000198 0.0508
8 2013 1 1 5708 -0.000188 0.00154
9 2013 1 1 79 -0.000105 -0.00491
10 2013 1 1 301 0.0000822 -0.177
# ... with 327,336 more rows
- Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
思路:按origin分组,将数据按预计起飞时间(sched)排序,x=lag(dep_delay),y=dep_delay作图,得到前一个航班的delay时间与当前航班delay时间的相关关系
> flights%>%
+ group_by(origin)%>%
+ arrange(year,month,day,sched_dep_time)%>%
+ mutate(prev_delay=lag(dep_delay))%>%
+ ggplot(mapping=aes(x=prev_delay,y=dep_delay))+geom_point()
所以这图传达了什么信息呢?
所以这图是说,相关性不大吗orz
- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
思路:定义suspciously fast为air_time<=30min
按dest分组,去掉缺失值,去掉suspciously fast航班,找出最短飞行时间再计算
> not_cancelled%>%
+ filter(air_time<=30)
> not_cancelled%>%
+ filter(air_time>30)%>%
+ group_by(dest)%>%
+ mutate(rel_airtime=air_time/min(air_time))%>%
+ filter(rank(desc(rel_airtime))<10)
# A tibble: 940 x 20
# Groups: dest [104]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 646 645 1 1023 1030 -7 UA
2 2013 1 1 656 705 -9 1007 940 27 MQ
3 2013 1 1 805 815 -10 1006 1010 -4 MQ
4 2013 1 1 817 810 7 1005 948 17 B6
5 2013 1 1 848 851 -3 1155 1136 19 UA
6 2013 1 1 926 928 -2 1233 1220 13 UA
7 2013 1 1 1032 1035 -3 1305 1250 15 EV
8 2013 1 1 1044 1045 -1 1231 1212 19 EV
9 2013 1 1 1147 1155 -8 1335 1327 8 FL
10 2013 1 1 1330 1321 9 1613 1536 37 EV
- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
其实没懂题目啥意思,以下代码是在github看见的
> flights%>%
+ group_by(dest)%>%
+ filter(n_distinct(carrier)>=2)%>%
+ group_by(carrier)%>%
+ mutate(possible_transfers=n_distinct(dest))%>%
+ arrange(desc(possible_transfers))%>%
+ select(dest,year:day,possible_transfers)
Adding missing grouping variables: `carrier`
# A tibble: 325,397 x 6
# Groups: carrier [16]
carrier dest year month day possible_transfers
<chr> <chr> <int> <int> <int> <int>
1 EV IAD 2013 1 1 51
2 EV MSP 2013 1 1 51
3 EV IAD 2013 1 1 51
4 EV JAX 2013 1 1 51
5 EV CHS 2013 1 1 51
6 EV MEM 2013 1 1 51
7 EV CLE 2013 1 1 51
8 EV RDU 2013 1 1 51
9 EV DCA 2013 1 1 51
10 EV AVL 2013 1 1 51
# ... with 325,387 more rows
我有一个疑问,这里第二次group_by的时候是在先按dest分组的情况下分还是直接在原tibble重新分组?按最后输出结果以及可以用dest当运算参数看应该是重新分组的?我有点不懂这个管道操作是什么个意思了
- For each plane, count the number of flights before the first delay of greater than 1 hour.
思路:应该是按示例中的threshold类型处理。
想要得到一个第一次delay超过1小时之前的数据的tibble,得排序
哭泣了这题也不会做是在StackOverflow上找的答案
方法一:用first()函数得到第一个delay超过一小时的数据
> filghts_del<-flights%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(arr_delay>60)%>%
+ summarise(first_month=first(month),first_day=first(day),first_time=first(dep_time))
> filghts_del
# A tibble: 3,371 x 4
tailnum first_month first_day first_time
<chr> <int> <int> <int>
1 D942DN 2 11 1508
2 N0EGMQ 1 1 1604
3 N10156 1 13 2221
4 N102UW 6 22 2141
5 N104UW 2 13 1331
6 N10575 1 2 1548
7 N105UW 7 11 1411
8 N107US 7 27 1624
9 N108UW 8 13 1329
10 N109UW 9 13 1643
# ... with 3,361 more rows
> not_cancelled%>%
+ group_by(tailnum)%>%
+ summarise(n=sum(month<filghts_del$first_month)+sum(month==filghts_del$first_month&day<filghts_del$first_day)+sum(month==filghts_del$first_month&day==filghts_del$first_day&dep_time<filghts_del$first_time))
# A tibble: 4,037 x 2
tailnum n
<chr> <int>
1 D942DN 1390
2 N0EGMQ 913
3 N10156 1150
4 N102UW 671
5 N103US 693
6 N104UW 632
7 N10575 930
8 N105UW 476
9 N107US 583
10 N108UW 578
# ... with 4,027 more rows
There were 50 or more warnings (use warnings() to see the first 50)
特别复杂还不知道对不对,哭泣了
方法二:用cumall
> not_cancelled%>%
+ group_by(tailnum)%>%
+ arrange(year,month,day,dep_time)%>%
+ filter(cumall(arr_delay<60))%>%
+ tally()
# A tibble: 3,799 x 2
tailnum n
<chr> <int>
1 N10156 9
2 N102UW 25
3 N103US 46
4 N104UW 3
5 N105UW 22
6 N107US 20
7 N108UW 36
8 N109UW 28
9 N110UW 15
10 N11107 7
# ... with 3,789 more rows
网友评论