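Read online:
R for Data Science
GitHub repository: https://github.com/hadley/r4ds
I have been teaching myself R for a while through video courses and forum posts. Working on and off, I have got past the beginner stage, but my knowledge still feels unsystematic, so I want to study R systematically and streamline my workflow.
Learning goals. Application area: crop genetics and breeding. Data types: mainly transcriptome and resequencing data (without large-scale raw data processing), statistical analysis of field agronomic trait phenotypes, and laboratory marker statistics with marker-phenotype linkage analysis. Focus: solving biological problems rather than showing off technique; being able to mine key genes from resequencing data, visualise data, and make figures for publication.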
Data visualisation
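The go-to tool for data visualisation is of course the well-known ggplot2, which fully implements the grammar of graphics.
Install the tidyverse package, which bundles the R packages most commonly used in data analysis and saves time and effort. ## (For the packages you use most often, you could also write a similar function, so that one line of code loads them all.)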
install.packages("tidyverse")
library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0.9000 ✔ purrr 0.2.5
#> ✔ tibble 2.0.0 ✔ dplyr 0.7.8
#> ✔ tidyr 0.8.2 ✔ stringr 1.3.1
#> ✔ readr 1.3.1 ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
When function names conflict, use the package::function() form to say which package's function to run, for example stats::filter() and dplyr::filter().
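As a minimal sketch of the package::function() form, the two filter() functions named in the conflict message can be called side by side. The built-in mtcars data frame and the 3-point moving average are illustrative choices, not part of the original notes.

```r
# dplyr::filter() keeps the rows of a data frame that match a condition.
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series
# (here a 3-point moving average; the two end values are NA).
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)
```

With the prefix in place, it no longer matters which package was attached last.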
First steps: making a plot
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
When a ggplot2 call is split across lines, the "+" must go at the end of the previous line, never at the start of the next one.
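The rule about "+" can be sketched on the mpg dataset that ships with ggplot2; the object name p is only for illustration.

```r
library(ggplot2)

# The "+" ends the first line, so R knows the expression continues.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# If the "+" started the second line instead, R would treat
# `ggplot(data = mpg)` as a complete statement (an empty plot)
# and then fail on the stray `+ geom_point(...)` line.

# print(p) draws the scatterplot.
```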
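The first argument of ggplot(), data, is the dataset to plot; on its own, ggplot() creates an empty plot. You then add layers to it with geom_xxx() functions. Every geom function takes a mapping argument, which defines how variables in the data are mapped to visual properties and is always paired with aes(). Beyond x and y, you can map additional variables to aesthetics such as color (colour), size, and shape, so that a single plot shows several dimensions of the data. You can also set an aesthetic manually to a fixed value by passing it as an argument of the geom function, outside aes().
But note:
- Using size for a discrete variable is not advised.
- Using alpha for a discrete variable is not advised.
- The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate. # With more than 6 levels, ggplot2 leaves the additional groups unplotted.
- A continuous variable can not be mapped to shape.
The values you set for an aesthetic have to make sense for that aesthetic:
- The name of a color is a character string.
- The size of a point is in mm.
- The shape of a point is a number, as shown in Figure 3.1.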
image.png
Figure 3.1: R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the colour and fill aesthetics. The hollow shapes (0–14) have a border determined by colour and cannot be filled; the solid shapes (15–18) are filled with colour; the filled shapes (21–24) have a border of colour and are filled with fill.
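The difference between mapping an aesthetic and setting it can be sketched as follows. The colours, the size of 3 mm, and shape 21 are arbitrary illustrative values.

```r
library(ggplot2)

# Mapping: colour varies with a variable, so it goes inside aes().
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Setting: fixed values go outside aes(). Colour is a character string,
# size is in mm, and shape is a number (21 = circle with border and fill).
p_set <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "black", fill = "orange", size = 3, shape = 21
  )
```

In p_mapped ggplot2 picks one colour per level of class and adds a legend; in p_set every point looks the same and no legend is drawn.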
Facets
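Besides mapping variables with aes(), you can split a plot into facets with facet_wrap(), which draws a separate subplot for each level of a discrete variable.
Facet by the class variable: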
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Use facet_grid() to facet on the combination of two variables.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
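You can also put "." in place of one of the variables, e.g. + facet_grid(. ~ cyl) or + facet_grid(drv ~ .). The former lays the panels out in columns, the latter in rows.
facet_wrap(~class, scales = "free") ## each panel can have its own axis scales; the default is scales = "fixed".
facet_wrap(c("cyl", "drv"), labeller = "label_both") ## given a character vector built with c(), facet_wrap() can also facet on two variables; the labeller argument makes each strip label show both the variable name and its value.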
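The two facet_wrap() fragments above are not complete plots on their own. A minimal runnable sketch, assuming the same mpg scatterplot used elsewhere in these notes (the name base is illustrative):

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One panel per class, each with its own x and y scales.
p_free <- base + facet_wrap(~ class, scales = "free")

# One panel per observed cyl/drv combination, labelled like "cyl: 4".
p_both <- base + facet_wrap(c("cyl", "drv"), labeller = "label_both")
```

Unlike facet_grid(drv ~ cyl), facet_wrap() only draws panels for combinations that actually occur in the data.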
To repeat the same data in every panel, simply construct a data frame that does not contain the faceting variable.
ggplot(mpg, aes(displ, hwy)) +
geom_point(data = transform(mpg, class = NULL), colour = "grey85") +
geom_point() +
facet_wrap(~class)
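Statistical transformations
Every geom has a default stat, and you can draw the same plot by calling the corresponding stat_xxx() function instead of the geom.
You can also override the default when calling a geom by setting stat = "xx".
You can also change which computed variable is plotted: wrapping a name in .. (for example ..prop..) maps the aesthetic to a variable computed by the stat.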
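The first two points can be sketched on the diamonds dataset from ggplot2. The totals data frame is a hypothetical pre-counted table built only for this illustration.

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default, so these two draw the same bars.
p_geom <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# Overriding the stat: with stat = "identity" the bar heights are read
# straight from a column instead of being counted.
totals <- as.data.frame(table(cut = diamonds$cut), responseName = "n")
p_identity <- ggplot(data = totals) +
  geom_bar(mapping = aes(x = cut, y = n), stat = "identity")
```

layer_data() returns the values each layer actually plots, which is a convenient way to check that the three versions agree.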
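Position adjustments
dodge ## side by side
fill ## stacked and scaled to 100%
identity ## each object is placed exactly where it falls
jitter ## adds random noise to avoid overplotting
stack ## stacked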
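A short sketch of these position adjustments, using diamonds for the bar charts and mpg for the jittered points; the choice of clarity as the fill variable is only an example.

```r
library(ggplot2)

bars <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_stack <- bars + geom_bar()                    # default: stacked counts
p_dodge <- bars + geom_bar(position = "dodge")  # bars side by side
p_fill  <- bars + geom_bar(position = "fill")   # stacks scaled to 1

# Jitter spreads overlapping points with a little random noise.
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")
```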
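Coordinate systems
The most common coordinate system is the Cartesian one (an x axis and a y axis), but other coordinate systems are occasionally useful.
coord_flip() ## swaps the x and y axes
coord_quickmap() # sets the aspect ratio correctly for maps.
coord_polar() # uses polar coordinates; useful for pie charts, ring charts, and Coxcomb charts.
coord_fixed() ## fixes the aspect ratio, so that one unit on the x axis and one unit on the y axis are drawn in a fixed proportion.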
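A minimal sketch of three of these coordinate functions; the bar chart of diamonds$cut and the cty/hwy scatterplot are illustrative choices.

```r
library(ggplot2)

bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1)

p_flip  <- bar + coord_flip()   # horizontal bars
p_polar <- bar + coord_polar()  # the same bars drawn as a Coxcomb chart

# cty and hwy share a unit (miles per gallon), so a 1:1 aspect ratio
# makes distances comparable in both directions.
p_fixed <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  coord_fixed()
```

The coordinate system changes only how the layer is drawn; the data computed for each layer stays the same.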
Exercises:
- What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
image
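An aesthetic that you set manually must go outside aes(). Inside aes(), "blue" is treated as a data value and mapped through the colour scale, so the points are not drawn in blue.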
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
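One way to see the difference, sketched with layer_data(), which returns the values a layer actually draws; the object names are illustrative.

```r
library(ggplot2)

# "blue" inside aes() is a constant data value that the colour scale
# maps to its first default colour.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes() is used as the literal colour of every point.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```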
- Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
- Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
A continuous variable cannot be mapped to shape; both color and size accept one.
- What happens if you map the same variable to multiple aesthetics?
It works; the variable is simply shown through each of those aesthetics at once.
- What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
stroke sets the width of a shape's border; it applies to the shapes that have an outline (the non-solid ones).
- What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
A computed or logical expression can also be mapped directly; ggplot2 evaluates it and treats the result as the variable.
- What happens if you facet on a continuous variable?
Faceting works, but you get one panel per distinct value, which is too many to be meaningful.
- What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
- What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
- Take the first faceted plot in this section:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
- What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
- Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
- When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
- What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
geom_line(), geom_boxplot(), geom_histogram(), geom_area()
- Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
```
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
```
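- What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
show.legend = FALSE ## hides the legend for that layer; the default, NA, shows a legend whenever the layer maps an aesthetic.
- What does the se argument to geom_smooth() do?
The se argument controls whether the confidence interval around the smooth line is drawn; the default is TRUE, which draws it.
- Will these two graphs look different? Why/why not?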
```
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
```
- Recreate the R code necessary to generate the following graphs.
image
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(size = 2, se = FALSE)
image
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(aes(group = drv), size = 2, se = FALSE)
image
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 3) +
  geom_smooth(size = 2, se = FALSE)
image
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv), size = 3) +
  geom_smooth(size = 2, se = FALSE)
image
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv), size = 3) +
  geom_smooth(aes(linetype = drv), size = 2, se = FALSE)
image
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(fill = drv), size = 3, shape = 21, stroke = 2.5, color = "white")
- What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
The default geom of stat_summary() is "pointrange".
- What does geom_col() do? How is it different to geom_bar()?
geom_col() also draws a bar chart, but uses stat = "identity", so bar heights are taken directly from the data.
geom_bar() uses stat = "count" by default, so bar heights are the number of rows in each group.
- Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
- What variables does stat_smooth() compute? What parameters control its behaviour?
stat_smooth() computes y (the predicted value), ymin and ymax (the bounds of the confidence interval), and se (the standard error). Its behaviour is controlled mainly by the method argument, together with arguments such as formula, span, and level.
- In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
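Without an explicit group, ..prop.. is computed separately within each bar, so every bar has a proportion of 1. To get anything other than 1 the data must be grouped explicitly, which is what the group aesthetic does (group = 1 puts all rows into a single group). Grouping has no visual channel of its own, so a plot that uses group usually shows no visible difference between groups, but it does change the value of ..prop...
For the second plot: removing ", y = ..prop.." gives a stacked bar chart grouped by colour; adding position = "fill" gives a 100% stacked bar chart.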
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill=color))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill=color),position="fill")
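The effect of group = 1 on the first plot can be checked with a short sketch. It assumes ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; on the older version shown at the top of these notes, ..prop.. would be used in its place.

```r
library(ggplot2)

# Without group, each level of cut is its own group, so every
# proportion is computed within a single bar.
p_no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1, all rows form one group and the proportions are
# taken over the whole dataset.
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

layer_data(p_no_group)$prop
layer_data(p_grouped)$prop
```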
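- What parameters to geom_jitter() control the amount of jittering?
- Compare and contrast geom_jitter() with geom_count().
geom_jitter() ## shows every raw observation, moved by a small random amount
geom_count() ## shows counted data: one point per location, sized by the number of observations there
- What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
The default position adjustment of geom_boxplot() is "dodge2".
- What’s the difference between coord_quickmap() and coord_map()?
coord_quickmap() ## a quick approximation that computes faster than coord_map(); suited to maps of small areas.
coord_map() ## performs a full map projection; slower and more computationally demanding.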