R for Data Science (1): ggplot2

Author: 子鹿学生信 | Published: 2018-11-17 10:12

    1. install packages

    install.packages("tidyverse")
    library(tidyverse)
    tidyverse_update()
    ##################

    Install three data packages

    install.packages(c("nycflights13", "gapminder", "Lahman"))

    The tidyverse includes the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages.

    PART I Explore

    CHAPTER 1: Data Visualization with ggplot2

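    We use the mpg data set from the ggplot2 package as the example. It is a data frame in which each row is an observation and each column is a variable. mpg contains observations on 38 models of car.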

    # Inspect the data set
    head(ggplot2::mpg)
    

    displ: engine displacement, in litres; hwy: fuel efficiency on the highway, in miles per gallon
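As a quick check of the description above, the following sketch looks at the size of mpg and counts its distinct models. It only assumes that ggplot2 is installed; the variable names n_rows, n_cols, and n_models are chosen here for illustration.

```r
library(ggplot2)

# Each row is one observation; each column is one variable
n_rows <- nrow(mpg)
n_cols <- ncol(mpg)

# Number of distinct car models recorded in the data
n_models <- length(unique(mpg$model))

c(rows = n_rows, columns = n_cols, models = n_models)
```

The number of rows is much larger than the number of models because each model appears several times, with different years, engines, and transmissions.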

    • Create the first ggplot from this data set
    library(ggplot2)
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))
    

    The plot shows a negative relationship between engine size and fuel efficiency.
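One way to put a number on the direction of that relationship is the Pearson correlation between the two variables. This is only a rough summary (the scatterplot is not perfectly linear), and the variable name r is chosen here for illustration.

```r
library(ggplot2)

# Pearson correlation between engine displacement and highway mileage
r <- cor(mpg$displ, mpg$hwy)

# A negative value means larger engines tend to go with lower mileage
r
```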

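    • ggplot() creates the basic coordinate system, to which layers can then be added.
    # An empty plot: only the default background and theme are set up
    ggplot(data = mpg)
    # aes() maps variables in the data to visual properties of the plot
    ggplot(data = mpg) + geom_point(aes(displ, hwy))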
    
    # Inspect the mpg data
    dim(mpg)
    head(mpg)
    # Look at the relationship between hwy and cyl
    ggplot(mpg,aes(hwy,cyl)) + geom_point()
    

    Here is a template for making plots:
    ggplot(data = <DATA>) +
    <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

    Aesthetic Mappings

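    An aesthetic is a visual property of the objects in a plot, such as the size or colour of the points.
    We can colour the points by the value of a variable, for example class.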

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class))
    

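    We can also map class to point size. ggplot2 warns here, because mapping an unordered discrete variable to size is not advised.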

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, size = class))
    

    Or map it to transparency (alpha) or to shape

    # Top
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
    # Bottom
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, shape = class))
    # ggplot2 uses only six shapes at a time; class has seven values, so the suv points are not drawn
    

    We can also set an aesthetic manually, as an argument of the geom function outside aes()

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
    

    Exercises:
    1. Why are the points not blue?

    ggplot(data = mpg) +
    geom_point(
    mapping = aes(x = displ, y = hwy, color = "blue")
    )
    

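    Because color = "blue" is inside aes(), "blue" is treated as a data value (a variable with a single level), not as a colour name. The colour scale then assigns that level the first colour of its default palette.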
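The difference can be inspected in the data that ggplot2 computes for each layer. The sketch below uses layer_data() from ggplot2; the names p_mapped and p_fixed are chosen here for illustration.

```r
library(ggplot2)

# "blue" inside aes(): treated as a one-level variable and sent through the colour scale
p_mapped <- ggplot(mpg) + geom_point(aes(displ, hwy, colour = "blue"))

# "blue" outside aes(): used directly as the colour of every point
p_fixed <- ggplot(mpg) + geom_point(aes(displ, hwy), colour = "blue")

# Colours actually used when each layer is drawn
colour_mapped <- unique(layer_data(p_mapped)$colour)
colour_fixed <- unique(layer_data(p_fixed)$colour)

colour_mapped  # a colour from the default discrete palette, not "blue"
colour_fixed   # "blue"
```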

    ggplot(data = mpg) +
    geom_point(
    mapping = aes(x = displ, y =hwy, color = cty))
    

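    2. Note the difference between mapping a continuous variable and mapping a categorical one. For colour, a continuous variable gets a gradient from dark to light, while a categorical variable gets a distinct colour for each category.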

    ggplot(data = mpg) +
    geom_point(
    mapping = aes(x = displ, y =hwy, color = displ))
    

    4. Mapping one variable to several aesthetics is allowed, but it is redundant, so it is usually avoided.

    5. What does the stroke aesthetic do?
    ggplot(mtcars, aes(wt, mpg)) +
      geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 1)
    

    stroke sets the border width of the points, for shapes that have a border, such as shape 21.

    ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
      geom_point()
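In the plot above, colour is mapped to the expression displ < 5 rather than to a column. A minimal sketch of what happens: aes() evaluates the expression against the data, which yields a logical value per row, and the colour scale then treats TRUE and FALSE as two categories. The names is_small, p, and n_colours are chosen here for illustration.

```r
library(ggplot2)

# The expression is evaluated row by row and gives a logical vector
is_small <- mpg$displ < 5
class(is_small)
table(is_small)

# The same expression used as a colour mapping
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()

# Number of distinct colours used in the drawn layer
n_colours <- length(unique(layer_data(p)$colour))
n_colours
```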
    

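    Note: R syntax errors are easy to make. Check that every ( is matched by a ) and that every " is paired. If R seems to do nothing after you run code (the console shows a + prompt and waits for more input), press Esc to abort.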

    Facets

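    Besides mapping a variable to an aesthetic, another way to add information is to facet on a categorical variable, which splits the plot into several small panels.
    There are two faceting functions. facet_wrap(~ variable, nrow, ncol) facets on a single categorical variable.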

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ class, nrow = 2)
    

    facet_grid(a ~ b) facets on the combination of two variables: a defines the rows and b defines the columns

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ cyl)
    

    To facet on only one variable with facet_grid(), put a . in place of the other variable.

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(. ~ cyl)
    
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ .)
    

    Exercises:

    1. What happens if you facet on a continuous variable?
    head(mpg)
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ cty, nrow = 2)
    

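    The continuous variable is treated as categorical, and each distinct value gets its own facet.
    2. What do the empty positions in this plot mean?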
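Since every distinct value becomes a panel, the number of panels can be estimated before plotting by counting the distinct values. A small sketch, with n_panels as an illustrative name:

```r
library(ggplot2)

# facet_wrap(~ cty) draws one panel per distinct value of cty
n_panels <- length(unique(mpg$cty))
n_panels
```

For a variable with many distinct values this quickly produces more panels than can be read comfortably, which is why faceting is normally reserved for categorical variables.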

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = drv, y = cyl))
    

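    An empty position means there are no observations with that combination of drv and cyl.
    3. How do the following two pieces of code differ?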
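The same gaps can be seen without a plot by cross-tabulating the two variables. This sketch uses base R's table(); the names combos and n_empty are chosen here for illustration.

```r
library(ggplot2)

# Count the observations for every combination of drv and cyl
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Combinations with a count of zero are the empty positions in the plot
n_empty <- sum(combos == 0)
n_empty
```

These zero cells are also the panels that stay empty in facet_grid(drv ~ cyl).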

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ .)
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(. ~ cyl)
    

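    The position of the . marks the dimension that is not used for faceting: drv ~ . facets into rows, and . ~ cyl facets into columns.
    4. What are the advantages and disadvantages of faceting instead of mapping to colour?
    The eye can tell apart only a limited number of colours in one plot (no more than about nine). Faceting can separate more categories, but it makes comparison across panels harder.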

    3.6 Geometric Objects

    A geometric object (geom) is the shape that a plot uses to represent the data.

    # left
    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))
    # right
    ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))
    

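    Every geom function takes a mapping argument, but the aesthetics are specific to each geom: not every aesthetic works with every geom.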

    ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
    

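    Many geoms can display several groups of data. ggplot2 groups the data automatically when an aesthetic is mapped to a discrete variable; the group aesthetic itself does the grouping without adding a legend.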

    ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))
    ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
    ggplot(data = mpg) +
    geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
    )
    

    ggplot2 can also draw several layers in one plot

    ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    geom_smooth(mapping = aes(x = displ, y = hwy))
    

    Several geoms in the same plot: the difference between local and global mappings. Where they conflict, the local mapping wins.

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(color = class)) +
      geom_smooth(
        # dplyr::filter() keeps only the subcompact cars for this layer
        data = dplyr::filter(mpg, class == "subcompact"),
        se = FALSE
      )
    

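    Passing filter(...) as the local data argument makes geom_smooth() use only the filtered rows. se = FALSE hides the confidence band that is otherwise drawn around the line.
    Exercises:
    Exercise 3.6.2 What does the plot produced by this code look like?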

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
      geom_point() +
      geom_smooth(se = FALSE)
    

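    colour = drv is a global mapping, so it is passed on to both geom_point() and geom_smooth(): the points and the smooth lines are both drawn, each coloured by drv.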

    Exercise 3.6.3
    What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

    ggplot(data = mpg) +
      geom_smooth(
        mapping = aes(x = displ, y = hwy, colour = drv)
      )
    
    ggplot(data = mpg) +
      geom_smooth(
        mapping = aes(x = displ, y = hwy, colour = drv),
        show.legend = FALSE)
    
    Exercise 3.6.6 Re-create the R code necessary to generate the following graphs.
    ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(se = FALSE)
    ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(aes(group = drv), se = FALSE)
    ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
      geom_point() +
      geom_smooth(se = FALSE)
    ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = drv)) + geom_smooth(se = FALSE)
    ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = drv)) + geom_smooth(aes(linetype = drv), se = FALSE)
    ggplot(mpg, aes(x = displ, y = hwy)) +
       geom_point(size = 4, color = "white") +
       geom_point(aes(colour = drv))
    

    3.7 Statistical Transformations

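    A statistical transformation (stat) is the algorithm used to compute new values for a plot.
    Every geom has a default stat, and every stat has a default geom.
    geom_bar() draws a bar chart; its default stat is stat_count().
    The default stat is usually fine. There are three situations where you would specify a stat explicitly:
    1. Overriding the default stat

    • The default stat of a bar chart is stat_count(), which counts the rows at each x value. When the bar heights are already present in the data, you need to override the default.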
    library(tibble)
    
    demo <- tribble(
    ~a, ~b,
    "bar_1", 20,
    "bar_2", 30,
    "bar_3", 40
    )
    # The default is stat = "count"; here it is changed to "identity"
    
    ggplot(data = demo) +
    geom_bar(
    mapping = aes(x = a, y = b), stat = "identity"
    )
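geom_col() is a shortcut for the same thing, since its default stat is already identity. The sketch below rebuilds the demo table and compares the bar heights that the two versions compute; the names p_bar, p_col, heights_bar, and heights_col are chosen here for illustration.

```r
library(ggplot2)
library(tibble)

demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)

# Bar chart with the default stat overridden
p_bar <- ggplot(demo) + geom_bar(aes(x = a, y = b), stat = "identity")

# geom_col() uses the values in the data as bar heights by default
p_col <- ggplot(demo) + geom_col(aes(x = a, y = b))

# Bar heights computed for each plot
heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)

identical(heights_bar, heights_col)
```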
    
    
    

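    2. Overriding the default mapping from the variables computed by a stat to aesthetics
    By default the y axis of a bar chart is the count at each x value. In this example x holds the five levels of cut (cut quality), and the bar chart counts the diamonds in each level automatically. To show the proportion of each level instead of the count, use prop.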

    ggplot(diamonds,aes(cut,..prop..,group=1))+geom_bar()
    # group = 1 treats all diamonds as a single group, so each bar shows that cut's share of the whole
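The proportions drawn by this plot can be checked against a manual calculation. The sketch below assumes ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; the names p, plotted, and manual are chosen here for illustration.

```r
library(ggplot2)

# Same plot as above, written with after_stat() instead of the ..prop.. notation
p <- ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

# Bar heights computed by the stat
plotted <- as.numeric(layer_data(p)$y)

# Share of each cut computed by hand
manual <- as.numeric(prop.table(table(diamonds$cut)))

all.equal(plotted, manual)
```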
    

    3. Drawing attention to the statistical transformation in the code
    Take stat_summary() as an example.

    ggplot(diamonds)+stat_summary(aes(cut,depth),
                                  fun.ymin = min,
                                  fun.ymax=max,
                                  fun.y=median)
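In ggplot2 3.3.0 and later the arguments fun.ymin, fun.ymax, and fun.y are deprecated in favour of fun.min, fun.max, and fun. A sketch of the same summary with the newer names, assuming such a version is installed; p and summary_df are illustrative names.

```r
library(ggplot2)

# Median depth per cut, with a range from the minimum to the maximum
p <- ggplot(diamonds) +
  stat_summary(
    aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

# One row per cut: ymin, y (the median), and ymax
summary_df <- layer_data(p)
summary_df[, c("x", "ymin", "y", "ymax")]
```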
    

    Exercises:
    1. What is the default geom of stat_summary()?
    The default geom of stat_summary() is geom_pointrange(), whereas the default stat of geom_pointrange() is identity

    ggplot(diamonds) + geom_pointrange(aes(cut,depth),
                                       stat = 'summary',
                                       fun.ymin=min,
                                       fun.ymax=max,
                                       fun.y=median)
    
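    2. The difference between geom_col() and geom_bar()
      The default stat of geom_col() is stat_identity(); the default of geom_bar() is stat_count().

    3. stat_smooth() computes the predicted value, the lower and upper bounds of the confidence interval, and the standard error.

    4. Why set group = 1 in geom_bar(aes(y = ..prop..))?
      By default the groups are defined by x, and proportions are computed within each group, so without group = 1 every bar has a proportion of 1.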

    ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, y = ..prop..))
    ggplot(data = diamonds) +
    geom_bar(
    mapping = aes(x = cut, fill = color, y = ..prop..)
    )
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
    
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut, fill = color, y = ..prop.., group = color))
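The effect of group = 1 shows up directly in the prop values that the stat computes. A minimal sketch, assuming ggplot2 3.3.0 or later for after_stat(); the names p_default, p_grouped, prop_default, and prop_grouped are chosen here for illustration.

```r
library(ggplot2)

# Without group = 1: each cut is its own group
p_default <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop)))

# With group = 1: all diamonds form one group
p_grouped <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))

prop_default <- layer_data(p_default)$prop
prop_grouped <- layer_data(p_grouped)$prop

prop_default       # every group is 100% of itself
sum(prop_grouped)  # the five shares add up to 1
```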
    

    3.8 Position Adjustments

    The colour of geom_bar() can be controlled with color (the outline) and fill (the interior)

    ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, color = cut))
    ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = cut))
    
    ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity))
    

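    Besides the default "stack", the position argument of a bar chart accepts three other values: "identity", "dodge", or "fill".
    "identity" places each bar exactly where its value falls, so the bars overlap

    ggplot(diamonds, aes(cut, fill = clarity)) + geom_bar(alpha = 1/5, position = "identity")
    ggplot(diamonds, aes(cut, color = clarity)) + geom_bar(fill = NA, position = "identity")
    

    "fill" stacks the bars and scales every stack to the same height, so that each x group adds up to 100%

    ggplot(data = diamonds) +
    geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "fill"
    )
    

    "dodge" places the bars side by side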

    ggplot(data = diamonds) +
    geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "dodge"
    )
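To see what the position adjustments do to the numbers, the sketch below compares the top of each stack under "stack" (the default) and under "fill". The helper top_of_stack() is a hypothetical function written for this illustration, not part of ggplot2.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = cut, fill = clarity))

# Hypothetical helper: height of the tallest bar segment at each x position
top_of_stack <- function(position) {
  built <- layer_data(base + geom_bar(position = position))
  as.numeric(tapply(built$ymax, as.integer(built$x), max))
}

top_stack <- top_of_stack("stack")  # total count of diamonds per cut
top_fill <- top_of_stack("fill")    # every stack rescaled to a height of 1

top_stack
top_fill
```

With "stack" the totals keep their original scale, so cuts can be compared by size; with "fill" the totals are lost, but the composition of each cut is easier to compare.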
    

    position = "jitter" adds a small amount of random noise to each point, which reveals points that overlap.

    ggplot(data = mpg) +
    geom_point(
    mapping = aes(x = displ, y = hwy),
    position = "jitter"
    )
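How much overplotting there is can be estimated by comparing the number of observations with the number of distinct (displ, hwy) positions. A small sketch using base R only; n_obs, n_positions, and n_hidden are illustrative names.

```r
library(ggplot2)

# Total number of observations
n_obs <- nrow(mpg)

# Number of distinct positions in the scatterplot
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))

# Observations that are drawn on top of another point when no jitter is used
n_hidden <- n_obs - n_positions

c(observations = n_obs, positions = n_positions, hidden = n_hidden)
```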
    

    See the help pages ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

    ggplot(mtcars,aes(factor(cyl),fill=factor(vs))) + 
      geom_bar(position = position_dodge(preserve = 'total'))
    

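    Exercises:

    2. Which parameters of geom_jitter() control the amount of jitter?
      width and height control it in the horizontal and vertical directions

    3. Compare geom_jitter() with geom_count()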

    # geom_jitter() adds random noise to the points
    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_jitter()
    # geom_count(): the more points overlap at a location, the larger the dot
    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_count()
    
    4. What is the default position adjustment of geom_boxplot()?
    ggplot(data = mpg, mapping = aes(x = drv, y = hwy,color=class)) +
      geom_boxplot()
    ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
      geom_boxplot(position = "identity")
    

    The default is position_dodge2(), a variant of position_dodge()
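The default can also be read off a layer object instead of the help page. The sketch below relies on the position field of the layer returned by geom_boxplot(); this is an inspection of ggplot2 internals, so treat it as an illustration that may depend on the installed version (it assumes ggplot2 3.0.0 or later). box_layer and position_class are illustrative names.

```r
library(ggplot2)

# A layer created with all defaults
box_layer <- geom_boxplot()

# Class of the position adjustment attached to the layer
position_class <- class(box_layer$position)[1]
position_class
```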

    3.9 Coordinate Systems

    The default coordinate system of ggplot2 is Cartesian, in which the x and y positions act independently
    coord_flip() swaps the x and y axes

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
    geom_boxplot()
    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
    geom_boxplot() +
    coord_flip()
    

    coord_quickmap()
    sets the aspect ratio correctly for maps
    The maps package must be installed, otherwise map_data() raises an error.

    library(maps)
    # If this fails, run: install.packages("maps")
    # library(maps)
    nz <- map_data("nz")
     
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black")
     # geom_polygon() draws polygons
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black") +
      coord_quickmap()
    

    coord_polar() uses polar coordinates

    bar <- ggplot(data = diamonds) +
    geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
    ) +
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)
    bar + coord_flip()
    bar + coord_polar()
    
    # width = 1 removes the gaps between the bars
    ggplot(mpg, aes(x = factor(1), fill = drv)) +
      geom_bar()
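    # theta = "y" maps the y variable (the count) to the angle, giving a pie chart; without it, x is mapped to the angle and the bars become concentric rings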
    ggplot(mpg, aes(x = factor(1), fill = drv)) +
      geom_bar(width = 1) +
      coord_polar(theta = "y")
    ggplot(mpg, aes(x = factor(1), fill = drv)) +
      geom_bar(width = 1) +
      coord_polar()
    ggplot(diamonds) + geom_bar(aes(x=cut,fill=cut))+coord_polar()
    
    # A grouped bar chart can also be turned into pie charts
    head(diamonds)
    ggplot(diamonds,aes(cut,fill=color)) + 
      geom_bar(position = "fill") # note the position argument; the default for geom_bar() is position = "stack"
    
    ggplot(diamonds,aes(cut,fill=color)) + 
      geom_bar(position = "fill") + 
      coord_polar(theta = "y")
    
    

    Exercise 3.9.2 labs() adds x and y axis labels and a title to the plot

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
      geom_boxplot() +
      coord_flip() +
      labs(y = "Highway MPG", x = "", title = "Highway MPG by car class")
    

    Exercise 3.9.4
    coord_fixed() fixes the aspect ratio at 1:1, so the reference line drawn by geom_abline() stays at 45 degrees

    p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point() +
      geom_abline()
    p
    p + coord_fixed() 
    

    3.10 The Layered Grammar of Graphics
    The general ggplot2 template

    ggplot(data = <DATA>) + # data set
    <GEOM_FUNCTION>( # geom
    mapping = aes(<MAPPINGS>), # aesthetic mappings
    stat = <STAT>, # statistical transformation
    position = <POSITION> # position adjustment
    ) +
    <COORDINATE_FUNCTION> + # coordinate system
    <FACET_FUNCTION> # faceting

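    The template has seven parameters. The first five (data, geom, mappings, stat, and position) build the plot; the last two (coordinate system and faceting) are used for fine-tuning.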
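As an illustration, here is one way to fill in every slot of the template with the diamonds data. The particular choices (bar chart, "fill" position, flipped coordinates, facets by color) are picked for this example only and are not prescribed by the template.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +               # data set
  geom_bar(                                  # geom
    mapping = aes(x = cut, fill = clarity),  # aesthetic mappings
    stat = "count",                          # statistical transformation
    position = "fill"                        # position adjustment
  ) +
  coord_flip() +                             # coordinate system
  facet_wrap(~ color)                        # faceting

p
```

In everyday use most of these slots are left at their defaults, and only the data, the geom, and the mappings are written out.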

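    Recommended reading:
    生信技能树 free video collection. Suggested order: Linux, R, software installation, GEO, tips and tricks, NGS omics.
    Bilibili: https://m.bilibili.com/space/338686099
    YouTube: https://m.youtube.com/channel/UC67sImqK7V8tSWHMG8azIVA/playlists
    A beginner's guide for bioinformatics engineers: https://mp.weixin.qq.com/s/vaX4ttaLIa19MefD86WfUA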
