[R] Visualization with the ggplot2 package 《R for data sc

Author: 半为花间酒 | Published: 2020-04-06 09:33

    Study notes on the Data visualisation chapters (Chapters 2 and 3) of R for Data Science

    References

    1. R for Data Science
    2. 《R数据科学》 (the Chinese edition of R for Data Science)
    3. The Layered Grammar of Graphics.
    4. ggplot2: Points

    “The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
    “The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

    A graphing template

    ggplot(data = <DATA>) + 
      <GEOM_FUNCTION>(
         mapping = aes(<MAPPINGS>),
         stat = <STAT>, 
         position = <POSITION>
      ) +
      <COORDINATE_FUNCTION> +
      <FACET_FUNCTION>
    
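As a minimal sketch of how the placeholders get filled in, the block below substitutes concrete values using the diamonds data set that ships with ggplot2. The particular choices (geom_bar(), the "fill" position, coord_flip(), faceting by color) are illustrative assumptions and are not taken from the notes above.

```r
library(ggplot2)

# Each argument corresponds to one placeholder of the template.
p <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),  # <MAPPINGS>
    stat = "count",                          # <STAT> (the default for geom_bar)
    position = "fill"                        # <POSITION>
  ) +
  coord_flip() +                             # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                        # <FACET_FUNCTION>

p
```

In everyday use most of these arguments are left at their defaults, so only the data, the geom, and the mapping are usually written out.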

    Aesthetic mappings

    library(ggplot2)
    library(patchwork)  # provides p1 + p2 for placing plots side by side
    
    # Left
    p1 <- ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
    
    # Right
    p2 <- ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, shape = class))  
    
    p1 + p2
    # Warning messages:
    # 1: Using alpha for a discrete variable is not advised. 
    # 2: The shape palette can deal with a maximum of 6 discrete values
    # because more than 6 becomes difficult to discriminate; you have
    # 7. Consider specifying shapes manually if you must have them. 
    # 3: Removed 62 rows containing missing values (geom_point). 
    

    ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
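The warning suggests specifying shapes manually. A minimal sketch of that, assuming scale_shape_manual() with the shape codes 0 to 6 (an arbitrary choice made for illustration), could look like this:

```r
library(ggplot2)

# mpg$class has 7 levels, one more than the default shape palette covers,
# so supply one shape code per level by hand.
n_class <- length(unique(mpg$class))

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_class) - 1)

p
```

With one shape per level, no group is dropped from the plot, although seven shapes are still hard to tell apart.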

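    - How do these aesthetics behave differently for categorical vs. continuous variables?

    '''
    color (ordered aesthetic)
    1. Mapped to a categorical variable: each level gets its own distinct color
    2. Mapped to a continuous variable: values are placed on a color gradient with a fixed range
    
    size (ordered aesthetic)
    1. Mapped to a categorical variable: each level gets its own point size, but the sizes carry no meaning, and ggplot2 issues a warning
    2. Mapped to a continuous variable: point size increases with the value
    
    shape (unordered aesthetic)
    1. Mapped to a categorical variable: each level gets its own shape; at most 6 shapes are shown at once, and further levels are dropped with a warning
    2. Mapped to a continuous variable: not possible (ggplot2 raises an error)
    '''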
    

    - Variable types in mpg
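One way to check the variable types is to look at the class of each column. The sketch below uses base R only; the names var_types, discrete_vars and continuous_vars are made up for this illustration.

```r
library(ggplot2)

# Class of every column in mpg
var_types <- sapply(mpg, class)
var_types

# Character columns are treated as discrete by ggplot2,
# integer and double columns as continuous.
discrete_vars   <- names(var_types)[var_types == "character"]
continuous_vars <- names(var_types)[var_types %in% c("integer", "numeric")]
```

Note that an integer column such as cyl counts as continuous here even though it only takes a few values; wrap it in factor() to map it as a categorical variable.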

    • The stroke aesthetic (border width of a point)
    p1 <- ggplot(mpg,aes(x = displ, y = hwy)) +
      geom_point(shape = 1)
    
    p2 <- ggplot(mpg,aes(x = displ, y = hwy)) +
      geom_point(shape = 1,stroke = 2)
    
    p1 + p2
    

    Facets

    - Wrapped facets: facet_wrap()

    ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2)
    

    The arguments of facet_wrap() are documented in ?facet_wrap; strip.position is demonstrated further down.
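As a quick alternative to opening the help page, the argument names can be read off the function itself. This is only a sketch; the exact list depends on the installed ggplot2 version.

```r
library(ggplot2)

# Names of all arguments accepted by facet_wrap() in the installed version
wrap_args <- names(formals(facet_wrap))
wrap_args
```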


    # strip.position controls on which side of each panel the strip label is placed
    p1 <- ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2, strip.position = 'bottom')
    
    p2 <- ggplot(mpg) + 
      geom_point(aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2, strip.position = 'right')
    
    p1 + p2
    

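    - Showing the full data set in every facet

    Dropping the faceting variable from a layer's data makes that layer appear in every panel.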

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(data = transform(mpg, class = NULL), 
                 colour = "grey85") +
      geom_point() +
      facet_wrap(~ class)
    

    - Grid facets: facet_grid()

    # A . means no faceting along that dimension (rows or columns)
    p1 <- ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) +
      facet_grid(drv ~ .) # rows ~ columns
    
    p2 <- ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) +
      facet_grid(. ~ cyl)
    
    p1 + p2
    

    Geometric objects

    - Hiding the legend and the confidence band

    p1 <- ggplot(mpg) +
      geom_smooth(aes(x = displ, y = hwy))
    
    p2 <- ggplot(mpg,aes(x = displ, y = hwy, group = drv)) +
      geom_smooth(se = FALSE)
    
    p3 <- ggplot(mpg) +
      geom_smooth(
        aes(x = displ, y = hwy, color = drv),
        show.legend = FALSE)
    
    p1 + p2 + p3
    

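    - Combining with filter()

    ggplot(mpg, aes(x = displ, y = hwy)) + 
      geom_point(aes(color = class)) + 
      # dplyr::filter() is spelled out so that stats::filter() is not picked up by mistake
      geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)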
    

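    - Fine details

    Both plots draw points with a white outline and a colored interior. In the first, each point carries its own white border, so white shows up wherever points overlap. In the second, all the white points are drawn underneath the colored ones, so no white appears inside a cluster of overlapping points.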

    p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(fill=drv),shape=21,color='white',size=2.5,stroke=1.5)
    
    p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(color='white',size=3.5)+
      geom_point(aes(color=drv),shape=16,size=2.3)
    
    p1 + p2
    

    Statistical transformations

    Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
    Smoothers fit a model to your data and then plot predictions from the model.
    Boxplots compute a robust summary of the distribution and then display a specially formatted box.
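To make the first case concrete, the values computed by a stat can be inspected with layer_data(). The sketch below does this for a bar chart of the diamonds data; the names p and bar_data are chosen for illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() returns the data frame after the stat has been applied
bar_data <- layer_data(p)
bar_data[, c("x", "count", "prop")]

# The counts add up to the number of rows in the original data
sum(bar_data$count) == nrow(diamonds)
```

The count and prop columns are the computed variables that a mapping such as y = ..prop.. refers to.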

    - Common geom/stat equivalents

    You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

    ggplot(data = diamonds) + 
      stat_count(mapping = aes(x = cut))
    # equivalent to
    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut), stat = 'count') # 'count' is the default stat, so it can be omitted
    
    # fun.min / fun.max / fun replace fun.ymin / fun.ymax / fun.y,
    # which are deprecated since ggplot2 3.3.0
    ggplot(data = diamonds) +
      geom_pointrange(
        mapping = aes(x = cut, y = depth),
        stat = "summary",
        fun.min = min,
        fun.max = max,
        fun = median
      )
    # equivalent to
    ggplot(data = diamonds) +
      stat_summary(
        mapping = aes(x = cut, y = depth),
        fun.min = min,
        fun.max = max,
        fun = median
      )
    
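    # The same plot can also be rebuilt by hand
    library(dplyr)  # provides %>%, group_by() and summarise()
    
    ggplot(diamonds, aes(cut, depth)) + 
      geom_line(size = 1) +   # one vertical line per cut, from min to max depth
      # a layer that uses different data must name it explicitly: data = ...
      geom_point(data = diamonds %>%   
                   group_by(cut) %>% 
                   summarise(depth = median(depth)),
                 size = 2) 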
    

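    - Overriding the default mapping

    ggplot(diamonds) + 
      geom_bar(aes(x = cut, y = stat(prop), group = 1, fill = stat(prop)))
    # equivalent to (ggplot2 >= 3.3.0 also provides after_stat(prop), which supersedes both spellings)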
    p1 <- ggplot(diamonds) + 
      geom_bar(aes(x = cut, y = ..prop.., group = 1, fill = ..prop..))
    
    p2 <- ggplot(diamonds) + 
      geom_bar(aes(x = cut, y = ..prop.., group = color, fill = color))
    
    p1 + p2
    

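    - What does geom_col() do? How is it different to geom_bar()?

    1. geom_col() also draws a bar chart, but it uses stat "identity", i.e. no statistical transformation: bar heights are taken directly from the y values in the data
    2. geom_bar() uses stat "count" by default: bar heights are the number of rows in each group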
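The difference can be sketched by drawing the same chart both ways. The table of counts (cut_counts) is computed here with base R purely for illustration.

```r
library(ggplot2)

# Pre-compute the counts so that each row already holds a bar height
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# geom_col(): heights come straight from the Freq column
p_col <- ggplot(cut_counts) +
  geom_col(aes(x = cut, y = Freq))

# geom_bar(): heights are counted from the raw rows
p_bar <- ggplot(diamonds) +
  geom_bar(aes(x = cut))
```

Both plots show the same bars; they differ only in whether the counting happens before plotting or inside ggplot2.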

    - Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
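A possible starting point for this exercise, besides reading the documentation, is to ask a layer which stat or geom it was built with. The helper names default_stat() and default_geom() are invented for this sketch, and it assumes that layer objects expose their stat and geom as the fields stat and geom.

```r
library(ggplot2)

# A layer object remembers which stat and geom it was built with.
default_stat <- function(layer) class(layer$stat)[1]
default_geom <- function(layer) class(layer$geom)[1]

default_stat(geom_bar())      # stat behind geom_bar()
default_geom(stat_count())    # geom behind stat_count()
default_stat(geom_boxplot())  # stat behind geom_boxplot()
default_stat(geom_smooth())   # stat behind geom_smooth()
```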

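    Position adjustments

    position = "identity" places each object exactly where its values fall, so objects overlap; rarely suitable for presenting bars
    position = "fill" stacks the bars and scales every stack to the same height, giving a percentage bar chart
    position = "dodge" places the bars side by side
    position = "stack" stacks the objects on top of one another
    position = "jitter" adds random noise to each position; typically used for scatterplots

    An example borrowed from Dr. Liu: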

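    library(ggplot2)
    library(patchwork)
    
    set.seed(1)  # make the random y values reproducible
    v <- data.frame(x = rep(1:20, times = 2),  # the same 20 x positions for each group
                    y = runif(40, min = 10, max = 20),
                    z = rep(c("A", "B"), each = 20))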
                    
    p1 <- ggplot(v, aes(x, y, fill = z))+
      geom_area(position = position_dodge(), alpha = 0.5) +
      labs(title = "position_dodge()")
    
    p2 <- ggplot(v, aes(x, y, fill = z))+
      geom_area(position = position_fill(), alpha = 0.5) +
      labs(title = "position_fill()")
    
    p3 <- ggplot(v, aes(x, y, fill = z))+
      geom_area(position = position_stack(), alpha = 0.5) +
      labs(title = "position_stack()")
    
    p4 <- ggplot(v, aes(x, y, fill = z))+
      geom_area(position = position_identity(), alpha = 0.5) +
      labs(title = "position_identity()")
    
    p5 <- ggplot(v, aes(x, y, fill = z))+
      geom_area(position = position_jitter(), alpha = 0.5) +
      labs(title = "position_jitter(), usually for point")
    
    (p1 + p2 + p3)/(p4 + p5) 
    
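    • geom_jitter(): jittering

    geom_jitter() adds a small amount of random noise to the position of each point
    geom_count() counts the observations at each position and maps the count to point size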
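To see why these two geoms matter for the cty/hwy scatterplot, the amount of overplotting can be measured directly. This is a base R sketch; the names n_rows, n_positions and overlap are chosen for illustration.

```r
library(ggplot2)

# Number of rows versus number of distinct (cty, hwy) positions
n_rows      <- nrow(mpg)
n_positions <- nrow(unique(mpg[, c("cty", "hwy")]))

# How many rows share each position; this is what geom_count() maps to size
overlap <- as.data.frame(table(cty = mpg$cty, hwy = mpg$hwy))
overlap <- overlap[overlap$Freq > 0, ]
```

Because n_positions is smaller than n_rows, a plain geom_point() hides some observations behind others.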

    p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point()
    
    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_jitter()
    # equivalent to
    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(position = position_jitter())
    # equivalent to
    p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point(position = 'jitter')
    
    # geom_count()
    p3 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_count()
    
    p1 + p2 + p3
    

    Coordinate systems

    - coord_flip()

    coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

    p1 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot()
    
    p2 <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() +
      coord_flip()
    
    p1 + p2 
    

    - coord_quickmap()

    coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.

    # map_data() needs the maps package to be installed
    nz <- map_data("nz")
    
    p1 <- ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black")
    
    p2 <- ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black") +
      coord_quickmap()
    
    p1 + p2 
    

    - coord_polar()

    bar <- ggplot(data = diamonds) + 
      geom_bar(
        mapping = aes(x = cut, fill = cut), 
        show.legend = FALSE,
        width = 1
      ) + 
      theme(aspect.ratio = 1) +
      labs(x = NULL, y = NULL)
    
    p1 <- bar + coord_flip()
    p2 <- bar + coord_polar()
    
    p1 + p2 
    

    Going further:

    - Turn a stacked bar chart into a pie chart using coord_polar()

    p1 <- ggplot(diamonds) +
      geom_bar(aes(x = cut, fill = clarity)) + 
      coord_polar()
    
    p2 <- ggplot(diamonds) +
      geom_bar(aes(x = cut, fill = clarity),
               position = 'fill') + 
      coord_polar()
    
    # theta: the variable to map angle to (x or y)
    # with theta = "y", the proportions computed from the values are mapped to the angle
    p3 <- ggplot(diamonds) +
      geom_bar(aes(x = cut, fill = clarity),
               position = 'fill') + 
      coord_polar(theta = "y")
    
    p1 + p2 + p3
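The plots above keep cut on the x axis, so they produce rings or wedges per cut rather than a single pie. A minimal sketch of a conventional pie chart, assuming one stacked bar at a constant x position (the factor(1) placeholder is an illustrative choice), could look like this:

```r
library(ggplot2)

# One stacked bar (constant x) wrapped around the circle gives a pie chart
pie <- ggplot(diamonds) +
  geom_bar(aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)

pie
```

Here width = 1 removes the hole in the middle, and theta = "y" maps the stacked counts to the angle.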
    

    - What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

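    '''
    City and highway fuel efficiency are positively correlated.
    coord_fixed() fixes the ratio between the x and y units (1:1 by default), so a line of slope 1 is drawn at 45 degrees.
    geom_abline() draws a straight line; by default intercept = 0 and slope = 1.
    Both can be set explicitly with the intercept and slope arguments.
    '''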
    
    p1 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point() + 
      geom_abline() +
      coord_fixed()
    
    p2 <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
      geom_point() +
      geom_abline(intercept=-5,slope=1) +
      coord_fixed()
    
    p1 + p2
    


    Link to this article: https://www.haomeiwen.com/subject/xdvsphtx.html