R Data Visualization: Visualizing Sets with UpSetR


Author: 名本无名 | Source: published 2021-05-06 20:58

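    Introduction

    In the previous section we covered how to draw Venn diagrams to show how sets overlap.

    As the number of sets grows, however, a Venn diagram becomes increasingly complex, and it is hard to read anything from it at a glance.

    This section shows how to plot the overlaps when there are many sets.

    We will use the UpSetR package to draw an UpSet plot.

    The plot consists of three panels:

    1. A bar chart of intersection sizes (top)
    2. A horizontal bar chart of set sizes (bottom left)
    3. A matrix of set combinations (bottom right). Each column of the matrix is one combination of sets and lines up with a bar in the top chart; each row is one set and lines up with a bar in the set-size chart

    A single figure like this can show how many sets overlap, and the size of each intersection is easy to read off.

    So how do we draw one?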

    Basics

    1. Installing and loading

    install.packages("UpSetR")
    
    library(UpSetR)
    

    We will use the example data that ships with the package

    movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), 
        header = T, sep = ";")
    

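    2. Data

    Before plotting, we need to know the expected input format.

    UpSetR provides two conversion functions, fromList and fromExpression, for getting data into that format.

    • fromList takes a list in which each element is one set and converts it into a data frame, for example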
    listInput <- list(
            one = c(1, 2, 3, 5, 7, 8, 11, 12, 13), 
            two = c(1, 2, 4, 5, 10), 
            three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
    
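    • fromExpression takes a named vector in which each name is a set, or several sets joined with &, and each value is the number of elements that belong to exactly that combination, for example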
    expressionInput <- c(
            one = 2, two = 1, three = 2, 
            `one&two` = 1, `one&three` = 4, 
            `two&three` = 1, `one&two&three` = 2)
    

    Either input can be used to draw the plot:

    upset(fromList(listInput), order.by = "freq")
    # upset(fromExpression(expressionInput), order.by = "freq")
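
    The two inputs above describe the same data, which can be checked without plotting. The sketch below uses only base R and is not how UpSetR itself converts the data; the names membership, combo, and counts are illustrative. It builds a 0/1 membership table for listInput, then counts the elements that belong to exactly each combination of sets.

```r
listInput <- list(
  one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# One row per distinct element, one 0/1 column per set
elements <- sort(unique(unlist(listInput)))
membership <- as.data.frame(
  sapply(listInput, function(s) as.integer(elements %in% s)))

# Label each element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(names(membership)[r == 1], collapse = "&")
})
counts <- table(combo)

print(colSums(membership))  # total size of each set
print(counts)               # size of each exclusive intersection
```

    The counts match expressionInput entry for entry (for example, one&three is 4), while colSums(membership) gives the full set sizes 9, 5, and 9.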
    

    3. Plotting a subset of the sets

    Here we set nsets = 6 to limit the plot to the six largest sets

    upset(movies, nsets = 6, 
          number.angles = 30, 
          point.size = 3.5, 
          line.size = 2, 
          mainbar.y.label = "Genre Intersections", 
          sets.x.label = "Movies Per Genre", 
          text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))
    

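    Other parameters adjust the appearance of the plot. number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection-size and set-size bar charts; text.scale takes six values that scale the text labels in the plot.

    The values of text.scale are, in order:

    • intersection-size axis title
    • intersection-size tick labels
    • set-size axis title
    • set-size tick labels
    • set names
    • numbers above the bars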
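
    Because the six positions are easy to mix up, one option is to build the vector with names and strip them when calling upset(). This is only a sketch: the names in text_scale are labels chosen here for readability, not something UpSetR reads, and the plot call is skipped when the package is not installed.

```r
# Names are for the reader only; upset() receives the bare values
text_scale <- c(
  intersection_size_title = 1.3,
  intersection_size_ticks = 1.3,
  set_size_title          = 1,
  set_size_ticks          = 1,
  set_names               = 2,
  bar_numbers             = 0.75)

if (requireNamespace("UpSetR", quietly = TRUE)) {
  movies <- read.csv(
    system.file("extdata", "movies.csv", package = "UpSetR"),
    header = TRUE, sep = ";")
  print(UpSetR::upset(movies, nsets = 6, text.scale = unname(text_scale)))
}
```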

    We can also name the sets to display

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          mb.ratio = c(0.55, 0.45)
          )
    

    mb.ratio controls the relative heights of the top and bottom panels

    4. Ordering

    The order.by parameter sorts the intersections.

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          mb.ratio = c(0.55, 0.45),
          order.by = "freq",
          decreasing = TRUE
          )
    

    With order.by = "freq", setting decreasing = TRUE sorts the intersections from largest to smallest

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          mb.ratio = c(0.55, 0.45),
          order.by = "degree",
          decreasing = FALSE
          )
    

    With order.by = "degree", setting decreasing = FALSE sorts the intersections from lowest to highest degree
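
    To make the two sort keys concrete: "freq" is the size of an intersection and "degree" is the number of sets taking part in it. The base R sketch below computes both for the small expressionInput vector from earlier and sorts by each. It only illustrates the idea; how UpSetR orders ties internally is not covered here and may differ.

```r
expressionInput <- c(
  one = 2, two = 1, three = 2,
  `one&two` = 1, `one&three` = 4,
  `two&three` = 1, `one&two&three` = 2)

# Degree: how many sets are joined by "&" in each name
degree <- lengths(strsplit(names(expressionInput), "&", fixed = TRUE))

# Largest intersections first (like order.by = "freq", decreasing = TRUE)
by_freq_desc <- names(sort(expressionInput, decreasing = TRUE))

# Lowest degree first (like order.by = "degree", decreasing = FALSE)
by_degree_asc <- names(expressionInput)[order(degree)]

print(by_freq_desc)
print(by_degree_asc)
```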

    Both keys can also be given at once

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          mb.ratio = c(0.55, 0.45),
          order.by = c("degree", "freq"),
          decreasing = c(TRUE, FALSE)
          )
    

    To keep the sets in the order in which they appear in the sets parameter, set keep.order = TRUE

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          mb.ratio = c(0.55, 0.45),
          order.by = c("degree", "freq"),
          decreasing = c(TRUE, FALSE),
          keep.order = TRUE
          )
    

    To show combinations whose intersection is empty, set the empty.intersections parameter

    upset(movies, 
          sets = c("Action", "Comedy", "Drama", 
                   "Mystery", "Thriller", "Romance", "War"),
          empty.intersections = "on"
          )
    

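    Queries

    Queries are passed through the queries parameter, which takes a nested list so that several queries can be given at once. Each query is a list with four main fields:

    • query: the query function to run
    • params: a list of parameters for the query
    • color: the color used in the plot for elements that match the query
    • active: if TRUE, the matching part of the intersection-size bar is overlaid in the query color; if FALSE, a jittered point is placed on the bar instead

    For example

    1. Built-in intersection query

    The built-in intersects query finds a specific intersection and colors it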

    upset(movies, queries = list(
      list(
        query = intersects, 
        params = list("Drama", "Comedy", "Action"), 
        color = "orange", 
        active = T), 
      list(
        query = intersects, 
        params = list("Drama"), 
        color = "red", 
        active = F), 
      list(
        query = intersects,
        params = list("Action", "Drama"), 
        active = T)
      )
      )
    

    2. Built-in element query

    The built-in elements query shows how elements with given attribute values are distributed across the intersections

    upset(movies, 
          queries = list(
            list(
              query = elements, 
              params = list("AvgRating",  3.5, 4.1), 
              color = "blue", 
              active = T), 
            list(
              query = elements, 
              params = list("ReleaseDate", 1980, 1990, 2000), 
              color = "red", 
              active = F)
            )
          )
    

    3. Using an expression

    The expression parameter takes a filter expression that keeps only the subset of query results matching it.

    upset(movies, 
          queries = list(
            list(
              query = intersects, 
              params = list("Action", "Drama"), 
              active = T), 
            list(
              query = elements, 
              params = list("ReleaseDate", 1980, 1990, 2000), 
              color = "red", 
              active = F)), 
          expression = "AvgRating > 3 & Watches > 100"
          )
    

    4. Custom queries

    A query function is applied to every row of the data. We can define one like this

    # TRUE for rows whose release year is in `release`
    # and whose average rating is above `rating`
    Myfunc <- function(row, release, rating) {
      (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
    }
    

    It selects movies whose release year is in release and whose average rating is above a given value
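
    A query function like this can be checked on its own before handing it to upset(). The sketch below assumes each row arrives as a named vector with ReleaseDate and AvgRating entries, which is what the row[...] indexing above implies; the two sample rows are made up for illustration.

```r
Myfunc <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

hit  <- c(ReleaseDate = 1990, AvgRating = 3.2)  # made-up row that should match
miss <- c(ReleaseDate = 1985, AvgRating = 4.5)  # year is not in the list

years <- c(1970, 1980, 1990, 1999, 2000)
print(unname(Myfunc(hit, years, 2.5)))
print(unname(Myfunc(miss, years, 2.5)))
```

    The first call returns TRUE and the second returns FALSE, because 1985 is not among the listed years.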

    Run the query

    upset(movies, 
          queries = list(
            list(
              query = Myfunc, 
              params = list(c(1970, 1980, 1990, 1999, 2000), 2.5), 
              color = "blue", 
              active = T)
            )
          )
    

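    5. Adding a query legend

    The query.legend parameter sets the position of the query legend, either top or bottom

    Within a query, query.name sets its name; if it is not set, a name is generated automatically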

    upset(movies, 
          query.legend = "top", 
          queries = list(
            list(
              query = intersects, 
              params = list("Drama", "Comedy", "Action"), 
              color = "orange", active = T, 
              query.name = "Funny action"), 
            list(
              query = intersects, 
              params = list("Drama"), 
              color = "red", active = F), 
            list(
              query = intersects, 
              params = list("Action", "Drama"), 
              active = T, 
              query.name = "Emotional action")
            )
          )
    

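    Attribute plots

    The attribute.plots parameter draws attribute plots. It takes a list with three fields:

    • gridrows: the space given to the attribute plots. The UpSet plot uses a 100 x 100 grid by default, so gridrows = 50 makes the whole figure 150 x 100
    • plots: a list of plots, each with four fields:
      • plot: a function that returns a ggplot object
      • x: the variable for the x axis
      • y: the variable for the y axis
      • queries: whether to overlay the plot data with the results of the existing queries
    • ncols: the number of columns to arrange the plots in

    1. Built-in plotting functions

    We use the package's own histogram function to draw histograms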

    upset(movies, 
          main.bar.color = "black", 
          queries = list(
            list(
              query = intersects, 
              params = list("Drama"), 
              active = T)
            ), 
          attribute.plots = list(
            gridrows = 50, 
            plots = list(
              list(
                plot = histogram, 
                x = "ReleaseDate", 
                queries = F), 
              list(
                plot = histogram,
                x = "AvgRating", 
                queries = T)
              ), 
            ncols = 2
            )
          )
    

    Use the scatter_plot function to draw scatter plots

    upset(movies, 
          main.bar.color = "black", 
          queries = list(
            list(
              query = intersects, 
              params = list("Drama"), 
              color = "red", 
              active = F), 
            list(
              query = intersects, 
              params = list("Drama", "Comedy", "Action"), 
              color = "orange", 
              active = T)
            ), 
          attribute.plots = list(
            gridrows = 45, 
            plots = list(
              list(
                plot = scatter_plot, 
                x = "ReleaseDate", 
                y = "AvgRating", 
                queries = T), 
              list(plot = scatter_plot, 
                   x = "AvgRating", 
                   y = "Watches", 
                   queries = F)
              ), 
            ncols = 2), 
          query.legend = "bottom"
          )
    

    2. Custom plotting functions

    We first define two ggplot2-based functions, one for a scatter plot and one for a density plot

    library(ggplot2)  # provides ggplot(), aes_string(), theme(), unit()

    my_scatter <- function(data, x, y) {
      p <- ggplot(data, aes_string(x, y, colour = "color")) +
        geom_point() +
        scale_colour_identity() +
        theme(
          plot.margin = unit(c(0, 0, 0, 0), "cm")
        )
      p
    }
    
    my_density <- function(data, x, y) {
      data$decades <- data[, y] %/% 10 * 10
      data <- data[which(data$decades >= 1970), ]
      p <- ggplot(data, aes_string(x)) +
        geom_density(aes(fill = factor(decades)), alpha = 0.3) +
        theme(
          plot.margin = unit(c(0, 0, 0, 0), "cm"), 
          legend.key.size = unit(0.4, "cm")
        )
      p
    }
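
    In ggplot2 3.0.0 and later, aes_string() is deprecated, although it still works. A possible variant of my_scatter uses the .data pronoun instead. This is a sketch, not part of UpSetR: the name my_scatter_tidy and the toy data frame are made up here, and it assumes, as my_scatter does, that the data passed in carries a color column.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Same plot as my_scatter, with column names looked up through .data
  my_scatter_tidy <- function(data, x, y) {
    ggplot(data, aes(x = .data[[x]], y = .data[[y]],
                     colour = .data[["color"]])) +
      geom_point() +
      scale_colour_identity() +
      theme(plot.margin = unit(c(0, 0, 0, 0), "cm"))
  }

  # Made-up rows standing in for the data UpSetR would pass
  toy <- data.frame(
    ReleaseDate = c(1975, 1984, 1999),
    AvgRating   = c(3.1, 2.4, 4.0),
    color       = c("gray23", "red", "orange"))

  p <- my_scatter_tidy(toy, "ReleaseDate", "AvgRating")
  print(p)
}
```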
    

    Then use them as attribute plots

    upset(movies, 
          main.bar.color = "black", 
          queries = list(
            list(
              query = intersects, 
              params = list("Drama"), 
              color = "red", active = F), 
            list(
              query = intersects, 
              params = list("Action", "Drama"), 
              active = T),
            list(
              query = intersects, 
              params = list("Drama", "Comedy", "Action"), 
              color = "orange", active = T)
            ), 
          attribute.plots = list(
            gridrows = 45, 
            plots = list(
              list(
                plot = my_scatter, 
                x = "ReleaseDate", 
                y = "AvgRating", 
                queries = T),
              list(
                plot = my_density,
                x = "AvgRating",
                y = "ReleaseDate",
                queries = F)
              ),
            ncols = 2)
          )
    

    3. Box plots

    To draw box plots, use the boxplot.summary parameter. At most two variables can be plotted at once.

    upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
    

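    The same can of course be done with a custom plotting function

    Set metadata

    The set.metadata parameter attaches metadata to the sets. It takes a list with three fields:

    • data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets
    • ncols: the number of columns
    • plots: also a list, each element having four fields: column, type, assign, and colors
      • column: the name of the column in data to plot

      • type: the kind of plot to draw. For a numeric column it can be hist or heat; for a boolean column, bool draws a boolean heat map; for a categorical (string) column it can be heat or text; to apply the metadata inside the matrix, use matrix_rows

      • assign: the number of grid columns given to this metadata plot. If two plots are assigned 20 and 10, the UpSet figure becomes 100 x 130

      • colors: the colors for the metadata plot. For a bar plot, a single color applied to the whole plot; for heat or bool, a vector of colors; for a factor column there is no colors parameter and a color gradient is drawn; for text, one color per unique string, assigned automatically when omitted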
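
    The assign arithmetic can be checked directly. This sketch assumes the 100-column base layout described above; the variable names are illustrative.

```r
base_cols   <- 100                      # columns used by the UpSet plot itself
assign_cols <- c(hist = 20, heat = 10)  # columns assigned to two metadata plots

total_cols <- base_cols + sum(assign_cols)
print(total_cols)  # 130
```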

    1. Bar plot

    We add a metadata attribute to each set: a randomly generated average Rotten Tomatoes score for each of the 17 genres

    sets <- names(movies[3:19])
    avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
    metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
    names(metadata) <- c("sets", "avgRottenTomatoesScore")
    

    A bar plot requires the column to be numeric

    > str(metadata)
    'data.frame':   17 obs. of  2 variables:
     $ sets                  : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
     $ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...
    

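    The score column is a factor here (in R 4.0 and later the same code yields a character column instead), so it has to be converted first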

    metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
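
    The as.character() step matters: calling as.numeric() directly on a factor returns its internal level codes rather than the numbers it displays. Below is a small base R illustration with made-up scores, followed by one way to sidestep the issue by building the table with data.frame() instead of cbind(), which coerces every column to text.

```r
score_factor <- factor(c("13", "90", "21"))

print(as.numeric(score_factor))                # level codes: 1 3 2
print(as.numeric(as.character(score_factor)))  # actual values: 13 90 21

# data.frame() keeps the numeric column numeric (made-up values)
metadata_alt <- data.frame(
  sets = c("Action", "Comedy", "Drama"),
  avgRottenTomatoesScore = c(13, 90, 21))
str(metadata_alt)
```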
    

    Now we can draw the metadata plot

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "hist", 
                column = "avgRottenTomatoesScore", 
                assign = 20)
              )
            )
          )
    

    2. Heat map

    Next we extend the metadata with a city for each set, making sure the column is a character vector rather than a factor

    Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
    metadata <- cbind(metadata, Cities)
    metadata$Cities <- as.character(metadata$Cities)
    

    We draw two heat maps, one with colors specified and one without

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "heat",
                column = "Cities", 
                assign = 10, 
                colors = c(
                  Boston = "green", 
                  NYC = "navy",
                  LA = "purple")
                ), 
              list(
                type = "heat", 
                column = "avgRottenTomatoesScore", 
                assign = 10)
              )
            )
          )
    

    The heat map without specified colors is drawn with a gray gradient

    Boolean heat map

    We add an accepted column to the metadata, with values 0 or 1

    accepted <- round(runif(17, min = 0, max = 1))
    metadata <- cbind(metadata, accepted)
    

    The setup is similar to the one above

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "bool", 
                column = "accepted", 
                assign = 5, 
                colors = c("#FF3333", "#006400")
                )
              )
            )
          )
    

    If we change bool to heat

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "heat", 
                column = "accepted", 
                assign = 5, 
                colors = c("#FF3333", "#006400")
                )
              )
            )
          )
    

    the 0 and 1 values are treated as numeric and drawn with a color gradient

    3. Text

    For the city metadata, text is probably a better fit than a heat map

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "text", 
                column = "Cities", 
                assign = 10, 
                colors = c(
                  Boston = "green", 
                  NYC = "navy",        
                  LA = "purple")
                )
              )
            )
          )
    

    4. Applying metadata in the matrix

    Sometimes we want the metadata to show up directly in the UpSet plot. Setting type = "matrix_rows" colors the matrix rows by city

    upset(movies, 
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "hist", 
                column = "avgRottenTomatoesScore", 
                assign = 20), 
              list(
                type = "matrix_rows", 
                column = "Cities", 
                colors = c(
                  Boston = "green", 
                  NYC = "navy", 
                  LA = "purple"),
                alpha = 0.5)
              )
            )
          )
    

    Putting it all together

    Finally, we combine these plots into one figure

    upset(movies, 
          # Queries
          queries = list(
            list(
              query = intersects, 
              params = list("Drama"), 
              color = "red", 
              active = F), 
            list(
              query = intersects, 
              params = list("Action", "Drama"), 
              active = T), 
            list(
              query = intersects,
              params = list("Drama", "Comedy", "Action"), 
              color = "orange", 
              active = T)), 
          # Set metadata plots
          set.metadata = list(
            data = metadata, 
            plots = list(
              list(
                type = "hist", 
                column = "avgRottenTomatoesScore", 
                assign = 20), 
              list(
                type = "bool", 
                column = "accepted",
                assign = 5, 
                colors = c("#FF3333", "#006400")), 
              list(
                type = "text", 
                column = "Cities",
                assign = 5, 
                colors = c(
                  Boston = "green", 
                  NYC = "navy", 
                  LA = "purple")), 
              list(
                type = "matrix_rows", 
                column = "Cities", 
                colors = c(
                  Boston = "green", 
                  NYC = "navy", 
                  LA = "purple"), 
                alpha = 0.5)
              )
            ), 
          # Attribute plots
          attribute.plots = list(
            gridrows = 45, 
            plots = list(
              list(
                plot = my_scatter, 
                x = "ReleaseDate", 
                y = "AvgRating", 
                queries = T), 
              list(plot = my_density, 
                   x = "AvgRating", 
                   y = "ReleaseDate", 
                   queries = F)), 
            ncols = 2), 
          query.legend = "bottom"
          )
    

    Code:
    https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
