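Box plots
A box plot summarises a distribution with five statistics: the minimum, the first quartile, the median, the third quartile, and the maximum. (In geom_boxplot the whiskers stop at the most extreme observations within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual outlier point.)
A box plot makes it easy to see whether the data are symmetric, whether there are outliers, and how spread out the values are. It is also handy for comparing the distributions of several variables.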
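To make those statistics concrete, here is a minimal sketch that computes them by hand with base R. It uses the built-in mtcars data, and the variable names (q, lower_whisker and so on) are just illustrative choices. The 1.5 * IQR rule is the default whisker rule of geom_boxplot.

```r
# Box plot statistics for a single variable, using base R only.
y <- mtcars$mpg

# Quartiles: the lower hinge, the median and the upper hinge of the box.
q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Whiskers end at the most extreme observations within 1.5 * IQR of the box.
lower_whisker <- min(y[y >= q[1] - 1.5 * iqr])
upper_whisker <- max(y[y <= q[3] + 1.5 * iqr])

# Everything outside the whiskers would be drawn as an outlier point.
outliers <- y[y < lower_whisker | y > upper_whisker]

c(lower_whisker, q, upper_whisker)
outliers
```

If there are no outliers, the whisker ends are simply the minimum and maximum of the data.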
Examples
Start with the simplest box plot.
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot()
Flip the orientation
ggplot(mpg, aes(hwy, class)) + geom_boxplot()
Add notches
p + geom_boxplot(notch = TRUE)
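Notches are meant for comparing medians: ggplot2 documents them as extending 1.58 * IQR / sqrt(n) either side of the median, which gives roughly a 95% confidence interval. The sketch below reproduces that calculation per class. notch_limits is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)  # only needed here for the mpg data set

# Hypothetical helper: notch limits for one group of values.
notch_limits <- function(y) {
  q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
  half <- 1.58 * (q[3] - q[1]) / sqrt(length(y))
  c(lower = q[2] - half, median = q[2], upper = q[2] + half)
}

# One row per class, with the lower limit, the median and the upper limit.
notches <- t(sapply(split(mpg$hwy, mpg$class), notch_limits))
notches
```

If the notches of two boxes do not overlap, that suggests the two medians differ.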
The width of the notch relative to the box is set with the notchwidth parameter, which defaults to 0.5.
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot(notch = TRUE, notchwidth = 0.9)
By default every box has the same width. Setting varwidth = TRUE makes the width of each box proportional to the square root of the number of observations in its group.
p + geom_boxplot(varwidth = TRUE)
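As a rough illustration of the square-root rule, the snippet below counts the observations in each class and derives relative widths from them. It only shows the arithmetic and does not reproduce ggplot2's internal scaling. rel_width is an illustrative name.

```r
library(ggplot2)  # only needed here for the mpg data set

# Number of observations in each class.
n <- table(mpg$class)

# Widths relative to the widest box, following the square-root rule.
rel_width <- sqrt(n) / max(sqrt(n))
round(rel_width, 2)
```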
Set the colours of the boxes
p + geom_boxplot(fill = "white", colour = "#3366FF")
geom_boxplot has several parameters dedicated to the appearance of outliers:
- outlier.colour = NULL
- outlier.color = NULL (alternative spelling of outlier.colour)
- outlier.fill = NULL
- outlier.shape = 19
- outlier.size = 1.5
- outlier.stroke = 0.5
- outlier.alpha = NULL
For example
p + geom_boxplot(outlier.colour = "red", outlier.shape = 1)
Set the transparency
p + geom_boxplot(outlier.fill = "blue", outlier.shape = 21, alpha = 0.5)
When you overlay the raw points on a box plot, however, the outliers would be drawn twice, so you may want to hide them in the box plot layer
p + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.2)
Grouped box plots are placed side by side by default
p + geom_boxplot(aes(colour = drv))
For a continuous x variable you have to specify the grouping yourself. cut_width is a convenient way to do that
ggplot(diamonds, aes(carat, price)) +
geom_boxplot(aes(group = cut_width(carat, 0.5)))
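To see what the grouping looks like, you can call cut_width on its own. This short sketch bins carat into intervals of width 0.5 and counts the diamonds in each bin. bins is just an illustrative variable name.

```r
library(ggplot2)  # provides cut_width() and the diamonds data set

# cut_width() turns a continuous variable into a factor of equal-width bins.
bins <- cut_width(diamonds$carat, 0.5)

levels(bins)  # the interval labels, one per box in the plot
table(bins)   # how many diamonds fall into each interval
```

Each level of the factor becomes one box, so a smaller width gives more, narrower boxes.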
If the summary statistics have already been computed, you can pass them in directly and set stat = "identity".
For example
library(dplyr)  # provides tibble(), %>%, group_by() and summarise()
tibble(
x = rep(LETTERS[1:10], 10),
y = rnorm(100)
) %>% group_by(x) %>%
summarise(y0 = min(y), y25 = quantile(y, 0.25), y50 = median(y),
y75 = quantile(y, 0.75), y100 = max(y)) %>%
ggplot(aes(x)) +
geom_boxplot(
aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100,
fill = x),
stat = "identity"
)
Combining layers
- Mark the mean with a point
ggplot(mpg, aes(class, hwy)) +
geom_boxplot(aes(fill = class)) +
stat_summary(fun = "mean", fill = "white", size = 2, geom = "point", shape = 23)
- Then add error bars (caps on the ends of the whiskers)
ggplot(mpg, aes(class, hwy)) +
stat_boxplot(geom = "errorbar", width = 0.2) +
geom_boxplot(aes(fill = class)) +
stat_summary(fun = "mean", fill = "white",
size = 2, geom = "point", shape = 23)
- Add error bars to a grouped box plot
ggplot(mpg, aes(class, hwy)) +
stat_boxplot(aes(colour = drv), geom = "errorbar",
position = position_dodge2(preserve = 'single', padding = 0.5)) +
geom_boxplot(aes(fill = drv), position = position_dodge2(preserve = 'single'))
Note: not every class contains every drv group, so the boxes would otherwise end up with inconsistent widths. Setting preserve = 'single' keeps every box the same width.
The error bar layer also needs the grouping, which is why it maps colour = drv.
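A quick cross-tabulation shows which class and drv combinations are missing from mpg. This is only a small check written for this note. counts is an illustrative name.

```r
library(ggplot2)  # only needed here for the mpg data set

# Rows are classes, columns are drive types; a 0 marks a missing combination.
counts <- table(mpg$class, mpg$drv)
counts

# Number of drive types actually present in each class.
rowSums(counts > 0)
```

Classes with fewer drive types are the ones whose boxes would be stretched without preserve = 'single'.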
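- Customise a box plot
# Start with a dashed box plot
p1 <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(linetype = 'dashed', outlier.colour = "red")
# Draw a coloured box on top of the original one; ymin and ymax are mapped to
# the hinges, so this layer has no whiskers
p2 <- p1 +
  stat_boxplot(aes(ymin = after_stat(lower), ymax = after_stat(upper),
                   fill = class))
# Upper cap: mapping ymin to the upper whisker end collapses the error bar to a
# single horizontal line at that value
p3 <- p2 +
  stat_boxplot(aes(ymin = after_stat(ymax)), geom = "errorbar",
               width = 0.2, colour = "#4daf4a")
# Lower cap, built the same way (assigned to p4 so that plot_grid() can use it)
p4 <- p3 +
  stat_boxplot(aes(ymax = after_stat(ymin)), geom = "errorbar",
               width = 0.2, colour = "#377eb8")
# plot_grid() comes from the cowplot package
cowplot::plot_grid(p1, p2, p3, p4)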
Violin plots
A violin plot shows the shape of a distribution through its estimated probability density. It combines features of a box plot and a density plot.
Examples
Start with the simplest example
p <- ggplot(mtcars, aes(factor(cyl), mpg))
p + geom_violin()
Change the orientation
ggplot(mtcars, aes(mpg, factor(cyl))) +
geom_violin()
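The scale parameter controls how the violins are sized relative to each other. It accepts three values:
- area: the default; all violins have the same area (before the tails are trimmed)
- count: areas are scaled in proportion to the number of observations
- width: all violins have the same maximum width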
p + geom_violin(scale = "count")
By default the tails of the violins are trimmed to the range of the data. Set trim = FALSE to keep them
p + geom_violin(trim = FALSE)
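The effect of trimming can be sketched with base R's density() function. This illustrates the idea only and does not reproduce the ggplot2 internals. It uses the mpg values of the 4-cylinder cars in mtcars, and the names d_full and d_trim are illustrative.

```r
x <- mtcars$mpg[mtcars$cyl == 4]

# Untrimmed: the density estimate extends beyond the observed values.
d_full <- density(x)

# Trimmed: the estimate is evaluated only between the minimum and maximum.
d_trim <- density(x, from = min(x), to = max(x))

range(x)
range(d_full$x)
range(d_trim$x)
```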
The adjust parameter multiplies the bandwidth of the density estimate and defaults to 1. A smaller value follows the data more closely and gives a less smooth outline
p + geom_violin(adjust = .5)
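Base R's density() takes an adjust argument that works the same way, so it can be used to check what the multiplier does to the bandwidth. Again this is a sketch on the 4-cylinder cars in mtcars, with illustrative variable names.

```r
x <- mtcars$mpg[mtcars$cyl == 4]

d_default <- density(x)              # adjust = 1
d_half <- density(x, adjust = 0.5)   # half the default bandwidth

# The bandwidth actually used by each estimate.
c(default = d_default$bw, half = d_half$bw)
```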
Grouped violin plots are also placed side by side
p <- ggplot(mtcars, aes(factor(cyl), mpg))
p1 <- p + geom_violin(aes(fill = cyl))
p2 <- p + geom_violin(aes(fill = factor(cyl)))
p3 <- p + geom_violin(aes(fill = factor(vs)))
p4 <- p + geom_violin(aes(fill = factor(am)))
# plot_grid() comes from the cowplot package
cowplot::plot_grid(p1, p2, p3, p4)
Draw quantile lines
p + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))
Combining layers
- Add a point at the median
ggplot(mpg, aes(class, displ)) +
geom_violin(aes(fill = class), show.legend = FALSE) +
stat_summary(fun = median, geom = "point", shape = 23,
size = 2, fill = "white")
- Add the mean and standard deviation
ggplot(mpg, aes(class, displ)) +
geom_violin(aes(fill = class), show.legend = FALSE) +
stat_summary(fun.data = "mean_sdl", fun.args = list(mult = 1),
colour = "white")
Note: if the code above fails with the error
Hmisc package required for this function
install the Hmisc package. mean_sdl is a ggplot2 wrapper around Hmisc::smean.sdl(), so it needs Hmisc at run time.
install.packages("Hmisc")
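If you would rather not install Hmisc, one alternative is to pass your own summary function to fun.data. The sketch below assumes a hypothetical helper called mean_sd, which is not part of ggplot2. It returns the y, ymin and ymax columns that stat_summary expects.

```r
library(ggplot2)

# Hypothetical helper (not a ggplot2 function): mean plus or minus
# `mult` standard deviations, in the format required by fun.data.
mean_sd <- function(y, mult = 1) {
  m <- mean(y)
  s <- sd(y)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

p_sd <- ggplot(mpg, aes(class, displ)) +
  geom_violin(aes(fill = class), show.legend = FALSE) +
  stat_summary(fun.data = mean_sd, fun.args = list(mult = 1),
               colour = "white")
p_sd
```

The plot should look the same as the mean_sdl version, since both draw the mean with a range of one standard deviation either side.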
- Add a box plot
ggplot(mpg, aes(class, displ)) +
geom_violin(aes(fill = class), show.legend = FALSE) +
geom_boxplot(width = 0.1)
- Add jittered points
ggplot(mpg, aes(class, displ)) +
geom_violin(aes(fill = class), show.legend = FALSE) +
geom_jitter(width = 0.1)
- Switch to polar coordinates
ggplot(mpg, aes(class, displ)) +
geom_violin(aes(fill = class), show.legend = FALSE) +
geom_boxplot(width = 0.1) +
coord_polar()