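inspectdf summarises, compares and visualises the columns of data frames. Its reports cover missing values, factor levels, numeric ranges, correlations, column types and memory usage.
Base R offers similar functions, such as str() and dim().
The starwars dataset is used below for the demonstration: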
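For comparison, here is a minimal sketch of what the base R functions mentioned above report. The built-in iris data frame and the col_types helper are choices made here for illustration only; they do not come from the original post.

```r
# Base R inspection of a data frame (iris is only an illustrative example)
dim(iris)   # number of rows and number of columns
str(iris)   # type and first few values of every column

# Count how many columns there are of each class, using base R only
col_types <- vapply(iris, function(col) class(col)[1], character(1))
table(col_types)
```

This gives the raw facts, but not as a tidy table that can be compared across data frames or plotted, which is the gap inspectdf fills.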
rm(list = ls()) # clear the workspace
library(pacman) # provides p_load(), which installs (if needed) and loads packages
p_load(dplyr)
data("starwars")
head(starwars)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Dart~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen~ 178 120 brown, gr~ light blue 52 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
inspect_types()
Summarising a single data frame
p_load(inspectdf)
x <- inspect_types(starwars)
x
## # A tibble: 4 x 4
## type cnt pcnt col_name
## <chr> <int> <dbl> <named list>
## 1 character 8 57.1 <chr [8]>
## 2 list 3 21.4 <chr [3]>
## 3 numeric 2 14.3 <chr [2]>
## 4 integer 1 7.14 <chr [1]>
The column names are stored in col_name:
x$col_name$list
## 12 13 14
## "films" "vehicles" "starships"
show_plot(): displays the column types as a ring chart
x %>%
show_plot()
![](https://img.haomeiwen.com/i17982813/5fb28c67bd42fe27.png)
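Comparing two data frames
To keep the demonstration simple, rows are sampled at random directly from the starwars dataset:
star_1 <- starwars %>%
  sample_n(50) # randomly sample 50 rows
star_1$mass <- as.character(star_1$mass) # change the type of mass to create a mismatch
star_2 <- starwars %>%
  sample_n(50) %>%
  select(-1) # drop the first column (name)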
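Because sample_n() draws rows at random, the outputs shown in this section will differ from run to run. A minimal sketch of making the sampling reproducible follows; the seed value 2021 and the names a and b are arbitrary choices made here, not part of the original post.

```r
library(dplyr)  # provides starwars, sample_n() and %>%

set.seed(2021)  # arbitrary seed, chosen only for illustration
a <- starwars %>% sample_n(50)

set.seed(2021)  # resetting the same seed reproduces the same draw
b <- starwars %>% sample_n(50)

identical(a$name, b$name)
```

Setting a seed once at the top of a script is usually enough; it is reset here only to show that the two draws match.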
When a second data frame is supplied, inspect_types() returns a comparison of the column names and data types found in the two data frames:
x <- inspect_types(star_1,star_2)
x
## # A tibble: 4 x 6
## type equal cnt_1 cnt_2 columns issues
## <chr> <chr> <int> <int> <named list> <list>
## 1 character ✘ 9 7 <tibble [16 x 2]> <chr [2]>
## 2 list ✔ 3 3 <tibble [6 x 2]> <NULL>
## 3 integer ✔ 1 1 <tibble [2 x 2]> <NULL>
## 4 numeric ✘ 1 2 <tibble [3 x 2]> <chr [1]>
x$columns$numeric # list the numeric columns found in star_1 and star_2
## # A tibble: 3 x 2
## col_name data_arg
## <chr> <chr>
## 1 birth_year star_1
## 2 mass star_2
## 3 birth_year star_2
The issues column returned by inspect_types() records where the two data frames do not match. To view these details, expand the column with tidyr::unnest():
p_load(tidyr)
x %>%
select(type,issues) %>%
unnest(issues)
## # A tibble: 3 x 2
## type issues
## <chr> <chr>
## 1 character star_1::mass ~ character <!> star_2::mass ~ numeric
## 2 character star_1::name ~ character missing from star_2
## 3 numeric star_1::mass ~ character <!> star_2::mass ~ numeric
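To see why only the mismatched types appear in the result above, here is a minimal sketch of how tidyr::unnest() treats a list-column. The toy tibble and its values are hypothetical and stand in for the real inspect_types() output.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: a list-column holding zero or more messages per row
toy <- tibble(
  type   = c("character", "list", "numeric"),
  issues = list(c("issue A", "issue B"), NULL, "issue C")
)

# One output row per list element; rows whose element is NULL are dropped
# because keep_empty defaults to FALSE
long <- unnest(toy, cols = issues)
long
```

Passing keep_empty = TRUE to unnest() would keep the row with no issues and fill it with NA instead.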
The comparison can also be visualised with show_plot():
inspect_types(star_1,star_2) %>%
show_plot()
![](https://img.haomeiwen.com/i17982813/91d76dfff2c75c26.png)
inspect_mem()
Shows the memory used by each column:
head(inspect_mem(starwars),3)
## # A tibble: 3 x 4
## col_name bytes size pcnt
## <chr> <int> <chr> <dbl>
## 1 films 20008 19.54 Kb 35.9
## 2 starships 7448 7.27 Kb 13.4
## 3 name 6280 6.13 Kb 11.3
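As a rough cross-check, a similar per-column table can be sketched in base R with object.size(). This is an approximation written for illustration, not inspectdf's own implementation, and the byte counts may differ from those inspect_mem() reports. The built-in iris data and the names bytes and mem are assumptions made here.

```r
# Base R approximation of per-column memory use (not inspectdf's own code)
df <- iris  # built-in data, used here only for illustration

# Size of each column in bytes
bytes <- vapply(df, function(col) as.numeric(object.size(col)), numeric(1))

mem <- data.frame(
  col_name = names(bytes),
  bytes    = unname(bytes),
  pcnt     = round(100 * unname(bytes) / sum(bytes), 1)  # share of the total
)
mem <- mem[order(-mem$bytes), ]  # largest columns first
print(mem)
```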
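Other important functions:
- inspect_na(): summarises missing values (NA)
- inspect_cor(): correlations between numeric columns
- inspect_imb(): feature imbalance in categorical columns
- inspect_num(): summary of numeric columns
- inspect_cat(): summary of categorical (factor) columns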
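The functions listed above follow the same calling pattern as inspect_types(). A minimal sketch on the starwars data follows, assuming dplyr and inspectdf are installed; the variable names are chosen here for illustration.

```r
library(dplyr)      # provides the starwars data
library(inspectdf)

na_summary  <- inspect_na(starwars)   # missing values per column
cor_summary <- inspect_cor(starwars)  # correlations between numeric columns
imb_summary <- inspect_imb(starwars)  # most common level per categorical column
num_summary <- inspect_num(starwars)  # summary statistics for numeric columns
cat_summary <- inspect_cat(starwars)  # level frequencies for categorical columns

head(na_summary, 3)
show_plot(na_summary)  # each summary can be passed to show_plot()
```

As with inspect_types(), each of these functions also accepts a second data frame to produce a comparison.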
Reference:
Alastair Rushworth (2021). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.10. https://CRAN.R-project.org/package=inspectdf