Exploratory Data Analysis in R: the xda Package

Author: 柳叶刀与小鼠标 | Published: 2019-09-14 09:38

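This package contains several tools for performing an initial exploratory analysis on any input dataset. It includes custom functions for plotting the data and for running different types of analysis, such as univariate, bivariate, and multivariate investigation, which is the first step of any predictive modeling pipeline. Use it to get a thorough understanding of a dataset before you start building predictive models.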

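The package currently includes the following functions:

numSummary(mydata) automatically detects all numeric columns in the data frame mydata and provides their summary statistics.

charSummary(mydata) automatically detects all character columns in the data frame mydata and provides their summary statistics.

Plot(mydata, dep.var) plots every independent variable in the data frame mydata against the dependent variable specified by the dep.var parameter.

removeSpecial(mydata, vec) replaces all special characters (specified by the vector vec) in the data frame mydata with NA.

bivariate(mydata, dep.var, indep.var) performs a bivariate analysis between the dependent variable dep.var and the independent variable indep.var in the data frame mydata.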

Note: all of the functions above require mydata to be a data.frame. Convert your input dataset to a data.frame before using any function in this package.
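
To make the data.frame requirement concrete, here is a minimal base R sketch that coerces a matrix to a data.frame and then replaces a set of placeholder values with NA. The helper `replace_special` and the sample data are hypothetical, written only to mimic the behavior described for removeSpecial(); this is not the package's own code.

```r
# A matrix is not a data.frame, so coerce it before calling any xda function.
m <- matrix(c("1", "?", "3", "n/a"), ncol = 2,
            dimnames = list(NULL, c("a", "b")))
mydata <- as.data.frame(m, stringsAsFactors = FALSE)

# Hypothetical helper (not part of xda): set every cell whose value
# appears in `vec` to NA, column by column.
replace_special <- function(data, vec) {
  data[] <- lapply(data, function(col) {
    col[col %in% vec] <- NA
    col
  })
  data
}

cleaned <- replace_special(mydata, c("?", "n/a"))
print(cleaned)
```

Assigning through `data[]` keeps the result a data.frame with the original column names, so it can be passed straight to the summary functions.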

Installation

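The best way to install the xda package is to install the devtools package first. To install devtools, follow the instructions [here](https://github.com/hadley/devtools). Then install xda with the following commands: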

```r
library(devtools)
install_github("ujjwalkarn/xda")
```

Alternatively, you may also use the githubinstall package for installing xda:

install.packages("githubinstall")
library(githubinstall)
githubinstall("xda")

Usage

See each function's documentation to learn how to use it. For example, to view the documentation for numSummary(), run ?numSummary.

## load the package into the current session

library(xda)

numSummary()

## to view a comprehensive summary for all numeric columns in the iris dataset

numSummary(iris)

## n = total number of rows for that variable
## nunique = number of unique values
## nzeros = number of zeros
## iqr = interquartile range
## noutlier = number of outliers
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## 5% = 5th percentile value of that variable (value below which 5 percent of the observations may be found)
## the percentile values are helpful in detecting outliers

Output

> numSummary(iris)

                n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
 Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
 Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139    0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
 Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
 Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358   -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50
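
The lowerbound and upperbound values shown for Sepal.Length and Sepal.Width are consistent with the common 1.5 × IQR rule for flagging outliers. The following base R sketch computes those fences for every numeric column. `outlier_summary` is a hypothetical helper for illustration, not xda's implementation, and because it uses R's default quantile method, some values may differ slightly from the table above.

```r
# Hypothetical helper (not xda code): IQR-based outlier summary for the
# numeric columns of a data.frame, using 1.5 * IQR fences.
outlier_summary <- function(data) {
  num_cols <- data[vapply(data, is.numeric, logical(1))]
  rows <- lapply(num_cols, function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - 1.5 * iqr   # values below this are flagged
    upper <- q[2] + 1.5 * iqr   # values above this are flagged
    data.frame(
      n = length(x),
      iqr = iqr,
      lowerbound = lower,
      upperbound = upper,
      noutlier = sum(x < lower | x > upper, na.rm = TRUE),
      miss = sum(is.na(x))
    )
  })
  # One row per numeric column; row names come from the column names.
  do.call(rbind, rows)
}

res <- outlier_summary(iris)
print(round(res, 2))
```

For Sepal.Width this flags four values, in line with the noutlier column reported above.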

charSummary()

## to view a comprehensive summary for all character columns in the warpbreaks dataset

charSummary(warpbreaks)

## n = total number of rows for that variable
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## unique = number of unique levels of that variable
## top5levels:count = top 5 levels (unique values) in each column sorted by count
## for example, wool has 2 unique levels 'A' and 'B' each with count of 27

Output

> charSummary(warpbreaks)

          n miss miss% unique top5levels:count
 wool    54    0     0      2       A:27, B:27
 tension 54    0     0      3 H:18, L:18, M:18
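
A summary of this shape can be approximated in base R with table(). The sketch below is an assumption about how such counts might be assembled, not the package's code; `top_levels` and `level_summary` are hypothetical names, and ties between equally frequent levels are broken alphabetically.

```r
# Hypothetical helper: the k most frequent levels of x as "level:count" text.
top_levels <- function(x, k = 5) {
  counts <- table(x)                 # NA values are excluded by default
  lv <- names(counts)
  n <- as.integer(counts)
  ord <- order(-n, lv)               # by count (descending), then by name
  ord <- ord[seq_len(min(k, length(ord)))]
  paste(lv[ord], n[ord], sep = ":", collapse = ", ")
}

# Hypothetical helper: one row per factor or character column.
level_summary <- function(data) {
  is_cat <- vapply(data, function(col) is.factor(col) || is.character(col),
                   logical(1))
  cat_cols <- data[is_cat]
  data.frame(
    n = vapply(cat_cols, length, integer(1)),
    miss = vapply(cat_cols, function(col) sum(is.na(col)), integer(1)),
    unique = vapply(cat_cols,
                    function(col) length(unique(col[!is.na(col)])),
                    integer(1)),
    top5levels = vapply(cat_cols, top_levels, character(1)),
    stringsAsFactors = FALSE
  )
}

res <- level_summary(warpbreaks)
print(res)
```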

bivariate()

## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset

bivariate(iris,'Species','Sepal.Length')

## bin_Sepal.Length = 'Sepal.Length' variable has been binned into 4 equal intervals (original range is [4.3,7.9])
## for each interval of 'Sepal.Length', the number of samples from each category of 'Species' is shown
## i.e. 39 of the 50 setosa samples have Sepal.Length in the range (4.3,5.2], and so on.
## the number of intervals (4 in this case) can be customized (see documentation)

Output

> bivariate(iris,'Species','Sepal.Length')

   bin_Sepal.Length setosa versicolor virginica
 1        (4.3,5.2]     39          5         1
 2        (5.2,6.1]     11         29        10
 3          (6.1,7]      0         16        27
 4          (7,7.9]      0          0        12
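
The interval labels in this output resemble those produced by base R's cut(). As a rough sketch of the same idea, the snippet below bins Sepal.Length into 4 equal-width intervals and cross-tabulates the bins against Species. Whether bivariate() bins the data in exactly this way is an assumption, so the counts are not guaranteed to match the table above.

```r
# Bin the independent variable into 4 equal-width intervals.
bins <- cut(iris$Sepal.Length, breaks = 4)

# Cross-tabulate the bins against the dependent variable.
tab <- table(bin_Sepal.Length = bins, Species = iris$Species)
print(tab)
```

Changing the `breaks` argument changes the number of intervals.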

Plot()

## to plot all other variables against the 'Petal.Length' variable in the iris dataset

Plot(iris,'Petal.Length')

## some interesting patterns can be seen in the resulting plots, and these insights can be used for predictive modeling
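
For readers without the package installed, a similar set of plots can be drawn with base graphics. The sketch below is only an approximation of the idea: `plot_against` is a hypothetical helper, and it makes no assumption about which plot types Plot() itself chooses. Base R's plot() draws a scatter plot for a numeric variable and box plots for a factor.

```r
# Hypothetical helper (not xda code): plot dep.var against every other column.
plot_against <- function(data, dep.var) {
  others <- setdiff(names(data), dep.var)
  # Arrange the panels in a two-row grid and restore the layout afterwards.
  old_par <- par(mfrow = c(2, max(1, ceiling(length(others) / 2))))
  on.exit(par(old_par))
  for (v in others) {
    plot(data[[v]], data[[dep.var]],
         xlab = v, ylab = dep.var, main = paste(dep.var, "vs", v))
  }
  invisible(others)
}

# Draw the plots only in an interactive session.
if (interactive()) {
  plot_against(iris, "Petal.Length")
}
```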