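This package contains several tools for performing initial exploratory analysis on any input dataset. It includes custom functions for plotting the data and for performing different kinds of analysis, such as univariate, bivariate, and multivariate investigation, which is the first step of any predictive modeling pipeline. Use it to get a good sense of a dataset before building predictive models.
The functions currently included are: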
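- `numSummary(mydata)` automatically detects all numeric columns in the data frame `mydata` and provides their summary statistics.
- `charSummary(mydata)` automatically detects all character columns in the data frame `mydata` and provides their summary statistics.
- `Plot(mydata, dep.var)` plots all independent variables in the data frame `mydata` against the dependent variable specified by the `dep.var` parameter.
- `removeSpecial(mydata, vec)` replaces all special characters (specified by the vector `vec`) in the data frame `mydata` with `NA`.
- `bivariate(mydata, dep.var, indep.var)` performs bivariate analysis between the dependent variable `dep.var` and the independent variable `indep.var` in the data frame `mydata`.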
Note: all of the functions above require `mydata` to be a data.frame. Convert your input dataset to a data.frame before using any function in this package.
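As a minimal base-R sketch of the two points above (coercing an input to a data.frame, and the kind of replacement `removeSpecial()` is described as doing), consider the following. The input matrix, the tokens `"?"` and `"-"`, and the helper `replace_special()` are all hypothetical; the helper is an illustration and not xda's implementation.

```r
# Hypothetical input: a character matrix in which "?" and "-" mark bad entries.
raw <- matrix(c("1", "?", "3", "a", "b", "-"), ncol = 2,
              dimnames = list(NULL, c("x", "y")))

# The package functions expect a data.frame, so coerce first.
mydata <- as.data.frame(raw, stringsAsFactors = FALSE)

# Hypothetical helper (not xda's code): set every cell that equals one of
# the tokens in `vec` to NA, column by column.
replace_special <- function(df, vec) {
  df[] <- lapply(df, function(col) {
    col[col %in% vec] <- NA
    col
  })
  df
}

cleaned <- replace_special(mydata, c("?", "-"))
cleaned
```

With the real package, the corresponding call would presumably be `removeSpecial(mydata, c("?", "-"))`; check `?removeSpecial` for the exact behaviour.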
Installation
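The best way to install the `xda` package is to first install the `devtools` package. To install `devtools`, follow the instructions [here](https://github.com/hadley/devtools). Then install `xda` with the following commands: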
```r
library(devtools)
install_github("ujjwalkarn/xda")
```
Alternatively, you may use the `githubinstall` package to install `xda`:

```r
install.packages("githubinstall")
library(githubinstall)
githubinstall("xda")
```
Usage
Refer to the documentation of each function to learn how to use it. For example, to see the documentation for the `numSummary()` function, use `?numSummary`.
```r
## load the package into the current session
library(xda)
```
numSummary()
```r
## to view a comprehensive summary for all numeric columns in the iris dataset
numSummary(iris)
## n = total number of rows for that variable
## nunique = number of unique values
## nzeros = number of zeros
## iqr = interquartile range
## noutlier = number of outliers
## miss = number of rows with a missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## 5% = 5th percentile value of that variable (value below which 5 percent of the observations may be found)
## the percentile values are helpful in detecting outliers
```
Output
```
> numSummary(iris)
               n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139    0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358   -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50
```
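The `lowerbound`, `upperbound`, and `noutlier` values for `Sepal.Width` in the output above are consistent with the common 1.5 × IQR rule (2.8 − 1.5 × 0.5 = 2.05 and 3.3 + 1.5 × 0.5 = 4.05). The sketch below reproduces that rule in base R. It assumes, rather than confirms, that this is how the package computes these columns; `outlier_summary()` is a hypothetical helper, and it uses R's default quantile definition, so its figures may differ from the package's for some columns.

```r
# Hypothetical helper (not part of xda): 1.5 * IQR fences for one numeric vector.
outlier_summary <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr   # values below this count as outliers
  upper <- q[2] + 1.5 * iqr   # values above this count as outliers
  c(iqr = iqr,
    lowerbound = lower,
    upperbound = upper,
    noutlier = sum(x < lower | x > upper, na.rm = TRUE))
}

# Apply the helper to every numeric column of iris.
num_cols <- iris[sapply(iris, is.numeric)]
round(sapply(num_cols, outlier_summary), 2)
```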
charSummary()
```r
## to view a comprehensive summary for all character columns in the warpbreaks dataset
charSummary(warpbreaks)
## n = total number of rows for that variable
## miss = number of rows with a missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## unique = number of unique levels of that variable
## top5levels:count = top 5 levels (unique values) in each column sorted by count
## for example, wool has 2 unique levels 'A' and 'B' each with count of 27
```
Output
```
> charSummary(warpbreaks)
         n miss miss% unique top5levels:count
wool    54    0     0      2 A:27, B:27
tension 54    0     0      3 H:18, L:18, M:18
```
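To make the columns above concrete, here is a rough base-R sketch that computes similar quantities with `table()`. The helper `level_summary()` and the column names `miss_pct` and `top_levels` are hypothetical, and breaking ties between equal counts alphabetically is an assumption of this sketch, not something taken from the package source.

```r
# Hypothetical helper (not part of xda): summarise one categorical column.
level_summary <- function(x, k = 5) {
  counts <- table(x)                                  # NA values are excluded
  # Largest counts first; ties broken alphabetically by level name.
  ord <- order(-as.vector(counts), names(counts))
  top <- counts[ord][seq_len(min(k, length(counts)))]
  miss <- sum(is.na(x))
  data.frame(
    n = length(x),
    miss = miss,
    miss_pct = 100 * miss / length(x),
    unique = length(counts),
    top_levels = paste0(names(top), ":", as.vector(top), collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Pick the factor/character columns of warpbreaks and summarise each one.
is_cat <- sapply(warpbreaks, function(col) is.factor(col) || is.character(col))
res <- do.call(rbind, lapply(warpbreaks[is_cat], level_summary))
res
```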
bivariate()
```r
## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset
bivariate(iris,'Species','Sepal.Length')
## bin_Sepal.Length = the 'Sepal.Length' variable has been binned into 4 equal-width intervals (original range is [4.3,7.9])
## for each interval of 'Sepal.Length', the number of samples from each category of 'Species' is shown
## i.e. 39 of the 50 setosa samples have Sepal.Length in the range (4.3,5.2], and so on
## the number of intervals (4 in this case) can be customized (see documentation)
```
Output
```
> bivariate(iris,'Species','Sepal.Length')
  bin_Sepal.Length setosa versicolor virginica
1        (4.3,5.2]     39          5         1
2        (5.2,6.1]     11         29        10
3          (6.1,7]      0         16        27
4          (7,7.9]      0          0        12
```
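A cross-tabulation of this kind can be approximated in base R with `cut()` and `table()`. This is only a sketch: whether `bivariate()` uses `cut()` internally is an assumption, and the variable names `bins` and `tab` are illustrative.

```r
# Bin Sepal.Length into 4 equal-width intervals over its observed range.
bins <- cut(iris$Sepal.Length, breaks = 4)

# Count how many samples of each species fall into each interval.
tab <- table(bin_Sepal.Length = bins, Species = iris$Species)
tab
```

Changing `breaks` to another integer changes the number of intervals, which mirrors the customization mentioned for `bivariate()`.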
Plot()
```r
## to plot all other variables against the 'Petal.Length' variable in the iris dataset
Plot(iris,'Petal.Length')
## some interesting patterns can be seen in the plots below and these insights can be used for predictive modeling
```
Output
```
> Plot(iris,'Petal.Length')
```

[Figure: plots of the other iris variables against Petal.Length]