Exploratory Data Analysis in R: The xda Package

Author: 柳叶刀与小鼠标 | Published: 2019-09-14 09:38

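    The xda package contains several tools for initial exploratory analysis of any input dataset. It includes custom functions for plotting data and for univariate, bivariate, and multivariate investigation, which is the first step of any predictive modeling pipeline. Use it to understand a dataset thoroughly before you start building predictive models.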

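    The package currently provides the following functions:

    numSummary(mydata) automatically detects all numeric columns in the data frame mydata and returns their summary statistics.

    charSummary(mydata) automatically detects all character columns in the data frame mydata and returns their summary statistics.

    Plot(mydata, dep.var) plots every independent variable in the data frame mydata against the dependent variable specified by dep.var.

    removeSpecial(mydata, vec) replaces all special characters (specified by the vector vec) in the data frame mydata with NA.

    bivariate(mydata, dep.var, indep.var) performs bivariate analysis between the dependent variable dep.var and the independent variable indep.var in the data frame mydata.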
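
To make the replacement behaviour described for removeSpecial() concrete, here is a minimal base R sketch. The helper `replace_special_with_na()` and the sample data are hypothetical and are not xda's implementation; the sketch assumes that "special characters" means whole cell values listed in the vector.

```r
# Hypothetical helper (not part of xda): set every cell whose value
# appears in `special` to NA, column by column.
replace_special_with_na <- function(df, special) {
  stopifnot(is.data.frame(df))
  df[] <- lapply(df, function(col) {
    col[col %in% special] <- NA  # %in% compares values as character here
    col
  })
  df
}

# Made-up example data with placeholder values "?", "" and "-"
raw <- data.frame(
  age  = c("21", "?", "35"),
  city = c("Pune", "", "-"),
  stringsAsFactors = FALSE
)

cleaned <- replace_special_with_na(raw, c("?", "", "-"))
cleaned
```

Columns that held numbers as text (such as `age` here) still need an explicit `as.numeric()` conversion afterwards.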

    Note: all of the functions above require mydata to be a data.frame. Convert the input dataset to a data.frame before using any function in this package.
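
As a small illustration of that conversion, the base R sketch below turns a numeric matrix into a data.frame with `as.data.frame()`. The matrix is built from the built-in iris data purely for demonstration.

```r
# A matrix is not a data.frame, so it has to be converted first.
m <- as.matrix(iris[, 1:4])
is.data.frame(m)        # FALSE

mydata <- as.data.frame(m)
is.data.frame(mydata)   # TRUE

# Inspect the column types after conversion
str(mydata)
```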

    Installation

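    • The best way to install xda is to install the devtools package first. To install devtools, follow the instructions [here](https://github.com/hadley/devtools). Then install xda with the following commands: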

    ```r
    library(devtools)
    install_github("ujjwalkarn/xda")
    ```
    
    • Alternatively, you may also use the githubinstall package for installing xda:

      install.packages("githubinstall")
      library(githubinstall)
      githubinstall("xda")
      

    Usage

    See each function's documentation to learn how to use it. For example, to view the documentation for numSummary(), run ?numSummary

    ## load the package into the current session
    
    library(xda)
    

    numSummary()

    ## to view a comprehensive summary for all numeric columns in the iris dataset
    
    numSummary(iris)
    
    ## n = total number of rows for that variable
    ## nunique = number of unique values
    ## nzeros = number of zeroes
    ## iqr = interquartile range
    ## noutlier = number of outliers
    ## miss = number of rows with missing value
    ## miss% = percentage of total rows with missing values ((miss/n)*100)
    ## 5% = 5th percentile value of that variable (value below which 5 percent of the observations may be found)
    ## the percentile values are helpful in detecting outliers
    
    Output
    > numSummary(iris)
    
                    n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
     Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
     Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139    0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
     Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
     Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358   -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50
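
The lowerbound, upperbound, and noutlier values in the output above are consistent with the common 1.5 × IQR rule: for Sepal.Width, 2.8 - 1.5 × 0.5 = 2.05 and 3.3 + 1.5 × 0.5 = 4.05. The base R sketch below recomputes them under that assumption. It is not taken from xda's source.

```r
# Recompute the outlier-related columns for one variable using the
# 1.5 * IQR rule (an assumption that matches the table above).
x <- iris$Sepal.Width

q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
iqr <- q[2] - q[1]

lowerbound <- q[1] - 1.5 * iqr
upperbound <- q[2] + 1.5 * iqr

# Count observations that fall outside the two bounds
noutlier <- sum(x < lowerbound | x > upperbound)

c(iqr = iqr, lowerbound = lowerbound, upperbound = upperbound,
  noutlier = noutlier)
```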
    
    

    charSummary()

    ## to view a comprehensive summary for all character (factor) columns in the warpbreaks dataset
    
    charSummary(warpbreaks)
    
    ## n = total number of rows for that variable
    ## miss = number of rows with missing value
    ## miss% = percentage of total rows with missing values ((miss/n)*100)
    ## unique = number of unique levels of that variable
    ## top5levels:count = top 5 levels (unique values) in each column sorted by count
    ## for example, wool has 2 unique levels 'A' and 'B' each with count of 27 
    
    
    Output
    > charSummary(warpbreaks)
    
              n miss miss% unique top5levels:count
     wool    54    0     0      2       A:27, B:27
     tension 54    0     0      3 H:18, L:18, M:18
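
For readers who want to see where these columns come from, here is a rough base R sketch that computes n, miss, miss%, and the number of unique levels for every character or factor column. The helper `summarise_levels()` and the column name `miss_pct` are hypothetical; this approximates the idea and is not xda's code.

```r
# Hypothetical helper (not part of xda): per-column counts for
# character and factor columns of a data.frame.
summarise_levels <- function(df) {
  stopifnot(is.data.frame(df))
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col),
                   logical(1))
  cat_cols <- df[is_cat]
  out <- data.frame(
    n      = vapply(cat_cols, length, integer(1)),
    miss   = vapply(cat_cols, function(col) sum(is.na(col)), integer(1)),
    unique = vapply(cat_cols,
                    function(col) length(unique(col[!is.na(col)])),
                    integer(1))
  )
  out$miss_pct <- 100 * out$miss / out$n  # (miss / n) * 100
  out
}

res <- summarise_levels(warpbreaks)
res

# Per-level counts for a single column
table(warpbreaks$wool)
```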
    
    

    bivariate()

    ## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset
    
    bivariate(iris,'Species','Sepal.Length')
    
    ## bin_Sepal.Length = 'Sepal.Length' variable has been binned into 4 equal intervals (original range is [4.3,7.9])
    ## for each interval of 'Sepal.Length', the number of samples from each category of 'Species' is shown 
    ## i.e. 39 of the 50 setosa samples have Sepal.Length in the range (4.3,5.2], and so on.
    ## the number of intervals (4 in this case) can be customized (see documentation)
    
    
    Output
    > bivariate(iris,'Species','Sepal.Length')
    
       bin_Sepal.Length setosa versicolor virginica
     1        (4.3,5.2]     39          5         1
     2        (5.2,6.1]     11         29        10
     3          (6.1,7]      0         16        27
     4          (7,7.9]      0          0        12
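
The binning-and-counting idea can be sketched in base R with `cut()` and `table()`. This assumes equal-width bins produced by `cut()` and is only an approximation of what bivariate() reports, not the package's implementation.

```r
# Split Sepal.Length into 4 equal-width intervals, then cross-tabulate
# the intervals against Species.
bins <- cut(iris$Sepal.Length, breaks = 4)

tab <- table(bin_Sepal.Length = bins, Species = iris$Species)
tab

# Each species contributes 50 rows in total
colSums(tab)
```

Changing `breaks` to another integer gives a different number of intervals.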
    
    

    Plot()

    ## to plot all other variables against the 'Petal.Length' variable in the iris dataset
    
    Plot(iris,'Petal.Length')
    
    ## some interesting patterns can be seen in the plots below and these insights can be used for predictive modeling
    
    Output
    > Plot(iris,'Petal.Length')
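
As a rough base R illustration of plotting every other variable against a chosen dependent variable, consider the sketch below. The helper `plot_against()` is hypothetical and the chart types are whatever base `plot()` picks (scatter plots for numeric columns, box plots for factors). The charts that xda's Plot() draws may differ.

```r
# Hypothetical helper (not part of xda): one panel per remaining column,
# with the dependent variable on the y axis.
plot_against <- function(df, dep_var) {
  stopifnot(is.data.frame(df), dep_var %in% names(df))
  others <- setdiff(names(df), dep_var)

  # Arrange the panels in a two-row grid and restore settings on exit
  old_par <- par(mfrow = c(2, ceiling(length(others) / 2)))
  on.exit(par(old_par))

  for (v in others) {
    plot(df[[v]], df[[dep_var]],
         xlab = v, ylab = dep_var,
         main = paste(dep_var, "vs", v))
  }
  invisible(others)  # names of the variables that were plotted
}

plotted <- plot_against(iris, "Petal.Length")
plotted
```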
    
