【r<-高级|统计|方案】汇总数据

作者: 王诗翔 | 来源:发表于2017-11-15 18:10 被阅读20次

    问题

    你想要按照组别总结你的数据(均值、标准差等等)。

    方案

    有三种方法描述基于一些特定变量的分组数据,然后对每一组使用总结函数(像均值、标准差等等)。

    • ddply()函数:它比较容易使用,但需要载入plyr包。这种方法可能就是你要找的(说明很多人用呗,好用呗)。
    • summaryBy()函数:它也比较容易使用,然而它需要载入doBy包。
    • aggregate()函数,它比较难使用一点但内置于R中。

    假设你有以下数据并想求得每一组样本大小、均值的改变、标准差以及均值的标准误,而这里的组别是根据性别和条件指定的:F-placebo, F-aspirin, M-placeboM-aspirin

    data <- read.table(header=TRUE, text='
     subject sex condition before after change
           1   F   placebo   10.1   6.9   -3.2
           2   F   placebo    6.3   4.2   -2.1
           3   M   aspirin   12.4   6.3   -6.1
           4   F   placebo    8.1   6.1   -2.0
           5   M   aspirin   15.2   9.9   -5.3
           6   F   aspirin   10.9   7.0   -3.9
           7   F   aspirin   11.6   8.5   -3.1
           8   M   aspirin    9.5   3.0   -6.5
           9   F   placebo   11.5   9.0   -2.5
          10   M   placebo   11.9  11.0   -0.9
          11   F   aspirin   11.4   8.0   -3.4
          12   M   aspirin   10.0   4.4   -5.6
          13   M   aspirin   12.5   5.4   -7.1
          14   M   placebo   10.6  10.6    0.0
          15   M   aspirin    9.1   4.3   -4.8
          16   F   placebo   12.1  10.2   -1.9
          17   F   placebo   11.0   8.8   -2.2
          18   F   placebo   11.9  10.2   -1.7
          19   M   aspirin    9.1   3.6   -5.5
          20   M   placebo   13.5  12.4   -1.1
          21   M   aspirin   12.0   7.5   -4.5
          22   F   placebo    9.1   7.6   -1.5
          23   M   placebo    9.9   8.0   -1.9
          24   F   placebo    7.6   5.2   -2.4
          25   F   placebo   11.8   9.7   -2.1
          26   F   placebo   11.8  10.7   -1.1
          27   F   aspirin   10.1   7.9   -2.2
          28   M   aspirin   11.6   8.3   -3.3
          29   F   aspirin   11.3   6.8   -4.5
          30   F   placebo   10.3   8.3   -2.0
     ')
    
    

    使用 ddply

    library(plyr)
    
    # 给每一组运行长度、均值、标准差等函数
    # 每一组依据性别+条件划分
    cdata <- ddply(data, c("sex", "condition"), summarise,
                   N    = length(change),
                   mean = mean(change),
                   sd   = sd(change),
                   se   = sd / sqrt(N)
    )
    cdata
    #>   sex condition  N      mean        sd        se
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190
    #> 4   M   placebo  4 -0.975000 0.7804913 0.3902456
    
    

    处理缺失值

    如果数据中存在NA值,需要给每个函数添加na.rm=TRUE标记去除缺失值。因为函数length()没有na.rm选项,所以可以使用sum(!is.na(...))的方式对非缺失值进行计数。

    # 给数据加些NA值
    dataNA <- data
    dataNA$change[11:14] <- NA
    
    cdata <- ddply(dataNA, c("sex", "condition"), summarise,
                   N    = sum(!is.na(change)),
                   mean = mean(change, na.rm=TRUE),
                   sd   = sd(change, na.rm=TRUE),
                   se   = sd / sqrt(N)
    )
    cdata
    #>   sex condition  N      mean        sd        se
    #> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713
    #> 4   M   placebo  3 -1.300000 0.5291503 0.3055050
    
    

    自动总结函数

    不像我们刚才手动地指定想要的值然后计算标准误,这个函数可以自动处理所有的细节。它可以干以下的事情:

    • 寻找均值、标准差和计数
    • 寻找均值的标准误(强调,如果你处理的是被试内变量这可能不是你想要的,,参见../../Graphs/Plotting means and error bars (ggplot2)获取怎么绘制组内变量的误差棒信息。关于被试设计问题,可以点击详细了解
    • 寻找95%的置信区间(也可以指定其他值)
    • 重命令结果数据集的变量名,这样更方便后续处理

    要使用的话,把函数放你的代码中然后像下面一样调用它。

    ## Summarizes data.
    ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
    ##   data: a data frame.
    ##   measurevar: the name of a column that contains the variable to be summariezed
    ##   groupvars: a vector containing names of columns that contain grouping variables
    ##   na.rm: a boolean that indicates whether to ignore NA's
    ##   conf.interval: the percent range of the confidence interval (default is 95%)
    summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                          conf.interval=.95, .drop=TRUE) {
        library(plyr)
    
        # New version of length which can handle NA's: if na.rm==T, don't count them
        length2 <- function (x, na.rm=FALSE) {
            if (na.rm) sum(!is.na(x))
            else       length(x)
        }
    
        # This does the summary. For each group's data frame, return a vector with
        # N, mean, and sd
        datac <- ddply(data, groupvars, .drop=.drop,
          .fun = function(xx, col) {
            c(N    = length2(xx[[col]], na.rm=na.rm),
              mean = mean   (xx[[col]], na.rm=na.rm),
              sd   = sd     (xx[[col]], na.rm=na.rm)
            )
          },
          measurevar
        )
    
        # Rename the "mean" column    
        datac <- rename(datac, c("mean" = measurevar))
    
        datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
    
        # Confidence interval multiplier for standard error
        # Calculate t-statistic for confidence interval: 
        # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
        ciMult <- qt(conf.interval/2 + .5, datac$N-1)
        datac$ci <- datac$se * ciMult
    
        return(datac)
    }
    
    

    举个例子使用它(用95%的置信区间)。与之前手动计算这些步骤相反,summarySE函数一步搞定:

    summarySE(data, measurevar="change", groupvars=c("sex", "condition"))
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    #> 4   M   placebo  4 -0.975000 0.7804913 0.3902456 1.2419358
    
    # With a data set with NA's, use na.rm=TRUE
    summarySE(dataNA, measurevar="change", groupvars=c("sex", "condition"), na.rm=TRUE)
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572 1.5879046
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713 0.9872588
    #> 4   M   placebo  3 -1.300000 0.5291503 0.3055050 1.3144821
    
    

    用零填满空组合

    有时候总结的数据框中存在因子的空组合 - 这意思是,因子组合可能存在,但原始数据框里又没有实际出现。它在自动填满有NA值的数据框时有用。要做到这一点,当调用ddplysummarySE时设置.drop=FALSE。(这里我翻译的不是很如意,大家可以查看原文)

    例子:

    # 首先移除所有 Male+Placebo 条目
    dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
    
    # 如果我们总结数据,在本来有 Male+Placebo 的地方会存在空行
    # 因为这个组合已经被我们删除了
    summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    
    # 设置 .drop=FALSE 指定不要扔掉这个组合
    summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"), .drop=FALSE)
    #> Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    #> 4   M   placebo  0       NaN        NA        NA        NA
    
    

    使用summaryBy

    使用summarizeBy()函数瓦解数据:

    library(doBy)
    
    # 给每一组运行长度、均值、标准差等函数
    # 每一组依据性别+条件划分
    cdata <- summaryBy(change ~ sex + condition, data=data, FUN=c(length,mean,sd))
    cdata
    #>   sex condition change.length change.mean change.sd
    #> 1   F   aspirin             5   -3.420000 0.8642916
    #> 2   F   placebo            12   -2.058333 0.5247655
    #> 3   M   aspirin             9   -5.411111 1.1307569
    #> 4   M   placebo             4   -0.975000 0.7804913
    
    # Rename column change.length to just N
    names(cdata)[names(cdata)=="change.length"] <- "N"
    
    # Calculate standard error of the mean
    cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
    cdata
    #>   sex condition  N change.mean change.sd change.se
    #> 1   F   aspirin  5   -3.420000 0.8642916 0.3865230
    #> 2   F   placebo 12   -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  9   -5.411111 1.1307569 0.3769190
    #> 4   M   placebo  4   -0.975000 0.7804913 0.3902456
    
    

    注意如果你有任何被试内变量,这些标准误值在比对组别差异时就没用了。参见 ../../Graphs/Plotting means and error bars (ggplot2) 获取如何绘制被试内变量的误差棒。

    处理缺失值

    如果数据中存在NA值,你需要添加na.rm=TRUE选项。通常你可以在summaryBy()函数中设置,但length()函数识别不了这个选项。一种解决方式是根据length()函数定义一个新的取长度函数去处理NA值。

    # 新版的length函数可以处理NA值,如果na.rm=T,则不对NA计数
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }
    
    # 给数据添加一些NA值
    dataNA <- data
    dataNA$change[11:14] <- NA
    
    cdataNA <- summaryBy(change ~ sex + condition, data=dataNA,
                         FUN=c(length2, mean, sd), na.rm=TRUE)
    cdataNA
    #>   sex condition change.length2 change.mean change.sd
    #> 1   F   aspirin              4   -3.425000 0.9979145
    #> 2   F   placebo             12   -2.058333 0.5247655
    #> 3   M   aspirin              7   -5.142857 1.0674848
    #> 4   M   placebo              3   -1.300000 0.5291503
    
    # Now, do the same as before
    
    

    自动总结函数

    (注意这里的自动总结函数与之前的不同,它是通过summaryBy实现的)

    不像我们刚才手动地指定想要的值然后计算标准误,这个函数可以自动处理所有的细节。它可以干以下的事情:

    • 寻找均值、标准差和计数
    • 寻找均值的标准误(强调,如果你处理的是被试内变量这可能不是你想要的,,参见../../Graphs/Plotting means and error bars (ggplot2)获取怎么绘制组内变量的误差棒信息。关于被试设计问题,可以点击详细了解
    • 寻找95%的置信区间(也可以指定其他值)
    • 重命令结果数据集的变量名,这样更方便后续处理

    要使用的话,把函数放你的代码中然后像下面一样调用它。

    ## Summarizes data.
    ## Gives count, mean, standard deviation, standard error of the mean, and confidence 
    ## interval (default 95%).
    ##   data: a data frame.
    ##   measurevar: the name of a column that contains the variable to be summariezed
    ##   groupvars: a vector containing names of columns that contain grouping variables
    ##   na.rm: a boolean that indicates whether to ignore NA's
    ##   conf.interval: the percent range of the confidence interval (default is 95%)
    summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
        library(doBy)
    
        # New version of length which can handle NA's: if na.rm==T, don't count them
        length2 <- function (x, na.rm=FALSE) {
            if (na.rm) sum(!is.na(x))
            else       length(x)
        }
    
        # Collapse the data
        formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
        datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
    
        # Rename columns
        names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
        names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- "sd"
        names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
        
        datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
        
        # Confidence interval multiplier for standard error
        # Calculate t-statistic for confidence interval: 
        # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
        ciMult <- qt(conf.interval/2 + .5, datac$N-1)
        datac$ci <- datac$se * ciMult
        
        return(datac)
    }
    
    

    举个例子使用它(用95%的置信区间)。与之前手动计算这些步骤相反,summarySE函数一步搞定:

    summarySE(data, measurevar="change", groupvars=c("sex","condition"))
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    #> 4   M   placebo  4 -0.975000 0.7804913 0.3902456 1.2419358
    
    # With a data set with NA's, use na.rm=TRUE
    summarySE(dataNA, measurevar="change", groupvars=c("sex","condition"), na.rm=TRUE)
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  4 -3.425000 0.9979145 0.4989572 1.5879046
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  7 -5.142857 1.0674848 0.4034713 0.9872588
    #> 4   M   placebo  3 -1.300000 0.5291503 0.3055050 1.3144821
    
    

    用零填满空组合

    有时候总结的数据框中存在因子的空组合 - 这意思是,因子组合可能存在,但原始数据框里又没有实际出现。它在自动填满有NA值的数据框时有用。

    这个例子将会用0填满缺失的组合:

    fillMissingCombs <- function(df, factors, measures) {
    
        # Make a list of the combinations of factor levels
        levelList <- list()
        for (f in factors) {  levelList[[f]] <- levels(df[,f])  }
        
        fullFactors <- expand.grid(levelList)
        
        dfFull <- merge(fullFactors, df, all.x=TRUE)
        
        # Wherever there is an NA in the measure vars, replace with 0
        for (m in measures) {
          dfFull[is.na(dfFull[,m]), m] <- 0
        }
    
        return(dfFull)
    }
    
    

    使用例子:

    # First remove some all Male+Placebo entries from the data
    dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
    
    # If we summarize the data, there will be a missing row for Male+Placebo,
    # since there were no cases with this combination.
    cdataSub <- summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
    cdataSub
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    
    # This will fill in the missing combinations with zeros
    fillMissingCombs(cdataSub, factors=c("sex","condition"), measures=c("N","change","sd","se","ci"))
    #>   sex condition  N    change        sd        se        ci
    #> 1   F   aspirin  5 -3.420000 0.8642916 0.3865230 1.0731598
    #> 2   F   placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
    #> 3   M   aspirin  9 -5.411111 1.1307569 0.3769190 0.8691767
    #> 4   M   placebo  0  0.000000 0.0000000 0.0000000 0.0000000
    
    

    使用 aggregate

    aggregate函数比较难用,但它内置于R,所以不需要按照其他包。

    # 对每个目录 (sex*condition) 中的对象计数
    cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
    cdata
    #>   sex condition subject
    #> 1   F   aspirin       5
    #> 2   M   aspirin       9
    #> 3   F   placebo      12
    #> 4   M   placebo       4
    
    # 重命名 "subject" 列为 "N"
    names(cdata)[names(cdata)=="subject"] <- "N"
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 2   M   aspirin  9
    #> 3   F   placebo 12
    #> 4   M   placebo  4
    
    # 按性别排序
    cdata <- cdata[order(cdata$sex),]
    cdata
    #>   sex condition  N
    #> 1   F   aspirin  5
    #> 3   F   placebo 12
    #> 2   M   aspirin  9
    #> 4   M   placebo  4
    
    # 我们也保留 before 和 after列:
    # 得到性别和条件下的平均影响大小
    # Get the average effect size by sex and condition
    cdata.means <- aggregate(data[c("before","after","change")], 
                             by = data[c("sex","condition")], FUN=mean)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000
    
    # 融合数据框
    cdata <- merge(cdata, cdata.means)
    cdata
    #>   sex condition  N   before     after    change
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000
    
    # 得到标准差
    cdata.sd <- aggregate(data["change"],
                          by = data[c("sex","condition")], FUN=sd)
    # 重命名列
    names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
    cdata.sd
    #>   sex condition change.sd
    #> 1   F   aspirin 0.8642916
    #> 2   M   aspirin 1.1307569
    #> 3   F   placebo 0.5247655
    #> 4   M   placebo 0.7804913
    
    # 融合
    cdata <- merge(cdata, cdata.sd)
    cdata
    #>   sex condition  N   before     after    change change.sd
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913
    
    # 计算标准误
    cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
    cdata
    #>   sex condition  N   before     after    change change.sd change.se
    #> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230
    #> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867
    #> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190
    #> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
    
    

    如果你有NA值想要跳过,设置 na.rm=TRUE:

    cdata.means <- aggregate(data[c("before","after","change")], 
                             by = data[c("sex","condition")],
                             FUN=mean, na.rm=TRUE)
    cdata.means
    #>   sex condition   before     after    change
    #> 1   F   aspirin 11.06000  7.640000 -3.420000
    #> 2   M   aspirin 11.26667  5.855556 -5.411111
    #> 3   F   placebo 10.13333  8.075000 -2.058333
    #> 4   M   placebo 11.47500 10.500000 -0.975000
    
    

    原文链接:http://www.cookbook-r.com/Manipulating_data/Summarizing_data/

    相关文章

      网友评论

        本文标题:【r<-高级|统计|方案】汇总数据

        本文链接:https://www.haomeiwen.com/subject/uxyhvxtx.html