R Basics (11): Summarising Data with summarise


Author: R语言数据分析指南 | Published: 2021-06-18 15:41

    This section introduces three important dplyr functions: count(), summarise() and group_by().

    count

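    count() tallies the number of observations for each value of a variable; sort = TRUE puts the largest groups first.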

    library(tidyverse)
    
    msleep %>%
      count(order, sort = TRUE)
    
       order               n
       <chr>           <int>
     1 Rodentia           22
     2 Carnivora          12
     3 Primates           12
     4 Artiodactyla        6
     5 Soricomorpha        5
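
    As a rough illustration of what count() is doing, the sketch below writes the same tally by hand with group_by(), summarise() and arrange(). It assumes dplyr and ggplot2 are installed (msleep ships with ggplot2); the names by_count and by_hand are made up for this example.

```r
library(dplyr)

# msleep ships with ggplot2; referencing it directly means only dplyr is attached
msleep <- ggplot2::msleep

# The shorthand used in the text above
by_count <- msleep %>%
  count(order, sort = TRUE)

# The same tally written out step by step
by_hand <- msleep %>%
  group_by(order) %>%        # one group per taxonomic order
  summarise(n = n()) %>%     # number of rows in each group
  arrange(desc(n))           # largest groups first, like sort = TRUE

# Compare the two results
all.equal(as.data.frame(by_count), as.data.frame(by_hand))
```

    Seen this way, count() is mainly a convenience: the longer form becomes useful once other summaries are needed alongside the row count.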
    

    Several variables can be passed to a single count() call, in which case each combination of values is counted

    msleep %>%
      count(order, vore, sort = TRUE)
    
       order          vore        n
       <chr>          <chr>   <int>
     1 Rodentia       herbi      16
     2 Carnivora      carni      12
     3 Primates       omni       10
     4 Artiodactyla   herbi       5
    

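    summarise

    The summarise() function in dplyr (summarize() is an identical alias) collapses a data frame into summary statistics with compact, readable code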

    msleep %>%
      summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))
    
    ## # A tibble: 1 x 3
    ##       n average maximum
    ##   <int>   <dbl>   <dbl>
    ## 1    83    10.4    19.9
    

    group_by(): summarise by group

    msleep %>%
      group_by(vore) %>%
      summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))
    
    ## # A tibble: 5 x 4
    ##   vore        n average maximum
    ##   <chr>   <int>   <dbl>   <dbl>
    ## 1 carni      19   10.4     19.4
    ## 2 herbi      32    9.51    16.6
    ## 3 insecti     5   14.9     19.9
    ## 4 omni       20   10.9     18.0
    ## 5 <NA>        7   10.2     13.7
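
    group_by() also accepts more than one variable. The sketch below is only an illustration: it groups by order and vore together and passes .groups = "drop" (available in dplyr 1.0.0 and later) so that the result comes back ungrouped. The name by_order_vore is hypothetical, and dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

by_order_vore <- msleep %>%
  group_by(order, vore) %>%              # one group per (order, vore) pair
  summarise(
    n       = n(),                       # species in the group
    average = mean(sleep_total),         # mean hours of sleep
    .groups = "drop"                     # return an ungrouped tibble
  )

by_order_vore
```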
    

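    summarise() works with almost any aggregation function, and the results can be combined with further arithmetic:

    • n() - the number of observations
    • n_distinct(var) - the number of unique values of var
    • sum(var), max(var), min(var), ...
    • mean(var), median(var), sd(var), IQR(var)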
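
    To make the list above concrete, here is a small sketch that uses several of these functions in one summarise() call. The output column names (n_species, n_orders and so on) and the object sleep_stats are invented for the example; dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

sleep_stats <- msleep %>%
  group_by(vore) %>%
  summarise(
    n_species    = n(),                                   # rows per group
    n_orders     = n_distinct(order),                     # distinct orders per group
    median_sleep = median(sleep_total),                   # middle value
    sd_sleep     = sd(sleep_total),                       # spread
    range_sleep  = max(sleep_total) - min(sleep_total)    # arithmetic on two aggregates
  )

sleep_stats
```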

    Divide the mean of sleep_total by 24 to get the fraction of a day spent asleep

    msleep %>%
      group_by(vore) %>%
      summarise(avg_sleep_day = mean(sleep_total)/24)
    
    ## # A tibble: 5 x 2
    ##   vore    avg_sleep_day
    ##   <chr>           <dbl>
    ## 1 carni           0.432
    ## 2 herbi           0.396
    ## 3 insecti         0.622
    ## 4 omni            0.455
    ## 5 <NA>            0.424
    

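    summarise_all() takes a function as its argument and applies it to every non-grouping column; the example computes the mean of each column. Non-numeric columns such as name and genus come back as NA, because mean() is not defined for character data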

    msleep %>%
      group_by(vore) %>%
      summarise_all(mean, na.rm=TRUE)
    
    ## # A tibble: 5 x 11
    ##   vore     name genus order conservation sleep_total sleep_rem sleep_cycle
    ##   <chr>   <dbl> <dbl> <dbl>        <dbl>       <dbl>     <dbl>       <dbl>
    ## 1 carni      NA    NA    NA           NA       10.4       2.29       0.373
    ## 2 herbi      NA    NA    NA           NA        9.51      1.37       0.418
    ## 3 insecti    NA    NA    NA           NA       14.9       3.52       0.161
    

    A formula can be used in place of a function name; here 5 is added to each column's mean

    msleep %>%
      group_by(vore) %>%
      summarise_all(~mean(., na.rm = TRUE) + 5)
    
    ##   vore     name genus order conservation sleep_total sleep_rem sleep_cycle
    ##   <chr>   <dbl> <dbl> <dbl>        <dbl>       <dbl>     <dbl>       <dbl>
    ## 1 carni      NA    NA    NA           NA        15.4      7.29        5.37
    ## 2 herbi      NA    NA    NA           NA        14.5      6.37        5.42
    ## 3 insecti    NA    NA    NA           NA        19.9      8.52        5.16
    

    summarise_if()

    Compute the mean of every numeric column

    msleep %>%
      group_by(vore) %>%
      summarise_if(is.numeric, mean, na.rm=TRUE)
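
    Since dplyr 1.0.0 the scoped verbs (summarise_all(), summarise_if(), summarise_at()) are superseded by across(). The following is a sketch of how the summarise_if() call above might be written in that style, not a required change; it assumes dplyr 1.0.0 or later and ggplot2 are installed, and numeric_means is a hypothetical name.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

numeric_means <- msleep %>%
  group_by(vore) %>%
  # where(is.numeric) selects the numeric columns; the lambda is applied to each
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

numeric_means
```

    Replacing where(is.numeric) with everything() gives the across() counterpart of summarise_all().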
    

    rename_if() renames the columns that satisfy a predicate; here every numeric column receives an avg_ prefix

    msleep %>%
      group_by(vore) %>%
      summarise_if(is.numeric, mean, na.rm=TRUE) %>%
      rename_if(is.numeric, ~paste0("avg_", .))
    
    ##   vore    avg_sleep_total avg_sleep_rem avg_sleep_cycle avg_awake
    ##   <chr>             <dbl>         <dbl>           <dbl>     <dbl>
    ## 1 carni             10.4           2.29           0.373     13.6
    ## 2 herbi              9.51          1.37           0.418     14.5
    ## 3 insecti           14.9           3.52           0.161      9.06
    

    summarise_at()

    The code below returns the mean of every column whose name contains "sleep" and renames each of them to avg_<variable>

    msleep %>%
      group_by(vore) %>%
      summarise_at(vars(contains("sleep")), mean, na.rm=TRUE) %>%
      rename_at(vars(contains("sleep")), ~paste0("avg_", .))
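
    With across(), the summary and the renaming can be done in a single step through the .names argument. This is a sketch under the assumption that dplyr 1.0.0 or later and ggplot2 are installed; sleep_means is a hypothetical name.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

sleep_means <- msleep %>%
  group_by(vore) %>%
  summarise(across(
    contains("sleep"),                 # sleep_total, sleep_rem, sleep_cycle
    ~ mean(.x, na.rm = TRUE),          # mean of each selected column
    .names = "avg_{.col}"              # prefix each output column with avg_
  ))

names(sleep_means)
```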
    

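    top_n()

    Keep the 5 rows with the highest values. When no ranking column is given, top_n() ranks by the last column of the data (here average)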

    msleep %>%
      group_by(order) %>%
      summarise(average = mean(sleep_total)) %>%
      top_n(5)
    

    Keep the 5 rows with the lowest values by passing a negative number

    msleep %>%
      group_by(order) %>%
      summarise(average = mean(sleep_total)) %>%
      top_n(-5)
    

    When the summary has several columns, name the one to rank by; this example keeps the 5 highest values of average_sleep

    msleep %>%
      group_by(order) %>%
      summarise(average_sleep = mean(sleep_total), max_sleep = max(sleep_total)) %>%
      top_n(5, average_sleep)
    
    ##   order           average_sleep max_sleep
    ##   <chr>                   <dbl>     <dbl>
    ## 1 Afrosoricida             15.6      15.6
    ## 2 Chiroptera               19.8      19.9
    ## 3 Cingulata                17.8      18.1
    ## 4 Didelphimorphia          18.7      19.4
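
    top_n() is superseded in dplyr 1.0.0 and later by slice_max() and slice_min(), which take the ranking column first and also return the rows sorted. The sketch below shows one possible rewrite of the example above; the names sleep_by_order, top5 and bottom5 are invented for the illustration, and dplyr (1.0.0 or later) and ggplot2 are assumed to be installed.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

sleep_by_order <- msleep %>%
  group_by(order) %>%
  summarise(average_sleep = mean(sleep_total), max_sleep = max(sleep_total))

# Five highest averages, sorted from highest to lowest
top5 <- sleep_by_order %>% slice_max(average_sleep, n = 5)

# Five lowest averages, sorted from lowest to highest
bottom5 <- sleep_by_order %>% slice_min(average_sleep, n = 5)

top5
```

    Both functions keep tied rows by default, so the result can contain more than n rows; with_ties = FALSE forces exactly n.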
    

    sample_frac() randomly selects a fraction of the rows (10% here)

    msleep %>% sample_frac(.1)
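
    Because the selection is random, each run returns different rows. A minimal sketch of a reproducible version follows, assuming dplyr 1.0.0 or later (where slice_sample() supersedes sample_frac()) and ggplot2 are installed; the seed value 42 and the name sampled are arbitrary choices for the example.

```r
library(dplyr)

# msleep ships with ggplot2
msleep <- ggplot2::msleep

set.seed(42)                                        # fix the random number generator
sampled <- msleep %>% slice_sample(prop = 0.1)      # roughly 10% of the rows

set.seed(42)                                        # same seed, same rows
sampled_again <- msleep %>% slice_sample(prop = 0.1)

identical(sampled, sampled_again)
```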
    


    Article link: https://www.haomeiwen.com/subject/vhyseltx.html