DataCamp Course <Data Manipulation with dplyr> Chapter 2

Author: Jason数据分析生信教室 | Published: 2021-07-15 12:17

    Data Manipulation with dplyr: course outline

    Chapter 1. Transforming data
    Chapter 2. Aggregating data
    Chapter 3. Selecting and transforming data
    Chapter 4. Case study

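    Chapter 2. Aggregating data

    Counting frequencies with count()

    Continuing from Chapter 1, first select several variables to form a new dataset. Then count how many times each region appears, and pass sort = TRUE so the most frequent region comes first.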

    counties_selected <- counties %>%
      select(county, region, state, population, citizens)
    
    counties_selected %>%
      count(region, sort = TRUE)
    # A tibble: 4 x 2
      region            n
      <chr>         <int>
    1 South          1420
    2 North Central  1054
    3 West            447
    4 Northeast       217
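    As a minimal sketch of what count() does, the example below rebuilds it from group_by(), summarize() and arrange(). The tibble toy and its values are hypothetical and only stand in for the counties data.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties dataset
toy <- tibble(
  county = c("A", "B", "C", "D", "E"),
  region = c("South", "West", "South", "South", "West")
)

# count() with sort = TRUE: rows per region, largest first
by_count <- toy %>%
  count(region, sort = TRUE)

# The same result spelled out step by step
by_hand <- toy %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

print(by_count)
```

    On this toy data both pipelines should give South with n = 3 followed by West with n = 2.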
    

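    Set wt to weight the count: instead of counting rows, count() sums the wt column within each group, and sort = TRUE then orders the result by that weighted total.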

    # Start from counties, because counties_selected above has no walk column
    counties %>%
      # Add population_walk containing the total number of people who walk to work 
      mutate(population_walk = population * walk / 100) %>%
      # Count weighted by the new column
      count(state, wt = population_walk, sort = TRUE)
    # A tibble: 50 x 2
       state                n
       <chr>            <dbl>
     1 New York      1237938.
     2 California    1017964.
     3 Pennsylvania   505397.
     4 Texas          430783.
     5 Illinois       400346.
     6 Massachusetts  316765.
     7 Florida        284723.
     8 New Jersey     273047.
     9 Ohio           266911.
    10 Washington     239764.
    # ... with 40 more rows
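    The sketch below illustrates that a weighted count is a grouped sum of the weight column. The tibble toy and its values are hypothetical, made small enough to check by hand.

```r
library(dplyr)

# Hypothetical toy data: walk is the percentage of people who walk to work
toy <- tibble(
  state      = c("X", "X", "Y"),
  population = c(1000, 3000, 2000),
  walk       = c(10, 5, 20)
)

# Weighted count: n is the sum of population_walk per state
weighted <- toy %>%
  mutate(population_walk = population * walk / 100) %>%
  count(state, wt = population_walk, sort = TRUE)

# The same computation written as an explicit grouped sum
by_hand <- toy %>%
  mutate(population_walk = population * walk / 100) %>%
  group_by(state) %>%
  summarize(n = sum(population_walk)) %>%
  arrange(desc(n))

print(weighted)
```

    Here state X has 100 + 150 = 250 walkers and state Y has 400, so Y is listed first.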
    

    Using mutate() and count() together

    Add a new variable with mutate(), then use it as the weight when counting by state, and sort the result.

    # Start from counties, because counties_selected above has no walk column
    counties %>%
      # Add population_walk containing the total number of people who walk to work 
      mutate(population_walk = population * walk / 100) %>%
      # Count weighted by the new column
      count(state, wt = population_walk, sort = TRUE)
    

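    Descriptive statistics with group_by() and summarize()

    group_by() splits the data by state, and summarize() then computes descriptive statistics for each group. Here it computes the sum of the selected columns.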

    # Group by state and find the total area and population
    # (start from counties, because counties_selected above has no land_area column)
    counties %>%
      select(state, county, population, land_area) %>%
      group_by(state) %>%
      summarize(total_area = sum(land_area), total_population = sum(population))
    # Returns one row per state with the columns state, total_area and total_population
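    To make the grouped sum concrete, here is a minimal sketch on a made-up table. The tibble toy and its values are hypothetical and only stand in for the counties data. It also shows that summarize() without group_by() collapses the whole table into a single row.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties dataset
toy <- tibble(
  state      = c("X", "X", "Y"),
  land_area  = c(10, 20, 5),
  population = c(100, 300, 200)
)

# Grouped: one row per state
totals <- toy %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area), total_population = sum(population))

# Ungrouped: the whole table collapses to a single row
overall <- toy %>%
  summarize(total_area = sum(land_area), total_population = sum(population))

print(totals)
print(overall)
```

    The grouped result has as many rows as there are distinct states, while the ungrouped result always has exactly one row.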
    

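    Now a slightly more involved exercise: group the data by region and state, sum the population of each group into total_pop, and then compute the mean and median of total_pop. The first summarize() drops the last grouping level (state), so the second summarize() runs once per region.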

    # Calculate the average_pop and median_pop columns 
    counties_selected %>%
      group_by(region, state) %>%
      summarize(total_pop = sum(population)) %>% 
      summarize(average_pop = mean(total_pop), median_pop = median(total_pop))
    # A tibble: 4 x 3
      region        average_pop median_pop
      <chr>               <dbl>      <dbl>
    1 North Central    5627687.    5580644
    2 Northeast        6221058.    3593222
    3 South            7370486     4804098
    4 West             5722755.    2798636
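    The two-step summary works because each summarize() removes one level of grouping. The sketch below makes that visible on hypothetical toy data; it passes .groups = "drop_last" explicitly (available since dplyr 1.0.0), which is the behaviour the pipeline above relies on.

```r
library(dplyr)

# Hypothetical toy data: several counties per state, several states per region
toy <- tibble(
  region     = c("South", "South", "South", "South", "West", "West"),
  state      = c("TX", "TX", "FL", "GA", "CA", "WA"),
  population = c(10, 20, 50, 4, 40, 8)
)

# Step 1: total population per state; the state grouping level is dropped
state_totals <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population), .groups = "drop_last")

# Only region is left as a grouping variable
remaining_groups <- group_vars(state_totals)

# Step 2: mean and median of the state totals within each region
region_stats <- state_totals %>%
  summarize(average_pop = mean(total_pop), median_pop = median(total_pop))

print(remaining_groups)
print(region_stats)
```

    In the toy data the South has state totals of 50, 4 and 30, giving a mean of 28 and a median of 30, while the West has totals of 40 and 8, giving 24 for both.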
    

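    Using top_n()

    After grouping, top_n() keeps the rows in each group that rank in the top n for a given variable; n can be any number you need.
    In the example below, the data is grouped by region and the row with the largest walk value in each group is kept, so n = 1.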

    # Start from counties, because counties_selected above has no metro or walk column
    counties %>%
      select(region, state, county, metro, population, walk) %>%
      group_by(region) %>%
      top_n(1, walk)
    # A tibble: 4 x 6
    # Groups:   region [4]
      region        state        county                 metro    population  walk
      <chr>         <chr>        <chr>                  <chr>         <dbl> <dbl>
    1 West          Alaska       Aleutians East Borough Nonmetro       3304  71.2
    2 Northeast     New York     New York               Metro       1629507  20.7
    3 North Central North Dakota McIntosh               Nonmetro       2759  17.5
    4 South         Virginia     Lexington city         Nonmetro       7071  31.7
    

    Likewise, to see the top two rows in each group, set n = 2.
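    A minimal sketch of both cases on hypothetical toy data follows. Since dplyr 1.0.0, top_n() is superseded by slice_max(), so the sketch also shows the slice_max() spelling; the tibble toy and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the counties dataset
toy <- tibble(
  region = c("South", "South", "South", "West", "West"),
  county = c("a", "b", "c", "d", "e"),
  walk   = c(3.1, 7.5, 5.0, 2.2, 9.9)
)

# The single highest walk value in each region
top1 <- toy %>%
  group_by(region) %>%
  top_n(1, walk)

# The same selection with slice_max()
top1_slice <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 1)

# The two highest walk values in each region
top2 <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 2)

print(top1)
print(top2)
```

    Both functions keep tied rows by default, so a group can return more than n rows when several rows share the same value.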

    Link to this article: https://www.haomeiwen.com/subject/bogrpltx.html