DataCamp Course <Data Manipulation with dplyr> Chapter 3

Author: Jason数据分析生信教室 | Published 2021-07-16 13:59

    Course outline: Data Manipulation with dplyr

    Chapter 1. Transforming data
    Chapter 2. Aggregating data
    Chapter 3. Selecting and transforming data
    Chapter 4. Hands-on practice

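    Some of the material in the first half of this chapter already appeared in the earlier <Tidyverse> notes, so the repeated parts are covered only briefly. The second half introduces some new content, which is explained in a little more detail. I hope it is helpful.

    select(): selecting variables

    Use select() to choose variables, then arrange() to sort the rows by one of them.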

    counties %>%
      # Select state, county, population, and industry-related columns
      select(state, county, population, professional, service, office, construction, production) %>%
      # Arrange service in descending order 
      arrange(desc(service))
    # A tibble: 3,138 x 8
       state   county population professional service office construction production
       <chr>   <chr>       <dbl>        <dbl>   <dbl>  <dbl>        <dbl>      <dbl>
     1 Missis~ Tunica      10477         23.9    36.6   21.5          3.5       14.5
     2 Texas   Kinney       3577         30      36.5   11.6         20.5        1.3
     3 Texas   Kenedy        565         24.9    34.1   20.5         20.5        0  
     4 New Yo~ Bronx     1428357         24.3    33.3   24.2          7.1       11  
     5 Texas   Brooks       7221         19.6    32.4   25.3         11.1       11.5
     6 Colora~ Fremo~      46809         26.6    32.2   22.8         10.7        7.6
     7 Texas   Culbe~       2296         20.1    32.2   24.2         15.7        7.8
     8 Califo~ Del N~      27788         33.9    31.5   18.8          8.9        6.8
     9 Minnes~ Mahno~       5496         26.8    31.5   18.7         13.1        9.9
    10 Virgin~ Lanca~      11129         30.3    31.2   22.8          8.1        7.6
    # ... with 3,128 more rows
    

    Use filter() to keep only the rows that meet a condition.

    counties %>%
      # Select the state, county, population, and those ending with "work"
      select(state, county, population, ends_with("work")) %>%
      # Filter for counties that have at least 50% of people engaged in public work
      filter(public_work >= 50)
    # A tibble: 7 x 6
      state      county              population private_work public_work family_work
      <chr>      <chr>                    <dbl>        <dbl>       <dbl>       <dbl>
    1 Alaska     Lake and Peninsula~       1474         42.2        51.6         0.2
    2 Alaska     Yukon-Koyukuk Cens~       5644         33.3        61.7         0  
    3 California Lassen                   32645         42.6        50.5         0.1
    4 Hawaii     Kalawao                     85         25          64.1         0  
    5 North Dak~ Sioux                     4380         32.9        56.8         0.1
    6 South Dak~ Todd                      9942         34.4        55           0.8
    7 Wisconsin  Menominee                 4451         36.8        59.1         0.4
    

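    Other uses of select()

    When a dataset has many variables, typing them in one at a time clearly hurts productivity. select() supports selecting variables in bulk; for example, a range such as drive:work_at_home picks every column from drive through work_at_home.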

    counties %>%
      select(state, county, drive:work_at_home)
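
    The range syntax depends on column order, which is easier to see on a small table. The sketch below uses a hypothetical tibble named toy whose column names mirror a few of those in counties; the values are invented for illustration only.

```r
library(dplyr)

# Hypothetical toy data: column names follow `counties`, values are made up.
toy <- tibble(
  state        = c("Alabama", "Alaska"),
  county       = c("Autauga", "Anchorage"),
  drive        = c(87.5, 74.2),
  carpool      = c(8.8, 12.1),
  transit      = c(0.1, 1.9),
  work_at_home = c(1.8, 4.0),
  poverty      = c(12.9, 8.4)
)

# drive:work_at_home selects every column positioned from drive
# through work_at_home, so poverty (which comes after) is left out.
commute <- toy %>%
  select(state, county, drive:work_at_home)

names(commute)
```

    Because the range is positional, reordering the columns of the input changes what a range like this returns.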
    
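    • contains(): variables whose names contain xx
    • starts_with(): variables whose names start with xx
    • ends_with(): variables whose names end with xx
      For example, ends_with("work") matches private_work, public_work and family_work.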
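
    The three helpers can be compared side by side on a small table. This is a minimal sketch on a hypothetical tibble named toy; the column names are borrowed from counties where possible, while income_err and all of the values are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- tibble(
  state        = c("Hawaii", "Wisconsin"),
  private_work = c(25, 36.8),
  public_work  = c(64.1, 59.1),
  family_work  = c(0, 0.4),
  work_at_home = c(1.2, 2.3),
  income       = c(50000, 40000),
  income_err   = c(1500, 1200)
)

# Columns whose names contain "work" anywhere
by_contains <- toy %>% select(contains("work"))

# Columns whose names start with "income"
by_prefix <- toy %>% select(starts_with("income"))

# Columns whose names end with "work" (work_at_home is not matched)
by_suffix <- toy %>% select(ends_with("work"))

names(by_contains)
names(by_prefix)
names(by_suffix)
```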

    select() can also drop a variable: put a minus sign in front of its name.
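
    A minimal sketch of dropping columns, again on a hypothetical tibble named toy with invented values:

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- tibble(
  state      = c("Alabama", "Alabama"),
  county     = c("Autauga", "Baldwin"),
  population = c(55000, 200000),
  poverty    = c(12.9, 13.4)
)

# A minus sign removes the named column and keeps everything else
without_poverty <- toy %>%
  select(-poverty)

# Several columns can be dropped at once by wrapping them in c()
ids_only <- toy %>%
  select(-c(population, poverty))

names(without_poverty)
names(ids_only)
```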

    rename(): renaming variables

    This is the first time rename() appears; the code below shows how to use it.

    counties %>%
      count(state)
    # A tibble: 50 x 2
       state           n
       <chr>       <int>
     1 Alabama        67
     2 Alaska         28
     3 Arizona        15
     4 Arkansas       75
     5 California     58
     6 Colorado       64
     7 Connecticut     8
     8 Delaware        3
     9 Florida        67
    10 Georgia       159
    # ... with 40 more rows
      
    # Rename the n column to num_counties
    counties %>%
      count(state) %>% 
      rename(num_counties=n)
    # A tibble: 50 x 2
       state       num_counties
       <chr>              <int>
     1 Alabama               67
     2 Alaska                28
     3 Arizona               15
     4 Arkansas              75
     5 California            58
     6 Colorado              64
     7 Connecticut            8
     8 Delaware               3
     9 Florida               67
    10 Georgia              159
    # ... with 40 more rows
    

    You can also skip rename() and take the more direct route of renaming a column inside select().

    # Select state, county, and poverty as poverty_rate
    counties %>%
      select(state,county,poverty_rate=poverty)
    # A tibble: 3,138 x 3
       state   county   poverty_rate
       <chr>   <chr>           <dbl>
     1 Alabama Autauga          12.9
     2 Alabama Baldwin          13.4
     3 Alabama Barbour          26.7
     4 Alabama Bibb             16.8
     5 Alabama Blount           16.7
     6 Alabama Bullock          24.6
     7 Alabama Butler           25.4
     8 Alabama Calhoun          20.5
     9 Alabama Chambers         21.6
    10 Alabama Cherokee         19.2
    # ... with 3,128 more rows
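
    The two approaches differ in which columns survive. The sketch below contrasts them on a hypothetical tibble named toy; the data are invented, and only the column names follow counties.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- tibble(
  state      = c("Alabama", "Alabama"),
  county     = c("Autauga", "Baldwin"),
  population = c(55000, 200000),
  poverty    = c(12.9, 13.4)
)

# rename() changes the name and keeps every other column
renamed <- toy %>%
  rename(poverty_rate = poverty)

# select() with new_name = old_name renames and keeps only the listed columns
selected <- toy %>%
  select(state, county, poverty_rate = poverty)

names(renamed)   # state, county, population, poverty_rate
names(selected)  # state, county, poverty_rate
```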
    

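    transmute(): transforming and creating variables

    Features of transmute()

    • Selects and transforms variables in a single step
    • Keeps only the variables named in the call; all other columns are dropped

    For example, suppose we want to create a new variable density as population/land_area. With transmute() there is no need to call select() first and then mutate().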

    counties %>%
      # Keep the state, county, and populations columns, and add a density column
      transmute(state,county,population,density=population/land_area) %>% 
      # Filter for counties with a population greater than one million 
      filter(population > 1000000) %>% 
      # Sort density in ascending order 
      arrange(density)
    # A tibble: 41 x 4
       state      county         population density
       <chr>      <chr>               <dbl>   <dbl>
     1 California San Bernardino    2094769    104.
     2 Nevada     Clark             2035572    258.
     3 California Riverside         2298032    319.
     4 Arizona    Maricopa          4018143    437.
     5 Florida    Palm Beach        1378806    700.
     6 California San Diego         3223096    766.
     7 Washington King              2045756    967.
     8 Texas      Travis            1121645   1133.
     9 Florida    Hillsborough      1302884   1277.
    10 Florida    Orange            1229039   1360.
    # ... with 31 more rows
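
    To make the "no need for select() plus mutate()" point concrete, the sketch below builds the same result both ways on a hypothetical tibble named toy (invented values) and compares them.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- tibble(
  state      = c("Nevada", "Texas"),
  county     = c("Clark", "Travis"),
  population = c(1000, 5000),
  land_area  = c(10, 20),
  poverty    = c(15.0, 16.0)
)

# One step: keep three columns and compute density
via_transmute <- toy %>%
  transmute(state, county, population, density = population / land_area)

# Same result in several steps: select the inputs, compute density,
# then drop the helper column that is no longer needed
via_two_steps <- toy %>%
  select(state, county, population, land_area) %>%
  mutate(density = population / land_area) %>%
  select(-land_area)

all.equal(via_transmute, via_two_steps)
```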
    

    Syntax summary

                             Keep only the named variables   Keep the other variables too
    Values left unchanged    select()                        rename()
    Values changed           transmute()                     mutate()
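
    The four verbs in the summary can be checked against each other by looking at the column names each one returns. This is a minimal sketch on a hypothetical three-column tibble named toy with invented values.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only.
toy <- tibble(
  state      = c("Nevada", "Texas"),
  population = c(1000, 5000),
  land_area  = c(10, 20)
)

# Values unchanged, only the named columns kept
sel <- toy %>% select(state, population)

# Values unchanged, all columns kept (one is renamed)
ren <- toy %>% rename(pop = population)

# Values computed, only the named columns kept
tra <- toy %>% transmute(state, density = population / land_area)

# Values computed, all existing columns kept
mut <- toy %>% mutate(density = population / land_area)

names(sel)  # state, population
names(ren)  # state, pop, land_area
names(tra)  # state, density
names(mut)  # state, population, land_area, density
```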
