DataCamp Course, Chapter 3


Author: Jason数据分析生信教室 | Published: 2021-07-09 10:14

    Tidyverse course outline

    Chapter 1. Data Wrangling
    Chapter 2. Data Visualization
    Chapter 3. Grouping and Summarizing
    Chapter 4. Types of Visualizations

    Chapter 3. Grouping and Summarizing

    Descriptive statistics with summarize


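    summarize collapses a variable into a single value, using a summary function you specify (for example, the mean or the median).
    As an example, let's look at the median of lifeExp.

    # Load the gapminder data and dplyr
    library(gapminder)
    library(dplyr)

    # Summarize to find the median life expectancy
    gapminder %>%
      summarize(medianLifeExp = median(lifeExp))
    #> # A tibble: 1 x 1
    #>   medianLifeExp
    #>           <dbl>
    #> 1          60.7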
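
To make the collapsing behavior concrete, here is a minimal sketch on a made-up table (the `scores` data and its column names are hypothetical, not part of gapminder). It assumes dplyr is installed. However many rows go in, summarize returns a single row with one column per summary expression.

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
scores <- tibble(
  student = c("a", "b", "c", "d", "e"),
  score   = c(50, 60, 70, 80, 100)
)

# summarize() collapses all five rows into one row,
# with one column per summary expression
score_summary <- scores %>%
  summarize(meanScore = mean(score),
            medianScore = median(score))

score_summary
```

Here the result has one row, with meanScore 72 and medianScore 70.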
    

    Next, combine this with filter from the earlier chapter to find the median lifeExp in the 1957 data only.

    # Filter for 1957 then summarize the median life expectancy
    gapminder %>%
      filter(year == 1957) %>%
      summarize(medianLifeExp = median(lifeExp))
    #> # A tibble: 1 x 1
    #>   medianLifeExp
    #>           <dbl>
    #> 1          48.4
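
The order of the two verbs matters: filter runs first, so summarize only sees the rows that survive. A small sketch under the assumption of a made-up `measurements` table (hypothetical names and values) shows the difference between summarizing everything and summarizing one year:

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
measurements <- tibble(
  year  = c(2000, 2000, 2000, 2010, 2010, 2010),
  value = c(1, 2, 9, 10, 20, 60)
)

# Median over every row
overall <- measurements %>%
  summarize(medianValue = median(value))

# Median over the 2010 rows only: filter() runs first,
# so summarize() never sees the rows from 2000
only_2010 <- measurements %>%
  filter(year == 2010) %>%
  summarize(medianValue = median(value))
```

With these values, `overall` holds 9.5 and `only_2010` holds 20.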
    

    You can also summarize two variables at once. For example, use max() to find a maximum alongside the median.

    # Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
    gapminder %>%
      filter(year == 1957) %>%
      summarize(medianLifeExp = median(lifeExp),
                maxGdpPercap = max(gdpPercap))
    #> # A tibble: 1 x 2
    #>   medianLifeExp maxGdpPercap
    #>           <dbl>        <dbl>
    #> 1          48.4      113523.
    

    Grouped descriptive statistics with group_by


    Calling group_by before summarize computes the summaries separately for each value of the grouping variable. For example, the code below finds the median lifeExp and the maximum gdpPercap for each year.

    # Find median life expectancy and maximum GDP per capita in each year
    gapminder %>%
      group_by(year) %>%
      summarize(medianLifeExp = median(lifeExp),
                maxGdpPercap = max(gdpPercap))
    #> # A tibble: 12 x 3
    #>     year medianLifeExp maxGdpPercap
    #>  * <int>         <dbl>        <dbl>
    #>  1  1952          45.1      108382.
    #>  2  1957          48.4      113523.
    #>  3  1962          50.9       95458.
    #>  4  1967          53.8       80895.
    #>  5  1972          56.5      109348.
    #>  6  1977          59.7       59265.
    #>  7  1982          62.4       33693.
    #>  8  1987          65.8       31541.
    #>  9  1992          67.7       34933.
    #> 10  1997          69.4       41283.
    #> 11  2002          70.8       44684.
    #> 12  2007          71.9       49357.
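
The same pattern on a table small enough to check by hand may help. This is a sketch assuming a made-up `sales` table (hypothetical names and values): group_by marks the groups, and summarize then returns one row per group instead of one row overall.

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
sales <- tibble(
  region = c("north", "north", "north", "south", "south"),
  amount = c(10, 20, 60, 5, 15)
)

# One output row per distinct value of region
by_region <- sales %>%
  group_by(region) %>%
  summarize(medianAmount = median(amount),
            maxAmount = max(amount))

by_region
```

The result has two rows: north (median 20, maximum 60) and south (median 10, maximum 15).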
    

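    Combined with filter, this time we find the median lifeExp and the maximum gdpPercap for each continent within the 1957 data.

    # Find median life expectancy and maximum GDP per capita in each continent in 1957
    gapminder %>%
      filter(year == 1957) %>%
      group_by(continent) %>%
      summarize(medianLifeExp = median(lifeExp),
                maxGdpPercap = max(gdpPercap))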
    

    group_by also accepts more than one variable, for example:

    # Find median life expectancy and maximum GDP per capita in each continent/year combination
    gapminder %>%
      group_by(continent, year) %>%
      summarize(medianLifeExp = median(lifeExp),
                maxGdpPercap = max(gdpPercap))
    #> `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
    #> # A tibble: 60 x 4
    #> # Groups:   continent [5]
    #>    continent  year medianLifeExp maxGdpPercap
    #>    <fct>     <int>         <dbl>        <dbl>
    #>  1 Africa     1952          38.8        4725.
    #>  2 Africa     1957          40.6        5487.
    #>  3 Africa     1962          42.6        6757.
    #>  4 Africa     1967          44.7       18773.
    #>  5 Africa     1972          47.0       21011.
    #>  6 Africa     1977          49.3       21951.
    #>  7 Africa     1982          50.8       17364.
    #>  8 Africa     1987          51.6       11864.
    #>  9 Africa     1992          52.4       13522.
    #> 10 Africa     1997          52.8       14723.
    #> # … with 50 more rows
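
The message in the output above is informational, not an error: after summarize, dplyr drops only the last grouping variable (year), so the result is still grouped by continent. The sketch below uses a made-up `sales` table (hypothetical names and values) and assumes dplyr 1.0.0 or later, where summarize has a `.groups` argument. It compares the default with `.groups = "drop"`.

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
sales <- tibble(
  region = c("north", "north", "south", "south"),
  year   = c(2000, 2010, 2000, 2010),
  amount = c(10, 20, 5, 15)
)

# Default: only the last grouping variable (year) is dropped,
# so the result is still grouped by region
kept <- sales %>%
  group_by(region, year) %>%
  summarize(maxAmount = max(amount))

# .groups = "drop" returns an ungrouped tibble and silences the message
dropped <- sales %>%
  group_by(region, year) %>%
  summarize(maxAmount = max(amount), .groups = "drop")

group_vars(kept)     # "region"
group_vars(dropped)  # character(0)
```

Dropping the groups explicitly is a reasonable habit when later steps should treat the summary as an ordinary table.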
    

    Visualizing descriptive statistics

    First, summarize by year to get each year's median lifeExp and maximum gdpPercap, then plot the result with ggplot2. The call expand_limits(y = 0) is added so that the y-axis includes 0.

    library(ggplot2)

    by_year <- gapminder %>%
      group_by(year) %>%
      summarize(medianLifeExp = median(lifeExp),
                maxGdpPercap = max(gdpPercap))

    # Create a scatter plot showing the change in medianLifeExp over time
    ggplot(by_year, aes(x = year, y = medianLifeExp)) +
      geom_point() +
      expand_limits(y = 0)
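
To see what expand_limits(y = 0) changes, one option is to compare the range that the y scale covers with and without it. This is a sketch on a made-up `trend` table (hypothetical names and values), assuming ggplot2 is installed. layer_scales() is used here only to inspect the built plot and is not needed for normal plotting.

```r
library(ggplot2)

# Hypothetical toy data, not taken from gapminder
trend <- data.frame(
  year  = c(2000, 2005, 2010),
  value = c(50, 55, 60)
)

base_plot <- ggplot(trend, aes(x = year, y = value)) +
  geom_point()

# Same plot, but the y scale is asked to include 0
with_zero <- base_plot + expand_limits(y = 0)

# Range of values each y scale was trained on
range_base <- layer_scales(base_plot)$y$range$range
range_zero <- layer_scales(with_zero)$y$range$range
```

Without the call, the axis starts near the smallest data value, which can exaggerate small changes. With it, the axis reaches down to 0.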
    

    Next comes a slightly more complex plot, which draws on the data visualization material from Chapter 2. First, group the data by year and continent and compute the median gdpPercap. Then plot the result with year on the x-axis and medianGdpPercap on the y-axis, colored by continent.

    # Summarize medianGdpPercap within each continent within each year: by_year_continent
    by_year_continent <- gapminder %>%
      group_by(year, continent) %>%
      summarize(medianGdpPercap = median(gdpPercap))

    # Plot the change in medianGdpPercap in each continent over time
    ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
      geom_point() +
      expand_limits(y = 0)
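
This chapter groups by continent and year in one example and by year and continent in another. The order affects how rows are sorted and which grouping remains afterward, but not the summarized values. A minimal check on a made-up `sales` table (hypothetical names and values), assuming dplyr 1.0.0 or later for `.groups`:

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
sales <- tibble(
  region = c("north", "north", "south", "south", "north"),
  year   = c(2000, 2010, 2000, 2010, 2010),
  amount = c(10, 20, 5, 15, 40)
)

# .groups = "drop" returns ungrouped results so the two can be compared directly
by_year_region <- sales %>%
  group_by(year, region) %>%
  summarize(medianAmount = median(amount), .groups = "drop")

by_region_year <- sales %>%
  group_by(region, year) %>%
  summarize(medianAmount = median(amount), .groups = "drop")

# Put both results in the same row and column order, then compare
same_values <- isTRUE(all.equal(
  as.data.frame(by_year_region %>% arrange(region, year)),
  as.data.frame(by_region_year %>% select(year, region, medianAmount))
))
```

For plotting, either order gives the same points, since ggplot only reads the columns.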
    

    You can also visualize the relationship between two summary statistics. For example, the code below plots the median gdpPercap against the median lifeExp for 2007, colored by continent.

    # Summarize the median GDP and median life expectancy per continent in 2007
    by_continent_2007 <- gapminder %>%
      filter(year == 2007) %>%
      group_by(continent) %>%
      summarize(medianGdpPercap = median(gdpPercap),
                medianLifeExp = median(lifeExp))

    # Use a scatter plot to compare the median GDP and median life expectancy
    ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
      geom_point() +
      expand_limits(y = 0)
    


    Article link: https://www.haomeiwen.com/subject/tcooultx.html