DataCamp Course Chapter.3

Author: Jason数据分析生信教室 | Published: 2021-07-09 10:14

Tidyverse course outline

Chapter 1. Data wrangling
Chapter 2. Data visualization
Chapter 3. Grouping and summarizing
Chapter 4. Types of visualizations

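Chapter 3. Grouping and summarizing

Descriptive statistics with `summarize`

`summarize` collapses a variable into a single value using a statistic you specify, such as the mean or the median.
For example, to find the median of lifeExp across the whole dataset:

# Load the packages used throughout this chapter
library(gapminder)
library(dplyr)

# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))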
# A tibble: 1 x 1
  medianLifeExp
          <dbl>
1          60.7
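
To make the idea of "a statistic you specify" concrete, here is a minimal sketch on a tiny made-up table (the `toy` data and its values are assumptions for illustration, not gapminder rows). It shows that swapping the function inside `summarize` changes the statistic while the result stays a single row.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  country = c("A", "B", "C", "D"),
  lifeExp = c(40, 60, 62, 78)
)

# Each named argument becomes one column of the one-row result
toy_summary <- toy %>%
  summarize(meanLifeExp = mean(lifeExp),      # (40 + 60 + 62 + 78) / 4 = 60
            medianLifeExp = median(lifeExp))  # (60 + 62) / 2 = 61

toy_summary
```

Because the mean and the median react differently to extreme values, it is often worth computing both before settling on one.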

Next, combine this with the filter verb from the earlier chapter to find the median lifeExp among the observations where year is 1957.

# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
# A tibble: 1 x 1
  medianLifeExp
          <dbl>
1          48.4

You can also summarize two variables in the same call. For example, max() returns the largest value of a variable.

# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# A tibble: 1 x 2
  medianLifeExp maxGdpPercap
          <dbl>        <dbl>
1          48.4      113523.

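Grouped descriptive statistics with `group_by`

Calling group_by before summarize computes the statistics separately for each value of the grouping variable. For example, the code below finds the median lifeExp and the maximum gdpPercap for each year.

# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))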
# A tibble: 12 x 3
    year medianLifeExp maxGdpPercap
 * <int>         <dbl>        <dbl>
 1  1952          45.1      108382.
 2  1957          48.4      113523.
 3  1962          50.9       95458.
 4  1967          53.8       80895.
 5  1972          56.5      109348.
 6  1977          59.7       59265.
 7  1982          62.4       33693.
 8  1987          65.8       31541.
 9  1992          67.7       34933.
10  1997          69.4       41283.
11  2002          70.8       44684.
12  2007          71.9       49357.

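Combining this with filter, we can find the median lifeExp and the maximum gdpPercap for each continent, using only the observations where year is 1957.

# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))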
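
To see what the filter-then-group pattern returns, here is a minimal sketch on a made-up table (the `toy` data, the continent labels "X" and "Y", and all values are assumptions for illustration). Rows from other years are dropped first, and the result then has one row per remaining group.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  continent = c("X", "X", "Y", "Y", "Y"),
  year      = c(1957, 1957, 1957, 1957, 1962),
  lifeExp   = c(40, 50, 60, 70, 99),
  gdpPercap = c(1000, 3000, 2000, 8000, 9000)
)

by_continent <- toy %>%
  filter(year == 1957) %>%    # the 1962 row is removed before grouping
  group_by(continent) %>%     # one group per continent
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

by_continent
# X: medianLifeExp 45, maxGdpPercap 3000
# Y: medianLifeExp 65, maxGdpPercap 8000
```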

group_by also accepts more than one variable, for example:

# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
# A tibble: 60 x 4
# Groups:   continent [5]
   continent  year medianLifeExp maxGdpPercap
   <fct>     <int>         <dbl>        <dbl>
 1 Africa     1952          38.8        4725.
 2 Africa     1957          40.6        5487.
 3 Africa     1962          42.6        6757.
 4 Africa     1967          44.7       18773.
 5 Africa     1972          47.0       21011.
 6 Africa     1977          49.3       21951.
 7 Africa     1982          50.8       17364.
 8 Africa     1987          51.6       11864.
 9 Africa     1992          52.4       13522.
10 Africa     1997          52.8       14723.
# … with 50 more rows
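
The message printed above means that, after summarizing over two grouping variables, the result is still grouped by the first one. The sketch below, on a made-up table (the `toy` data is an assumption for illustration), compares the default behavior with `.groups = "drop"`, the argument the message refers to. It assumes dplyr 1.0.0 or later, where `.groups` is available.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  continent = c("X", "X", "Y", "Y", "Y"),
  year      = c(1957, 1957, 1957, 1957, 1962),
  lifeExp   = c(40, 50, 60, 70, 99)
)

# Default: only the last grouping variable (year) is peeled off
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp))

group_vars(still_grouped)  # "continent"

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop")

group_vars(ungrouped)      # character(0)
```

Dropping the groups is usually the safer choice when the result feeds into further steps that should treat all rows together.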

Visualizing summarized data

First summarize the median lifeExp and the maximum gdpPercap for each year, then visualize the result with ggplot2. The expand_limits(y = 0) call is added so that the y-axis includes 0.

library(ggplot2)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)
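
To see what expand_limits(y = 0) changes, here is a minimal sketch that builds the same scatter plot twice on a made-up table (the `toy_by_year` data is an assumption for illustration). Without the call, ggplot2 fits the y-axis to the range of the data; with it, the axis is stretched to include 0.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy_by_year <- data.frame(
  year          = c(1952, 1957, 1962),
  medianLifeExp = c(45, 48, 51)
)

# y-axis fitted to the data, so it starts a little below 45
p_default <- ggplot(toy_by_year, aes(x = year, y = medianLifeExp)) +
  geom_point()

# Same plot, with the y-axis extended to include 0
p_zero <- p_default + expand_limits(y = 0)

p_default
p_zero
```

Starting the axis at 0 makes the size of the change easier to judge, because small differences are no longer magnified by a truncated axis.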

Next comes a slightly more complex plot that draws on the material from Chapter 2 (data visualization). First group the data by year and continent and compute the median gdpPercap. Then plot the result with year on the x-axis and medianGdpPercap on the y-axis, coloring the points by continent.

# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

You can also plot the relationship between two summary statistics. For example, the code below plots the median gdpPercap against the median lifeExp for 2007 and colors the points by continent.

# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap),
            medianLifeExp = median(lifeExp))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point() +
  expand_limits(y = 0)
