Tidyverse

Author: 不学无数YD | Published: 2021-07-16 22:12

    The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

    In short, it is a complete suite of data-processing packages; the core ones are listed further below.

    Data-processing workflow:

    1. Data import
    2. Data tidying
    3. Data exploration (visualization, statistical analysis)
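
The three steps above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the CSV contents, the temporary file, and the column names (`site`, `y2020`, `y2021`) are invented for the example and do not come from the original post.

```r
library(tidyverse)

# Hypothetical data: write a tiny CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,y2020,y2021",
             "A,10,12",
             "B,7,9"), csv_path)

# 1. Import (readr)
raw <- read_csv(csv_path, show_col_types = FALSE)

# 2. Tidy (tidyr): reshape to one row per site-year observation
tidy <- raw %>%
  pivot_longer(cols = c(y2020, y2021),
               names_to = "year",
               names_prefix = "y",
               values_to = "count") %>%
  mutate(year = as.integer(year))

# 3. Explore (dplyr + ggplot2): summarize, then plot
yearly <- tidy %>%
  group_by(year) %>%
  summarize(total = sum(count))

p <- ggplot(tidy, aes(x = year, y = count, color = site)) +
  geom_line()
```

Printing `p` draws the plot; `yearly` holds one total per year.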

    If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science.

    Installation

    # Install from CRAN
    install.packages("tidyverse")
    
    # Or the development version from GitHub
    # install.packages("devtools")
    devtools::install_github("tidyverse/tidyverse")
    

    Usage

    library(tidyverse) will load the core tidyverse packages:

    • ggplot2, for data visualisation.
    • dplyr, for data manipulation.
    • tidyr, for data tidying.
    • readr, for data import.
    • purrr, for functional programming.
    • tibble, for tibbles, a modern re-imagining of data frames.
    • stringr, for strings.
    • forcats, for factors.
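
The walkthrough that follows concentrates on dplyr and ggplot2. As a rough illustration of a few of the other core packages, here is a small sketch; the `flowers` data and all variable names are made up for the example.

```r
library(tidyverse)

# tibble: a data frame variant that prints compactly and keeps strings as strings
flowers <- tibble(name = c("iris setosa", "iris virginica", "iris setosa"))

# stringr: string helpers, all prefixed with str_
flowers <- flowers %>%
  mutate(species = str_remove(name, "^iris "),
         label   = str_to_title(name))

# forcats: factor helpers; fct_infreq() orders levels by how often they occur
species_fct <- fct_infreq(factor(flowers$species))
levels(species_fct)

# purrr: map_dbl() applies a function to each element and returns a numeric vector
col_means <- map_dbl(iris[1:4], mean)
```
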
    library(tidyverse)
    # Load the data
    library(datasets)
    # install.packages("gapminder") # run once if gapminder is not installed
    library(gapminder)
    # Filtering data with dplyr
    # filter() selects a subset of rows.
    iris %>% 
      filter(Species == "virginica") # keep rows that satisfy the condition
    iris %>% 
      filter(Species == "virginica", Sepal.Length > 6) # separate multiple conditions with commas
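    # Comma-separated conditions are combined with AND. For OR, one option is
    # %in% (a small sketch; the two species chosen here are only an example):
    iris %>% 
      filter(Species %in% c("setosa", "virginica"))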
    # Sorting
    # arrange() sorts observations; the default is ascending order.
    iris %>% 
      arrange(Sepal.Length)
    iris %>% 
      arrange(desc(Sepal.Length)) # descending order
    # Adding variables
    # mutate() updates an existing column of a data frame or adds a new one.
    iris %>% 
      mutate(Sepal.Length = Sepal.Length * 10) # convert the column from cm to mm
    iris %>% 
      mutate(SLMm = Sepal.Length * 10) # create a new column
    # Chaining the verbs into one pipeline:
    iris %>% 
      filter(Species == "virginica") %>% # level names are lowercase; "Virginica" would match no rows
      mutate(SLMm = Sepal.Length * 10) %>% 
      arrange(desc(SLMm))
    # Summarizing
    # summarize() collapses many values into a single data point.
    iris %>% 
      summarize(medianSL = median(Sepal.Length))
    ##   medianSL
    ## 1      5.8
    iris %>% 
      filter(Species == "virginica") %>% 
      summarize(medianSL=median(Sepal.Length))
    # Compute several summaries at once
    iris %>% 
      filter(Species == "virginica") %>% 
      summarize(medianSL = median(Sepal.Length),
                maxSL = max(Sepal.Length))
    # group_by() lets us summarize within specified groups rather than over the whole data frame
    iris %>% 
      group_by(Species) %>% 
      summarize(medianSL = median(Sepal.Length),
                maxSL = max(Sepal.Length))
    iris %>% 
      filter(Sepal.Length>6) %>% 
      group_by(Species) %>% 
      summarize(medianPL = median(Petal.Length), 
                maxPL = max(Petal.Length))
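    # Inside summarize(), n() returns the number of rows in each group. A sketch
    # of counting observations alongside a summary statistic (the column names
    # n and meanSL are arbitrary):
    iris %>% 
      group_by(Species) %>% 
      summarize(n = n(),
                meanSL = mean(Sepal.Length))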
    # ggplot2
    # Scatter plot
    # A scatter plot helps us understand the relationship between two variables; geom_point() draws one:
    iris_small <- iris %>% 
      filter(Sepal.Length > 5)
    
    ggplot(iris_small, aes(x = Petal.Length,
                           y = Petal.Width)) + 
      geom_point()
    # Color
    ggplot(iris_small, aes(x = Petal.Length,
                           y = Petal.Width,
                           color = Species)) + 
      geom_point()
    # Size
    ggplot(iris_small, aes(x = Petal.Length,
                           y = Petal.Width,
                           color = Species,
                           size = Sepal.Length)) + 
      geom_point()
    # Faceting
    ggplot(iris_small, aes(x = Petal.Length,
                           y = Petal.Width)) + 
      geom_point() + 
      facet_wrap(~Species)
    # Line plot
    by_year <- gapminder %>% 
      group_by(year) %>% 
      summarize(medianGdpPerCap = median(gdpPercap))
    
    ggplot(by_year, aes(x = year,
                        y = medianGdpPerCap)) +
      geom_line() + 
      expand_limits(y=0)
    # Bar chart
    by_species <- iris %>%  
      filter(Sepal.Length > 6) %>% 
      group_by(Species) %>% 
      summarize(medianPL=median(Petal.Length))
    
    ggplot(by_species, aes(x = Species, y=medianPL)) + 
      geom_col()
    # Histogram
    ggplot(iris_small, aes(x = Petal.Length)) + 
      geom_histogram()
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
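    # The message above goes away once a bin width is chosen explicitly. The
    # value 0.25 (cm) is an arbitrary choice for illustration, not a recommendation:
    ggplot(iris_small, aes(x = Petal.Length)) + 
      geom_histogram(binwidth = 0.25)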
    # Box plot
    ggplot(iris_small, aes(x=Species, y=Sepal.Length)) + 
      geom_boxplot()
    

    References:
    https://www.jianshu.com/p/f3c21a5ad10a
    https://tidyverse.tidyverse.org/
    https://zhuanlan.zhihu.com/p/88947457
