美文网首页
【R】《R for Data Science》学习笔记-数据探索

【R】《R for Data Science》学习笔记-数据探索

作者: 沈梦圆1993 | 来源:发表于2018-10-26 17:30 被阅读29次

本篇的学习目的是快速掌握数据探索的工具。观察数据、提出假设、快速检验,重复、重复、重复。提出越多假设,探索数据就更为深入。

data-science-explore

Data visualisation

R有好几个绘图系统,我们学习用ggplot2进行数据的可视化,因为它最为优雅且全能的。如果想要学它的绘图理论可以看 ”The Layered Grammar of Graphics“。

Question

  • 汽车发动机越大越耗油不?
  • 发动机的大小和油耗是啥关系?正相关?负相关?线性的?非线性的?
# install.packages("tidyverse")
library(tidyverse)
# -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
# √ ggplot2 2.2.1     √ purrr   0.2.4
# √ tibble  1.3.4     √ dplyr   0.7.4
# √ tidyr   0.7.2     √ stringr 1.2.0
# √ readr   1.1.1     √ forcats 0.2.0
# -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
#   x dplyr::filter() masks stats::filter()
# x dplyr::lag()    masks stats::lag()

tidyverse能够帮助我们安装和加载一系列数据处理的包,方便快捷。

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

发动机大小和油耗直接的散点图,可以知道发动机越大,油耗相对越小。

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

以上是绘图模板。

Aesthetic

  • 连续型变量只能aesthetic是连续型的属性 color,size,但 shape就不可以(它是非连续型变量);
  • stroke aesthetic 能够改变点的大小(geom_point);
  • aes(colour = displ < 5)则会根据displ判断是否小于5产生F/T逻辑值来分配颜色;

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

facet_grid(class ~ . )class在左边表示按行分面,如果在右边就是按列分面;

Geometric objects

  • ggplot2扩展Y叔的ggtree也在里面;
  • show.legend = FALSE可以把右边图标给去除;

Stat

  • In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
  • geom_colgeom_bar的区别

Position

  • position = "identity"
  • position = "fill"
  • position = "dodge"
  • geom_point(position = "jitter"): geom_jitter()
  • ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack

Coordinate systems

  • coord_flip() switches the x and y axes
  • coord_quickmap() sets the aspect ratio correctly for maps
  • coord_polar() uses polar coordinates

Summary

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
visualization-grammar-1 visualization-grammar-2 visualization-grammar-3

orkflow: basics

  • Cmd/Ctrl + ↑查找代码记录
  • Alt + Shift + K保存代码

Data transformation

library(nycflights13)
library(tidyverse)
  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

dplyr basics

  • Pick observations by their values (filter()).

  • Reorder the rows (arrange()).

  • Pick variables by their names (select()).

  • Create new variables with functions of existing variables (mutate()).

  • Collapse many values down to a single summary (summarise()).

  • group_by() changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

  • Every number you see is an approximation. Instead of relying on ==, use near().

  • Logical operators:

logical
x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights,year %in% c(2013,2014)
filter(flights,between(year,2013,2014)
filter(flights,year == 2013 | year == 2014)
  • order:arrange() ,desc()
  • select()choose variables:starts_with("abc"),ends_with("xyz"),contains("ijk"),matches("(.)\\1"),num_range("x", 1:3) matches x1, x2 and x3
  • rename(): rename variables

  • move time_hour/air_timeto the start of the data frame:select(flights, time_hour, air_time, everything())

  • comtains()不区分大小写(不知道怎么解决)

  • Add new variables with mutate()

  • only keep the new variables, use transmute()

  • log(), log2(), log10():对数是处理跨越多个数量级的数据的非常有用的转换。他们还将乘法关系转换为可加性,这是我们将在建模中回归的一个特征

  • Offsets: lead() and lag()(暂时没有体会)

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean();RcppRoll package

  • Ranking:min_rank()(排序的序号?),row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()

  • Grouped summaries with summarise(),group_by()

  • Combining multiple operations with the pipe:

  by_dest <- group_by(flights, dest)
  delay <- summarise(by_dest,
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
  delay <- filter(delay, count > 20, dest != "HNL")

以上等同于,下面的pipe:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

  # It looks like delays increase with distance up to ~750 miles 
  # and then decrease. Maybe as flights get longer there's more 
  # ability to make up delays in the air?
  ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
    geom_point(aes(size = count), alpha = 1/3) +
    geom_smooth(se = FALSE)
  #> `geom_smooth()` using method = 'loess'

我的微信公众号

如果实在有需要请给我发邮件:mengyuanshen@126.com
也可以关注我的公众号:沈梦圆(PandaBiotrainee)

相关文章

网友评论

      本文标题:【R】《R for Data Science》学习笔记-数据探索

      本文链接:https://www.haomeiwen.com/subject/zwjltqtx.html