【R for Data Science】(3) Data Tra

Author: Chanic | Published: 2019-07-25 20:00

Raw data can rarely be visualized as it is, so we first have to transform it: for example by creating new variables, renaming variables, or reordering the observations.

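1. Setup

This post uses two packages, nycflights13 and tidyverse. Most of the functions used below come from dplyr, which is loaded as part of the tidyverse: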

library(nycflights13)
library(tidyverse)

The flights data frame in nycflights13 holds all 336,776 flights that departed from New York City in 2013:

flights.png
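
As a minimal sketch of how the data can be inspected (assuming nycflights13 and dplyr are installed), the calls below show the dimensions of flights and one line per column with its type abbreviation:

```r
library(nycflights13)
library(dplyr)

# Number of rows (flights) and columns (variables)
dim(flights)

# One line per column: name, type abbreviation (<int>, <dbl>, <chr>, <dttm>, ...)
# and the first few values
glimpse(flights)
```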

Note: the abbreviation printed under each column name gives that column's type:

  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fct stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

dplyr basics

  • filter() : Pick observations by their values.
  • arrange() : Reorder the rows.
  • select() : Pick variables by their names.
  • mutate() : Create new variables with functions of existing variables.
  • summarise() : Collapse many values down to a single summary.
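
To give a feel for how the five verbs fit together, here is a rough sketch that applies each of them once to flights. The intermediate names (jan1, sorted, slim, gained) are illustrative only and not from the original post:

```r
library(nycflights13)
library(dplyr)

# filter(): keep only the flights on January 1st
jan1 <- filter(flights, month == 1, day == 1)

# arrange(): longest departure delays first
sorted <- arrange(jan1, desc(dep_delay))

# select(): keep a handful of columns
slim <- select(sorted, carrier, flight, dep_delay, arr_delay)

# mutate(): add a column computed from existing ones
gained <- mutate(slim, gain = dep_delay - arr_delay)

# summarise(): collapse everything to a single row
summarise(gained, mean_gain = mean(gain, na.rm = TRUE))
```

Every verb takes a data frame as its first argument and returns a new data frame, which is why the steps can be chained like this.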

2. filter(): pick observations (rows)

Select the rows that match specific values:

filter1.png
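
A minimal sketch of such a filter, assuming the packages from the setup section are installed. Conditions separated by commas are combined with "and":

```r
library(nycflights13)
library(dplyr)

# All flights that departed on January 1st
jan1 <- filter(flights, month == 1, day == 1)

# Wrapping the assignment in parentheses saves the result and prints it
(dec25 <- filter(flights, month == 12, day == 25))
```

filter() never modifies its input, so the result has to be assigned with <- if it is needed later.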
2.1 near() tests whether two numbers are equal within a small tolerance, which is safer than == for floating-point values
near().png
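
The difference between == and near() shows up as soon as floating-point rounding is involved. A short sketch:

```r
library(dplyr)

# Exact comparison fails because of floating-point rounding error
sqrt(2)^2 == 2        # FALSE
1 / 49 * 49 == 1      # FALSE

# near() compares within a small tolerance instead
near(sqrt(2)^2, 2)    # TRUE
near(1 / 49 * 49, 1)  # TRUE
```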
2.2 Logical operators

& is "and", | is "or", and ! is "not".
x %in% y: This will select every row where x is one of the values in y.
!(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

logical operators.png
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
filter2.png
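
One way to convince yourself that the paired calls above are equivalent is to run both versions and compare the results. This is only a sketch; the variable names are illustrative:

```r
library(nycflights13)
library(dplyr)

# month == 11 | month == 12 versus the %in% shorthand
nov_dec_or <- filter(flights, month == 11 | month == 12)
nov_dec_in <- filter(flights, month %in% c(11, 12))
nrow(nov_dec_or) == nrow(nov_dec_in)

# De Morgan's law: !(x | y) is the same as !x & !y
not_late_a <- filter(flights, !(arr_delay > 120 | dep_delay > 120))
not_late_b <- filter(flights, arr_delay <= 120, dep_delay <= 120)
nrow(not_late_a) == nrow(not_late_b)
```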

2.3 Missing values

NA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
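
A few base R expressions make the "contagious" behaviour concrete:

```r
# Almost any operation on an unknown value is itself unknown
NA > 5     # NA
10 == NA   # NA
NA + 10    # NA

# Even comparing two unknowns gives an unknown result
NA == NA   # NA

# So == cannot be used to detect missing values; is.na() can
x <- NA
is.na(x)   # TRUE
```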

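Use is.na() to test whether a value is NA. filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to FALSE or NA are dropped.
To keep the missing values as well, ask for them explicitly: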

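df <- tibble(x = c(1, NA, 3))  # small example data frame
filter(df, is.na(x) | x > 1)   # keeps the NA row and the row where x is 3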

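3. arrange(): reorder rows

Rows are sorted in ascending order by default; wrap a column in desc() to sort it in descending order. Missing values (NA) are always sorted to the end: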

desc.png
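
A short sketch of both behaviours. The small data frame df is made up for illustration:

```r
library(nycflights13)
library(dplyr)

# Sort by several columns; later columns break ties in earlier ones
arrange(flights, year, month, day)

# Longest departure delays first
arrange(flights, desc(dep_delay))

# NA ends up last in both directions
df <- tibble(x = c(5, 2, NA))
arrange(df, x)        # 2, 5, NA
arrange(df, desc(x))  # 5, 2, NA
```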

4. select(): pick variables (columns)

flights has 19 variables; select() lets you keep just the ones you need for the analysis that follows:

select.png
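
The basic forms of select() can be sketched as follows (the result names are illustrative):

```r
library(nycflights13)
library(dplyr)

# Pick columns by name
by_name <- select(flights, year, month, day)

# Pick every column from year to day, inclusive
by_range <- select(flights, year:day)

# Pick every column except those from year to day
without_date <- select(flights, -(year:day))
```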

There are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.
  • num_range("x", 1:3): matches x1, x2 and x3.
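
Applied to flights, some of these helpers could look like the sketch below. num_range() is left out because flights has no numbered columns such as x1, x2, x3:

```r
library(nycflights13)
library(dplyr)

# Columns whose names begin with "dep": dep_time, dep_delay
dep_cols <- select(flights, starts_with("dep"))

# Columns whose names end with "delay": dep_delay, arr_delay
delay_cols <- select(flights, ends_with("delay"))

# Columns whose names contain "time" anywhere
time_cols <- select(flights, contains("time"))

# Columns whose names contain a repeated character (regular expression)
repeated <- select(flights, matches("(.)\\1"))
```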

rename() renames variables and keeps all the other columns unchanged:

rename.png
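
A minimal sketch; the new name goes on the left of = and the existing name on the right:

```r
library(nycflights13)
library(dplyr)

# Rename tailnum to tail_num; all other columns are kept as they are
renamed <- rename(flights, tail_num = tailnum)
names(renamed)
```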

5. mutate(): add new variables

By default, mutate() appends the new columns at the end of the data frame:


mutate.png
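
The sketch below first narrows flights down to a few columns so that the new ones are easy to see, then adds two derived variables. The names flights_sml, gain and speed are illustrative:

```r
library(nycflights13)
library(dplyr)

# A narrower data frame makes the added columns easier to spot
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

# gain: minutes made up in the air; speed: miles per hour
with_new <- mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
with_new
```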

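6. summarise(): grouped summaries

summarise() collapses a whole data frame into a single row. It is most useful together with group_by(), which changes the unit of analysis from the whole data set to individual groups, so that you get one summary row per group: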

image.png
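
As a sketch, the mean departure delay can be computed once for the whole data set and once per day. na.rm = TRUE drops missing delays before averaging; without it the mean would be NA:

```r
library(nycflights13)
library(dplyr)

# One row for the entire data frame
overall <- summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# One row per (year, month, day) group
by_day <- group_by(flights, year, month, day)
daily <- summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
daily
```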
