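- data.table
https://github.com/Rdatatable/data.table
https://www.rstudio.com/resources/cheatsheets/
Zhihu | Zhang Jingxin (张敬信) | [New R Book] 2.7 A Powerful Data-Processing Tool: The data.table Package
Zhihu | Lao Junjun (老俊俊) | data.table: Process Data Efficiently and Quickly
"data.table's highly abstract syntax undoubtedly raises the learning cost, but its performance and its ability to handle large data make it well worth learning. Readers who want both the performance of data.table and the tidy syntax of the tidyverse can also use packages that bridge the two, such as dtplyr and tidyfst."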
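- dtplyr
https://github.com/tidyverse/dtplyr
https://dtplyr.tidyverse.org/
- tidyfst
https://github.com/hope-data-science/tidyfst
Zhihu | Huang Tianyuan (黄天元) | Efficient Data Frame Operations in R: tidyfst (note: there is a dedicated column)
Zhihu | Huang Xiaowei (黄小伟) | tidyfst: The Fastest R Package for Big-Data Processing, More Efficient than dplyr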
Creating
- data.table()
- as.data.table()
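A minimal sketch of the two constructors. The column names a and b and the sample values are assumptions chosen for illustration only.

```r
library(data.table)

# Build a data.table directly from column vectors (hypothetical columns a and b).
dt <- data.table(a = 1:3, b = c("x", "y", "z"))

# Convert an existing data.frame; as.data.table() returns a new object
# and leaves the original data.frame unchanged.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
dt2 <- as.data.table(df)

is.data.table(dt2)
is.data.table(df)
```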
Reading
- fread("file.csv")
- fread("file.csv", select = c("a", "b")): read only the specified columns
Writing
fwrite(dt, "file.csv")
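The following sketch round-trips a small table through fwrite() and fread(). It writes to a temporary file rather than "file.csv", and the columns a, b, and c are assumptions made for the example.

```r
library(data.table)

# Hypothetical example table; the column names are assumptions.
dt <- data.table(a = 1:3, b = c("x", "y", "z"), c = c(0.5, 1.5, 2.5))

# Write to a temporary file so the example leaves nothing behind.
path <- tempfile(fileext = ".csv")
fwrite(dt, path)

# Read every column back, then only columns a and b.
full <- fread(path)
partial <- fread(path, select = c("a", "b"))

names(full)
names(partial)

unlink(path)
```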
Row operations
dt[1:2,]
dt[a > 5,]
dt[, c := 1:.N, by = b]
dt[, c := shift(a, 1), by = b]
dt[, c := shift(a, 1, type = "lead"), by = b]
Operators for filtering rows: >, <, >=, <=, is.na(), !is.na(), %in%, |, &, !, %like%, %between%
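As an illustration of the row operations above, this sketch filters a small table with several of the listed operators and then adds a per-group row index, lag, and lead. The data and the column names idx, prev, and nxt are assumptions.

```r
library(data.table)

# Hypothetical data: a numeric column with one NA and a grouping column.
dt <- data.table(a = c(2, 7, NA, 9, 6), b = c("x", "x", "y", "y", "y"))

over5   <- dt[a > 5]                # rows where the comparison is NA are dropped
in_x    <- dt[!is.na(a) & b %in% "x"]
mid     <- dt[a %between% c(6, 9)]  # bounds are inclusive by default
starts  <- dt[b %like% "^y"]        # regular-expression match

# Per-group row number, previous value, and next value of a.
dt[, idx := 1:.N, by = b]
dt[, prev := shift(a, 1), by = b]
dt[, nxt := shift(a, 1, type = "lead"), by = b]
```

Note that shift() fills the positions with no neighbor inside the group with NA.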
Column operations
dt[, c(2)]
dt[, .(b, c)]
dt[, .(x = sum(a))]
dt[, c := 1+2]
dt[, `:=`(c = 1, d = 2)]
dt[, c := NULL]
dt[, b := as.integer(b)]
dt[, lapply(.SD, mean), .SDcols = c("a", "b")]
cols <- c("a")
dt[, paste0(cols, "_m") := lapply(.SD, mean), .SDcols = cols]
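A short sketch of the .SD and := patterns above on a small table. The data and the extra grouping column g are assumptions; the point is that .SDcols must select exactly the columns that the left-hand side of := names.

```r
library(data.table)

# Hypothetical table with two numeric columns and one character column.
dt <- data.table(a = c(1, 2, 3), b = c(10, 20, 30), g = c("u", "u", "v"))

# Mean of the selected columns, returned as a one-row data.table.
means <- dt[, lapply(.SD, mean), .SDcols = c("a", "b")]

# Add one "_m" column per selected column, each holding that column's mean.
cols <- c("a", "b")
dt[, paste0(cols, "_m") := lapply(.SD, mean), .SDcols = cols]

# Add two columns at once with the functional form of :=, then drop one.
dt[, `:=`(c = 1, d = 2)]
dt[, d := NULL]

names(dt)
```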
Grouping
dt[, j, by = .(a)]
dt[, j, keyby = .(a)]
dt[, .(c = sum(b)), by = a]
dt[, c := sum(b), by = a]
dt[, .SD[1], by = a]
dt[, .SD[.N], by = a]
dt[...][...]
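The grouping forms above can be sketched as follows. The data are assumptions, chosen so that the group that appears first ("y") is not first alphabetically, which makes the difference between by and keyby visible.

```r
library(data.table)

# Hypothetical data: group "y" appears before group "x".
dt <- data.table(a = c("y", "x", "y", "x", "y"), b = c(1, 2, 3, 4, 5))

totals <- dt[, .(c = sum(b)), by = a]     # groups in order of first appearance
sorted <- dt[, .(c = sum(b)), keyby = a]  # groups sorted, result keyed by a

first_rows <- dt[, .SD[1], by = a]        # first row of each group
last_rows  <- dt[, .SD[.N], by = a]       # last row of each group

# Chaining: aggregate, then filter the aggregated result.
big <- dt[, .(c = sum(b)), by = a][c > 6]
```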
Functions
setorder(dt, a, -b)
- unique(dt, by = c("a", "b")): remove duplicate rows
- uniqueN(dt, by = c("a", "b")): count unique rows
- setnames(dt, c("a", "b"), c("x", "y")): rename columns
setkey(dt, a, b)
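In data.table, functions prefixed with set and the := operator modify data in place, so no assignment with <- is needed. For example, setDT(df) has the same effect as df <- as.data.table(df), but converts the object by reference without copying it.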
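A minimal sketch of modification by reference: none of the calls below assigns its result, yet df changes each time. The data and the new names x, y, and z are assumptions.

```r
library(data.table)

# Start from a plain data.frame (hypothetical data).
df <- data.frame(a = c(3, 1, 2), b = c("p", "q", "r"), stringsAsFactors = FALSE)

setDT(df)                                  # df becomes a data.table in place
setnames(df, c("a", "b"), c("x", "y"))     # rename in place
setorder(df, x)                            # sort in place
df[, z := x * 2]                           # add a column in place

df
```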
Joining
dt_a[dt_b, on = .(b = y)]
dt_a[dt_b, on = .(b = y, c > z)]
rbind(dt_a, dt_b)
cbind(dt_a, dt_b)
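To illustrate the equi-join form, the sketch below looks up each row of dt_b in dt_a. The tables and their contents are assumptions. With dt_a[dt_b, on = ...], every row of dt_b is kept and unmatched rows get NA in dt_a's columns; nomatch = NULL drops them instead.

```r
library(data.table)

# Hypothetical tables: dt_a is keyed by b, dt_b by y.
dt_a <- data.table(a = c(1, 2, 3), b = c("k1", "k2", "k3"))
dt_b <- data.table(y = c("k2", "k3", "k4"), z = c(20, 30, 40))

# One result row per row of dt_b; "k4" has no match, so a is NA there.
joined <- dt_a[dt_b, on = .(b = y)]

# Keep only the rows that match in both tables.
inner <- dt_a[dt_b, on = .(b = y), nomatch = NULL]

# Stack rows and bind columns.
stacked <- rbind(dt_a, dt_a)
side_by_side <- cbind(dt_a, dt_b)
```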
Reshaping
- dcast(): long to wide
- melt(): wide to long
dcast(dt,
id ~ y,
value.var = c("a", "b"))
melt(dt,
id.vars = c("id"),
measure.vars = patterns("^a", "^b"),
variable.name = "y",
value.name = c("a", "b"))
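The two calls above can be combined into a round trip, sketched here on an assumed wide table whose columns a_1, a_2, b_1, and b_2 are hypothetical. melt() with patterns() gathers the two column families into value columns a and b, and dcast() spreads them back out.

```r
library(data.table)

# Hypothetical wide table: two measurements (a, b) at two time points.
wide <- data.table(id = 1:2,
                   a_1 = c(1, 2), a_2 = c(3, 4),
                   b_1 = c(5, 6), b_2 = c(7, 8))

# Wide to long: y holds the position of each column within its pattern group.
long <- melt(wide,
             id.vars = c("id"),
             measure.vars = patterns("^a", "^b"),
             variable.name = "y",
             value.name = c("a", "b"))

# Long to wide: with several value.var columns, names are built as value_y.
wide2 <- dcast(long,
               id ~ y,
               value.var = c("a", "b"))
```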