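data.table is one of more than 13,000 third-party packages for R. It provides a high-performance version of base R's data.frame. As of November 2018, data.table was the 4th largest R-related tag on Stack Overflow, ranked 10th among R packages by GitHub stars, and was used by about 650 CRAN and Bioconductor packages.
Below is a performance comparison of popular data frame tools; the results come from an automated run of the benchmark code: h2oai.github.io/db-benchmark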
[Figure: benchmark results; image unavailable]
data.table syntax
[Figure: data.table syntax; image unavailable]
Equivalent data.frame syntax
[Figure: equivalent data.frame syntax; image unavailable]
These operations can be chained, with each pair of brackets working on the result of the previous one:
DT[...][...]
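As a rough illustration of the general form DT[i, j, by] (subset rows with i, compute j, grouped by by) and of chaining, here is a minimal sketch. The toy table and the names res and top are made up for this example and do not come from the original post.

```r
library(data.table)

# Hypothetical toy table, invented for this illustration
DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 y = c(1, 3, 6),
                 v = 1:9)

# i, j and by together: keep rows with v > 1, then sum v within each x
res <- DT[v > 1, .(sv = sum(v)), by = x]

# Chaining: aggregate, then sort the result, then take its first row
top <- DT[, .(sv = sum(v)), by = x][order(-sv)][1]

print(res)
print(top)
```

Each bracket in the chain receives an ordinary data.table, so the second and third brackets can refer to sv, a column that only exists in the aggregated result.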
See also the comparisons of data.table and dplyr on Stack Overflow and Quora.
> require(data.table)
> example(data.table)
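# basic row subsets
DT[2] # 2nd row
DT[2:3] # 2nd and 3rd rows
w=2:3; DT[w] # same
DT[order(x)] # no need for the DT$ prefix on x
DT[order(x), ] # same; the ',' is optional
DT[y>2] # all rows where DT$y > 2
DT[y>2 & v>5] # compound logical expressions
DT[!2:4] # all rows other than 2:4
DT[-(2:4)] # same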
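# select or compute columns
DT[, v] # column v (as a vector)
DT[, list(v)] # column v (as a data.table)
DT[, .(v)] # same; .() is an alias for list()
DT[, sum(v)] # sum of column v, returned as a vector
DT[, .(sum(v))] # same, but returned as a data.table
DT[, .(sv=sum(v))] # same, with the result column named 'sv'
DT[, .(v, v*2)] # returns a two-column data.table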
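# subset rows and compute columns
DT[2:3, sum(v)] # sum of v over rows 2 and 3
DT[2:3, .(sum(v))] # same, but returned as a data.table
DT[2:3, .(sv=sum(v))] # same, with the result column named 'sv'
DT[2:5, cat(v, "\n")] # j evaluated only for its side effect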
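# select columns the data.frame way
DT[, 2] # 2nd column, always returned as a data.table
colNum = 2
DT[, ..colNum] # same as DT[, 2]; the .. prefix looks up colNum one level up, in the calling scope
DT[["v"]] # same as DT[, v] but with lower overhead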
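# grouping operations - j and by
DT[, sum(v), by=x] # groups are kept in order of first appearance
DT[, sum(v), keyby=x] # result is sorted by group
DT[, sum(v), by=x][order(x)] # same result, written as a chained expression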
# fast row subsets
DT["a", on="x"] # same as x == "a" but uses binary search (fast)
DT["a", on=.(x)] # same
DT[.("a"), on="x"] # same
DT[x=="a"] # same, optimized internally
DT[x!="b" | y!=3] # not yet optimized
DT[.("b", 3), on=c("x", "y")] # same as DT[x=="b" & y==3]
DT[.("b", 3), on=.(x, y)] # same
DT[.("b", 1:2), on=c("x", "y")] # unmatched values return NA
DT[.("b", 1:2), on=.(x, y), nomatch=0] # unmatched rows are not returned
DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, previous row rolls forward
DT[.("b", 1:2), on=.(x, y), roll=-Inf] # nocb, next row rolls backward
DT["b", sum(v*y), on="x"] # same as DT[x=="b", sum(v*y)]
# all together now
DT[x!="a", sum(v), by=x] # sum(v) by x, for rows where x != "a"
DT[!"a", sum(v), by=.EACHI, on="x"] # same, but written as a subset-as-join (the subset is expressed as a join on x)
DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()
# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X
DT[X, on="x"] # right join
X[DT, on="x"] # left join
DT[X, on="x", nomatch=0] # inner join
DT[!X, on="x"] # not join
DT[X, on=c(y="v")] # join DT$y to X$v
DT[X, on="y==v"] # same
DT[X, on=.(y<=foo)] # non-equi join
DT[X, on="y<=foo"] # same
DT[X, on=c("y<=foo")] # same
DT[X, on=.(y>=foo)] # non-equi join
DT[X, on=.(x, y<=foo)] # non-equi join
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # select x's join columns as well
DT[X, on="x", mult="first"] # first row of each group
DT[X, on="x", mult="last"] # last row of each group
DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i.v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # non-equi join with by=.EACHI
# setting keys
kDT = copy(DT) # copy of the data.table
setkey(kDT,x) # set a one-column key
setkeyv(kDT,"x") # same (the v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v) # same
haskey(kDT) # TRUE
key(kDT) # "x"
# fast keyed subsets
kDT["a"] # subset-as-join on *key* column 'x'
kDT["a", on="x"] # same, being explicit using 'on='
# all together
kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a"
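# multi-column key
setkey(kDT,x,y) # 2-column key
setkeyv(kDT,c("x","y")) # same
# fast keyed subsets on a multi-column key
kDT["a"] # join to the 1st key column only
kDT["a", on="x"] # on= is optional here but recommended
kDT[.("a")] # same
kDT[list("a")] # same
kDT[.("a", 3)] # join to both key columns
kDT[.("a", 3:6)] # join to 4 rows (unmatched values give NA)
kDT[.("a", 3:6), nomatch=0] # drop the unmatched rows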
kDT[.("a", 3:6), roll=TRUE] # locf rolling join
kDT[.("a", 3:6), roll=Inf] # same
kDT[.("a", 3:6), roll=-Inf] # nocb rolling join
kDT[!.("a")] # not join
kDT[!"a"] # same
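# special symbols, see ?"special-symbols"
DT[.N] # last row
DT[, .N] # total number of rows
DT[, .N, by=x] # number of rows in each group
DT[, .SD, .SDcols=x:y] # select columns 'x' through 'y'
DT[, .SD[1]] # first row, same as DT[1,]
DT[, .SD[1], by=x] # first row of each group
DT[, c(.N, lapply(.SD, sum)), by=x] # group size and column sums by group
DT[, .I[1], by=x] # row number in DT of the first row of each group
DT[, grp := .GRP, by=x] # add a group counter column
X[, DT[.BY, y, on="x"], by=x] # join within each group, using less memory
# add, update and delete by reference, see ?assign
print(DT[, z:=42L]) # add a new column by reference
print(DT[, z:=NULL]) # remove a column by reference
print(DT["a", v:=42L, on="x"]) # sub-assign to the existing column v by reference
print(DT["b", v2:=84L, on="x"]) # sub-assign to a new column by reference (other rows get NA)
DT[, m:=mean(v), by=x][] # add a new column by reference, by group
# a trailing [] is a shortcut for print()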
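# advanced usage
# The a and b columns used below do not exist in DT as modified so far.
# Assumption: DT is redefined here, as in the ?data.table examples.
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)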
DT[, sum(v), by=.(y%%2)] # expressions in by
DT[, sum(v), by=.(bool = y%%2)] # same, with a name given to the expression
DT[, .SD[2], by=x] # second row of each group
DT[, tail(.SD,2), by=x] # last two rows of each group
DT[, lapply(.SD, sum), by=x] # sum of every column for each group
DT[, .SD[which.min(v)], by=x] # nested query by group
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)] # group by two expressions
DT[, .(a = .(a), b = .(b)), by=x] # list columns
DT[, .(seq = min(a):max(b)), by=x] # j is not limited to just aggregations
DT[, sum(v), by=x][V1<20] # compound query
DT[, sum(v), by=x][order(-V1)] # ordering the results
DT[, c(.N, lapply(.SD,sum)), by=x] # group size and sums by group
DT[, {tmp <- mean(y); # anonymous lambda in 'j'; j any valid
.(a = a-tmp, b = b-tmp) # expression where every element
}, by=x] # becomes a column in result
pdf("new.pdf")
DT[, plot(a,b), by=x] # can also plot in 'j'
dev.off()
# get max(y) and min of a set of columns for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
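The roll=Inf (locf) and roll=-Inf (nocb) lines in the listing above are terse, so here is a small sketch of what a rolling join does compared with an exact join. The tables quotes and lookup, their columns, and their values are hypothetical and were made up for this illustration.

```r
library(data.table)

# Hypothetical data: prices observed at irregular times
quotes <- data.table(time = c(1L, 5L, 10L), price = c(100, 101, 99))
# Hypothetical times at which a price is wanted
lookup <- data.table(time = c(0L, 3L, 5L, 12L))

# Exact join: only time 5 has a matching quote, the rest get NA
exact <- quotes[lookup, on = "time"]

# roll = Inf (locf): take the most recent earlier quote;
# time 0 has no earlier quote, so it stays NA
locf <- quotes[lookup, on = "time", roll = Inf]

# roll = -Inf (nocb): take the next later quote;
# time 12 has no later quote, so it stays NA
nocb <- quotes[lookup, on = "time", roll = -Inf]

print(exact)
print(locf)
print(nocb)
```

In all three results there is one row per row of lookup, in lookup's order; only the way a missing exact match is filled differs.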
Other features include:
- fast and friendly delimited file reader: ?fread. It accepts system commands directly (such as grep and gunzip), has other convenience features for small data, and has been parallelized on CRAN since May 2018.
- fast and parallelized file writer: ?fwrite, on CRAN since Nov 2016.
- parallelized row subsets
- fast aggregation of large data; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
- fast add/update/delete columns by reference by group using no copies at all
- fast ordered joins; e.g. rolling forwards, backwards, nearest and limited staleness
- fast overlapping range joins; similar to the findOverlaps function from the IRanges/GenomicRanges Bioconductor packages, but not limited to genomic (integer) intervals
- fast non-equi (or conditional) joins, i.e., joins using the operators >, >=, <, <= as well, available from v1.9.8+
- a fast primary ordered index; e.g. setkey(DT,col1,col2)
- automatic secondary indexing; e.g. DT[col==val,] and DT[col %in% vals,]
- fast and memory efficient combined join and group by; by=.EACHI
- fast reshape2 methods (dcast and melt) without needing reshape2 and its dependency chain installed or loaded
- group summary results may be many rows (e.g. first and last row by group) and each cell value may itself be a vector/object/function (e.g. unique ids by group as a list column of varying length vectors; this is pretty printed with commas)
- special symbols built-in for convenience and raw speed by avoiding the overhead of function calls: .N, .SD, .I, .GRP and .BY
- any R function from any R package can be used in queries, not just the subset of functions made available by a database backend
- has no dependencies at all other than base R itself, for simpler production/maintenance
- the R dependency is as old as possible for as long as possible and we test against that version; e.g., v1.9.8 released on 25-Nov-2016 bumped the dependency up from 4.5 year old R 2.14.0 to 3 year old R 3.0.0.
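To make the "by reference ... using no copies" item in the list above more concrete, the following sketch contrasts plain assignment with copy() when a column is added with :=. The table orders and every other name in it are hypothetical, chosen only for this example.

```r
library(data.table)

# Hypothetical toy table, invented for this sketch
orders <- data.table(customer = c("a", "a", "b"), amount = c(10, 20, 5))

alias    <- orders        # plain assignment: a second name for the same table
snapshot <- copy(orders)  # copy(): an independent table

# Add a per-customer total by reference, by group
orders[, total := sum(amount), by = customer]

totals           <- orders$total
alias_has_col    <- "total" %in% names(alias)     # the alias sees the new column
snapshot_has_col <- "total" %in% names(snapshot)  # the copy does not

# Delete the column again, also by reference
orders[, total := NULL]

print(alias_has_col)
print(snapshot_has_col)
```

This is why the key examples earlier start from kDT = copy(DT): without copy(), changes made through one name would show up under the other.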