[r<-package] data.table wiki

Author: 王诗翔 | Published: 2019-01-03 18:05

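data.table is one of more than 13,000 third-party packages for R. It provides a high-performance version of base R's data.frame. As of November 2018, data.table was the 4th largest R tag on Stack Overflow, ranked 10th among R packages by GitHub stars, and was used by about 650 R or Bioconductor packages.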

A performance comparison of popular data-frame manipulation packages, produced by an automatically run benchmark, is available at h2oai.github.io/db-benchmark

data.table syntax


Equivalent data.frame syntax

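As a rough illustration, data.table's documented general form is `DT[i, j, by]`: take DT, subset rows with `i`, then compute `j`, grouped by `by`. The sketch below pairs a few such calls with base data.frame equivalents. The toy table and the names `DT` and `DF` are assumptions made for this example only.

```r
library(data.table)

# Toy data, assumed for illustration only
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DF <- as.data.frame(DT)

# i: subset rows
dt_rows <- DT[y > 2]
df_rows <- DF[DF$y > 2, ]

# j: compute on columns
dt_sum <- DT[, sum(v)]
df_sum <- sum(DF$v)

# by: compute j within groups, after subsetting with i
dt_grp <- DT[y > 2, .(sv = sum(v)), by = x]
df_grp <- aggregate(v ~ x, data = DF[DF$y > 2, ], FUN = sum)
```

Note that the data.table result keeps groups in order of first appearance, while `aggregate` returns them sorted.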

These operations can be chained:
DT[...][...]
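A minimal sketch of such a chain, using a toy table assumed for illustration: each `[...]` operates on the result of the previous one, so no intermediate variables are needed.

```r
library(data.table)

# Toy data, assumed for illustration only
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), v = 1:9)

# Step 1 aggregates by group, step 2 filters the aggregate, step 3 sorts it
res <- DT[, .(sv = sum(v)), by = x][sv > 6][order(-sv)]
```

Naming the aggregate (`sv`) in the first step makes it easy to refer to in the later steps.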
See also the comparisons of data.table with dplyr on Stack Overflow and Quora.

> require(data.table)
> example(data.table)

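# Basic row subsetting
DT[2]                                       # 2nd row
DT[2:3]                                     # 2nd and 3rd rows
w=2:3; DT[w]                                # same
DT[order(x)]                                # no need for the DT$ prefix on x
DT[order(x), ]                              # same, the ',' is optional
DT[y>2]                                     # all rows where DT$y > 2
DT[y>2 & v>5]                               # compound logical expression
DT[!2:4]                                    # all rows other than 2:4
DT[-(2:4)]                                  # same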

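# Select or compute columns
DT[, v]                                     # column v (as a vector)
DT[, list(v)]                               # column v (as a data.table)
DT[, .(v)]                                  # same; .() is an alias for list()
DT[, sum(v)]                                # sum of column v, returned as a vector
DT[, .(sum(v))]                             # same, but returned as a data.table
DT[, .(sv=sum(v))]                          # same, with the column named 'sv'
DT[, .(v, v*2)]                             # a two-column data.table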

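# Subset rows and compute on columns
DT[2:3, sum(v)]                             # sum of v over rows 2 and 3
DT[2:3, .(sum(v))]                          # same, but returned as a data.table
DT[2:3, .(sv=sum(v))]                       # same, with the column named 'sv'
DT[2:5, cat(v, "\n")]                       # j used just for its side effect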

# Select columns the data.frame way
DT[, 2]                                     # 2nd column, always returns a data.table
colNum = 2
DT[, ..colNum]                              # same as DT[, 2]; .. means look one level up
DT[["v"]]                                   # same as DT[, v] but with lower overhead

# Grouping operations - j and by
DT[, sum(v), by=x]                          # groups kept in order of first appearance
DT[, sum(v), keyby=x]                       # result sorted by group
DT[, sum(v), by=x][order(x)]                # same result, written as a chain

# Fast row subsetting
DT["a", on="x"]                             # same as x == "a" but uses binary search (faster)
DT["a", on=.(x)]                            # same
DT[.("a"), on="x"]                          # same
DT[x=="a"]                                  # same; optimised internally
DT[x!="b" | y!=3]                           # not yet optimised
DT[.("b", 3), on=c("x", "y")]               # equivalent to DT[x=="b" & y==3]
DT[.("b", 3), on=.(x, y)]                   # same
DT[.("b", 1:2), on=c("x", "y")]             # unmatched rows return NA
DT[.("b", 1:2), on=.(x, y), nomatch=0]      # unmatched rows are dropped
DT[.("b", 1:2), on=c("x", "y"), roll=Inf]   # locf, previous row rolls forward
DT[.("b", 1:2), on=.(x, y), roll=-Inf]      # nocb, next row rolls backward
DT["b", sum(v*y), on="x"]                   # equivalent to DT[x=="b", sum(v*y)]

# Putting it all together
DT[x!="a", sum(v), by=x]                    # sum(v) by x, for each x != "a"
DT[!"a", sum(v), by=.EACHI, on="x"]         # same, but written as a subset-as-join
DT[c("b","c"), sum(v), by=.EACHI, on="x"]   # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)]  # same, using on=.()

# Joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X

DT[X, on="x"]                               # right join
X[DT, on="x"]                               # left join
DT[X, on="x", nomatch=0]                    # inner join
DT[!X, on="x"]                              # not join
DT[X, on=c(y="v")]                          # join DT$y to X$v
DT[X, on="y==v"]                            # same

DT[X, on=.(y<=foo)]                         # non-equi join
DT[X, on="y<=foo"]                          # same
DT[X, on=c("y<=foo")]                       # same
DT[X, on=.(y>=foo)]                         # non-equi join
DT[X, on=.(x, y<=foo)]                      # non-equi join
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)]        # select x's join columns as well

DT[X, on="x", mult="first"]                 # first row of each group
DT[X, on="x", mult="last"]                  # last row of each group
DT[X, sum(v), by=.EACHI, on="x"]            # join and eval j for each row in i
DT[X, sum(v)*foo, by=.EACHI, on="x"]        # join inherited scope
DT[X, sum(v)*i.v, by=.EACHI, on="x"]        # 'i.v' refers to X's v column
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # non-equi join with by=.EACHI

# Setting keys
kDT = copy(DT)                              # copy of the data.table
setkey(kDT,x)                               # set a 1-column key
setkeyv(kDT,"x")                            # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v)                              # same
haskey(kDT)                                 # TRUE
key(kDT)                                    # "x"

# Fast keyed subsets
kDT["a"]                                    # subset-as-join on *key* column 'x'
kDT["a", on="x"]                            # same, being explicit using 'on='

# Putting it together
kDT[!"a", sum(v), by=.EACHI]                # get sum(v) for each i != "a"

# Setting multi-column keys
setkey(kDT,x,y)                             # 2-column key
setkeyv(kDT,c("x","y"))                     # same

# Subsetting on a multi-column key
kDT["a"]                                    # matches on the 1st key column
kDT["a", on="x"]                            # on= is optional but recommended
kDT[.("a")]                                 # same
kDT[list("a")]                              # same
kDT[.("a", 3)]                              # matches on both key columns
kDT[.("a", 3:6)]                            # join to 4 rows
kDT[.("a", 3:6), nomatch=0]                 # drop unmatched rows
kDT[.("a", 3:6), roll=TRUE]                 # locf rolling join
kDT[.("a", 3:6), roll=Inf]                  # same
kDT[.("a", 3:6), roll=-Inf]                 # nocb rolling join
kDT[!.("a")]                                # not join
kDT[!"a"]                                   # same

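# Special symbols, see ?"special-symbols"
DT[.N]                                      # last row
DT[, .N]                                    # number of rows
DT[, .N, by=x]                              # number of rows in each group
DT[, .SD, .SDcols=x:y]                      # select columns 'x' through 'y'
DT[, .SD[1]]                                # first row, equivalent to DT[1,]
DT[, .SD[1], by=x]                          # first row of each group
DT[, c(.N, lapply(.SD, sum)), by=x]         # row count and column sums by group
DT[, .I[1], by=x]                           # row number in DT of each group's first row
DT[, grp := .GRP, by=x]                     # add a group counter column
X[, DT[.BY, y, on="x"], by=x]               # join within each group, using less memory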

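# Add, update and delete by reference, see ?assign
print(DT[, z:=42L])                         # add a new column by reference
print(DT[, z:=NULL])                        # remove a column by reference
print(DT["a", v:=42L, on="x"])              # sub-assign to the existing column v
print(DT["b", v2:=84L, on="x"])             # sub-assign to a new column (other rows are NA)

DT[, m:=mean(v), by=x][]                    # add a new column by reference, by group
                                            # the trailing [] is a shortcut for print()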
                                            
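# Advanced usage
# DT is redefined here, as in the ?data.table examples, so that the columns a and b used below exist
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)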
DT[, sum(v), by=.(y%%2)]                    # expressions in by
DT[, sum(v), by=.(bool = y%%2)]             # same, with a name for the expression
DT[, .SD[2], by=x]                          # second row of each group
DT[, tail(.SD,2), by=x]                     # last two rows of each group
DT[, lapply(.SD, sum), by=x]                # sum of every column within each group
DT[, .SD[which.min(v)], by=x]               # nested query by group

DT[, list(MySum=sum(v),
          MyMin=min(v),
          MyMax=max(v)),
    by=.(x, y%%2)]                          # group by two expressions

DT[, .(a = .(a), b = .(b)), by=x]           # list columns
DT[, .(seq = min(a):max(b)), by=x]          # j is not limited to just aggregations
DT[, sum(v), by=x][V1<20]                   # compound query
DT[, sum(v), by=x][order(-V1)]              # ordering the results
DT[, c(.N, lapply(.SD,sum)), by=x]          # group size and sums by group
DT[, {tmp <- mean(y);                       # anonymous lambda in 'j'; j any valid
      .(a = a-tmp, b = b-tmp)               #   expression where every element
      }, by=x]                              #   becomes a column in result

pdf("new.pdf")
DT[, plot(a,b), by=x]                       # can also plot in 'j'
dev.off()

# get max(y) and min of a set of columns for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]

Other features include:

  • fast and friendly delimited file reader: ?fread. It accepts system commands directly (such as grep and gunzip), has other convenience features for small data, and has been parallelized on CRAN since May 2018.
  • fast and parallelized file writer: ?fwrite, on CRAN since Nov 2016.
  • parallelized row subsets
  • fast aggregation of large data; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
  • fast add/update/delete columns by reference by group using no copies at all
  • fast ordered joins; e.g. rolling forwards, backwards, nearest and limited staleness
  • fast overlapping range joins; similar to findOverlaps function from IRanges/GenomicRanges Bioconductor packages, but not limited to genomic (integer) intervals.
  • fast non-equi (or conditional) joins, i.e., joins using operators >, >=, <, <= as well, available from v1.9.8+
  • a fast primary ordered index; e.g. setkey(DT,col1,col2)
  • automatic secondary indexing; e.g. DT[col==val,] and DT[col %in% vals,]
  • fast and memory efficient combined join and group by; by=.EACHI
  • fast reshape2 methods (dcast and melt) without needing reshape2 and its dependency chain installed or loaded
  • group summary results may be many rows (e.g. first and last row by group) and each cell value may itself be a vector/object/function (e.g. unique ids by group as a list column of varying length vectors - this is pretty printed with commas)
  • special symbols built-in for convenience and raw speed by avoiding the overhead of function calls: .N, .SD, .I, .GRP and .BY
  • any R function from any R package can be used in queries not just the subset of functions made available by a database backend
  • has no dependencies at all other than base R itself, for simpler production/maintenance
  • the R dependency is as old as possible for as long as possible and we test against that version; e.g., v1.9.8 released on 25-Nov-2016 bumped the dependency up from 4.5 year old R 2.14.0 to 3 year old R 3.0.0.
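A few of the features listed above can be sketched together: a `fwrite`/`fread` round trip, reshaping with `melt` and `dcast`, and adding a column by reference by group. The toy table, the temporary file, and the column names `measure`, `val` and `grp_mean` are assumptions chosen for illustration.

```r
library(data.table)

# Toy data, assumed for illustration only
DT <- data.table(id = 1:3, a = c(0.5, 1.5, 2.5), b = c(1.25, 2.25, 3.25))

# fwrite/fread round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
fwrite(DT, path)
back <- fread(path)
unlink(path)

# melt: wide -> long; dcast: long -> wide again
long <- melt(DT, id.vars = "id", variable.name = "measure", value.name = "val")
wide <- dcast(long, id ~ measure, value.var = "val")

# Add a column by reference, by group; long is modified in place
long[, grp_mean := mean(val), by = measure]
```

Because `:=` modifies `long` in place, no assignment back to `long` is needed.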
