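❀ Pipe operator: %>%
Passes the object on the left as the first argument of the function on the right.
x %>% f(y) is equivalent to f(x, y)
y %>% f(x, z) is equivalent to f(y, x, z)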
library(dplyr) # provides %>% and the functions used below
iris %>% head(5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
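To make the equivalence concrete, the minimal sketch below compares a piped call with the corresponding nested call. It assumes dplyr is installed (dplyr re-exports `%>%` from magrittr); the variable names are illustrative only.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

# The left-hand side becomes the first argument of the right-hand call
piped  <- iris %>% head(5)
nested <- head(iris, 5)

# Pipes chain left to right, so the steps read in the order they run
chained <- c(1, 4, 9) %>% sqrt() %>% sum()
unpiped <- sum(sqrt(c(1, 4, 9)))
```

Both pairs should hold the same values; the piped form simply avoids reading the calls inside out.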
❀ Joining, merging, and filtering: left_join/right_join/inner_join/full_join/semi_join/anti_join
data1 <- data.frame(ID = 1:2, # Create first example data frame
X1 = c("a1", "a2"),
stringsAsFactors = FALSE)
data2 <- data.frame(ID = 2:3, # Create second example data frame
X2 = c("b1", "b2"),
stringsAsFactors = FALSE)
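1. left_join: left join. Keeps every row of the left table and matches on the key; where the right table has no matching row, the new columns are filled with NA. In other words, matching records from table b are added to table a.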
left_join(data1, data2, by = "ID") # Apply left_join dplyr function
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
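2. right_join: right join. Keeps every row of the right table and matches on the key; matching records from table a are added to table b.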
right_join(data1, data2, by = "ID") # Apply right_join dplyr function
## ID X1 X2
## 1 2 a2 b1
## 2 3 <NA> b2
3. inner_join: inner join. Keeps only the records whose key appears in both tables.
inner_join(data1, data2, by = "ID") # Apply inner_join dplyr function
## ID X1 X2
## 1 2 a2 b1
4. full_join: full join. Merges the data and keeps all records from both tables.
full_join(data1, data2, by = "ID") # Apply full_join dplyr function
## ID X1 X2
## 1 1 a1 <NA>
## 2 2 a2 b1
## 3 3 <NA> b2
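5. semi_join: filtering join. Uses the right table as a filter and keeps only the rows of the left table that have a match in it; no columns from the right table are added.
semi_join(data1, data2, by = "ID") # Apply semi_join dplyr function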
## ID X1
## 1 2 a2
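6. anti_join: filtering join. Uses the right table as a filter and keeps only the rows of the left table that have no match in it.
anti_join(data1, data2, by = "ID") # Apply anti_join dplyr function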
## ID X1
## 1 1 a1
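The two filtering joins are complementary: every row of the left table is returned by exactly one of them. The sketch below checks this on the same example data; the names `matched`, `unmatched`, and `recombined` are introduced here for illustration only.

```r
library(dplyr)

data1 <- data.frame(ID = 1:2, X1 = c("a1", "a2"), stringsAsFactors = FALSE)
data2 <- data.frame(ID = 2:3, X2 = c("b1", "b2"), stringsAsFactors = FALSE)

matched   <- semi_join(data1, data2, by = "ID")  # rows of data1 with a match in data2
unmatched <- anti_join(data1, data2, by = "ID")  # rows of data1 without a match

# Stacking the two results and sorting by key recovers the rows of data1
recombined <- bind_rows(matched, unmatched) %>% arrange(ID)
```

Unlike left_join or inner_join, neither result contains the X2 column, because filtering joins only select rows of the left table.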

❀ Slicing: slice()
# Create DataFrame
df <- data.frame(
id = c(10,11,12,13,14,15,16,17),
name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin'),
gender = c('M','M','F','F','M','M','M','F'),
dob = as.Date(c('1990-10-02','1981-3-24','1987-6-14','1985-8-16',
'1995-03-02','1991-6-21','1986-3-24','1990-8-26')),
state = c('CA','NY',NA,NA,'DC','DW','AZ','PH'),
row.names=c('r1','r2','r3','r4','r5','r6','r7','r8')
)
1. Slice rows by index position
df2 <- df %>% slice(2,3)
df2
## id name gender dob state
## r2 11 ram M 1981-03-24 NY
## r3 12 deepika F 1987-06-14 <NA>
2. Slice rows by a vector of index positions
df2 <- df %>% slice(c(2,3,5,6))
df2
## id name gender dob state
## r2 11 ram M 1981-03-24 NY
## r3 12 deepika F 1987-06-14 <NA>
## r5 14 kumar M 1995-03-02 DC
## r6 15 scott M 1991-06-21 DW
3. Slice rows by a range of index positions
df2 <- df %>% slice(2:6)
df2
## id name gender dob state
## r2 11 ram M 1981-03-24 NY
## r3 12 deepika F 1987-06-14 <NA>
## r4 13 sahithi F 1985-08-16 <NA>
## r5 14 kumar M 1995-03-02 DC
## r6 15 scott M 1991-06-21 DW
4. Drop rows by using negative index positions
df2 <- df %>% slice(-2:-6)
df2
## id name gender dob state
## r1 10 sai M 1990-10-02 CA
## r7 16 Don M 1986-03-24 AZ
## r8 17 Lin F 1990-08-26 PH
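Besides explicit positions, dplyr (version 1.0.0 or later is assumed here) offers helpers for the common cases of taking rows from the start or the end. The sketch below uses a reduced, two-column version of the example data frame; the result names are illustrative.

```r
library(dplyr)

df <- data.frame(
  id   = 10:17,
  name = c('sai','ram','deepika','sahithi','kumar','scott','Don','Lin')
)

first_two <- df %>% slice_head(n = 2)  # same rows as slice(1:2)
last_two  <- df %>% slice_tail(n = 2)  # same rows as slice(7:8) for this data
last_row  <- df %>% slice(n())         # n() is the number of rows in df
```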
❀ Removing duplicates: distinct
# Create DataFrame
df <- data.frame(
id = c(11,33,44,44),
pages = c(32,33,22,22),
name = c('spark','R','java','jsp'),
chapters = c(76,11,15,15),
price = c(144,321,567,567)
)
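1. Distinct rows across all columns (here every row differs in at least one column, so nothing is dropped)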
df2 <- df %>% distinct()
df2
## id pages name chapters price
## 1 11 32 spark 76 144
## 2 33 33 R 11 321
## 3 44 22 java 15 567
## 4 44 22 jsp 15 567
2. Distinct rows for the selected columns
df2 <- df %>% distinct(id,pages)
df2
## id pages
## 1 11 32
## 2 33 33
## 3 44 22
3. “.keep_all = TRUE”: distinct rows for the selected columns, keeping all other columns
df2 <- df %>% distinct(id,pages, .keep_all = TRUE)
df2
## id pages name chapters price
## 1 11 32 spark 76 144
## 2 33 33 R 11 321
## 3 44 22 java 15 567
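When `.keep_all = TRUE` collapses duplicate combinations, the first matching row is the one retained, which is why 'java' rather than 'jsp' appears above. The sketch below checks this on a reduced version of the example data and also counts distinct values with `n_distinct()`; the result names are illustrative.

```r
library(dplyr)

df <- data.frame(
  id    = c(11, 33, 44, 44),
  pages = c(32, 33, 22, 22),
  name  = c('spark', 'R', 'java', 'jsp')
)

# For a duplicated (id, pages) pair, the first row encountered is kept
kept <- df %>% distinct(id, pages, .keep_all = TRUE)

# n_distinct() counts unique values without returning the rows
n_ids <- n_distinct(df$id)
```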
❀ (Random) sampling: sample_n/sample_frac/slice_sample
sample_n() and sample_frac() have been superseded by slice_sample().
df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))
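1. sample_n / slice_sample: draw a random sample (by number of rows)
sample_n(df, size, replace = FALSE)
Arguments: df is the data; size is the number of rows to select; replace = TRUE/FALSE controls whether sampling is done with replacement (the main argument); weight gives the sampling weights.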
sample_n(df, 3) # now: slice_sample(df, n = 3)
## # A tibble: 3 × 2
## x w
## <int> <dbl>
## 1 1 0.1
## 2 2 0.1
## 3 4 2
sample_n(df, 10, replace = TRUE) # now: slice_sample(df, n = 10, replace = TRUE)
## # A tibble: 10 × 2
## x w
## <int> <dbl>
## 1 2 0.1
## 2 1 0.1
## 3 4 2
## 4 4 2
## 5 1 0.1
## 6 3 0.1
## 7 1 0.1
## 8 3 0.1
## 9 5 2
## 10 1 0.1
sample_n(df, 3, weight = w) # now: slice_sample(df, n = 3, weight_by = w)
## # A tibble: 3 × 2
## x w
## <int> <dbl>
## 1 4 2
## 2 5 2
## 3 3 0.1
2. sample_frac / slice_sample: draw a random sample (by proportion of rows)
sample_frac(df, 0.25) # now: slice_sample(df, prop = 0.25)
## # A tibble: 1 × 2
## x w
## <int> <dbl>
## 1 5 2
sample_frac(df, 2, replace = TRUE) # now: slice_sample(df, prop = 2, replace = TRUE)
## # A tibble: 10 × 2
## x w
## <int> <dbl>
## 1 1 0.1
## 2 5 2
## 3 2 0.1
## 4 1 0.1
## 5 4 2
## 6 2 0.1
## 7 4 2
## 8 4 2
## 9 2 0.1
## 10 4 2
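Because the rows are drawn at random, the outputs above will differ from run to run. A minimal sketch of making a draw reproducible with base R's `set.seed()` follows; the seed value 42 and the result names are arbitrary choices for illustration.

```r
library(dplyr)

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))

# Resetting the seed before each call yields the same rows both times
set.seed(42)
first_draw <- slice_sample(df, n = 3)
set.seed(42)
second_draw <- slice_sample(df, n = 3)

# Weighted draw: rows with larger w are more likely to be selected
weighted <- slice_sample(df, n = 3, weight_by = w)

# prop > 1 needs replace = TRUE; 2 * 5 rows gives 10 rows
oversample <- slice_sample(df, prop = 2, replace = TRUE)
```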
❀ Ranking functions: row_number/min_rank/dense_rank
student_df <- tibble(
name = c('张三', '李四', '王五', '赵六', '孙七', '周八', '吴九'),
score = c(85, 83, 96, 92, 96, 95, 92)
)
1. row_number: tied values get different ranks, so no rank is repeated
student_df %>%
mutate(asc_order = row_number(score),
desc_order = row_number(desc(score)))
## # A tibble: 7 × 4
## name score asc_order desc_order
## <chr> <dbl> <int> <int>
## 1 张三 85 2 6
## 2 李四 83 1 7
## 3 王五 96 6 1
## 4 赵六 92 3 4
## 5 孙七 96 7 2
## 6 周八 95 5 3
## 7 吴九 92 4 5
2. min_rank: tied values share the same rank, and the ranks after a tie leave gaps
student_df %>%
mutate(asc_order = min_rank(score))
## # A tibble: 7 × 3
## name score asc_order
## <chr> <dbl> <int>
## 1 张三 85 2
## 2 李四 83 1
## 3 王五 96 6
## 4 赵六 92 3
## 5 孙七 96 6
## 6 周八 95 5
## 7 吴九 92 3
3. dense_rank: tied values share the same rank, and the ranks stay consecutive
student_df %>%
mutate(asc_order = dense_rank(score))
## # A tibble: 7 × 3
## name score asc_order
## <chr> <dbl> <int>
## 1 张三 85 2
## 2 李四 83 1
## 3 王五 96 5
## 4 赵六 92 3
## 5 孙七 96 5
## 6 周八 95 4
## 7 吴九 92 3
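A common use of these functions is selecting the top scorers. The sketch below ranks in descending order with `min_rank(desc(score))` and keeps the rows ranked 3 or better; the column name `rank_desc` and the cutoff of 3 are assumptions made for illustration.

```r
library(dplyr)

student_df <- tibble(
  name  = c('张三', '李四', '王五', '赵六', '孙七', '周八', '吴九'),
  score = c(85, 83, 96, 92, 96, 95, 92)
)

# Highest score gets rank 1; tied scores share a rank
ranked <- student_df %>%
  mutate(rank_desc = min_rank(desc(score)))

# Keep everyone ranked 3rd or better, best first
top3 <- ranked %>%
  filter(rank_desc <= 3) %>%
  arrange(rank_desc)
```

With min_rank a tie at the cutoff keeps all tied rows, so the result can contain more rows than the cutoff; row_number would instead return exactly three.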
❀ Offset functions: lead/lag
1. lead: shifts forward, so each row gets the value from a following row; the default offset is 1
student_df %>%
mutate(lead_pop = lead(score))
## # A tibble: 7 × 3
## name score lead_pop
## <chr> <dbl> <dbl>
## 1 张三 85 83
## 2 李四 83 96
## 3 王五 96 92
## 4 赵六 92 96
## 5 孙七 96 95
## 6 周八 95 92
## 7 吴九 92 NA
student_df %>%
mutate(lead_pop = lead(score, n = 2)) # shift forward by 2 positions
## # A tibble: 7 × 3
## name score lead_pop
## <chr> <dbl> <dbl>
## 1 张三 85 96
## 2 李四 83 92
## 3 王五 96 96
## 4 赵六 92 95
## 5 孙七 96 92
## 6 周八 95 NA
## 7 吴九 92 NA
2. lag: shifts backward, so each row gets the value from a preceding row; the default offset is 1
student_df %>%
mutate(lag_pop = lag(score))
## # A tibble: 7 × 3
## name score lag_pop
## <chr> <dbl> <dbl>
## 1 张三 85 NA
## 2 李四 83 85
## 3 王五 96 83
## 4 赵六 92 96
## 5 孙七 96 92
## 6 周八 95 96
## 7 吴九 92 95
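Offsets are typically used to compare a row with its neighbour. The sketch below computes the change from the previous row's score and shows the `default` argument, which replaces the NA padding; the column names and the fill value 0 are chosen here for illustration.

```r
library(dplyr)

student_df <- tibble(
  name  = c('张三', '李四', '王五', '赵六', '孙七', '周八', '吴九'),
  score = c(85, 83, 96, 92, 96, 95, 92)
)

changes <- student_df %>%
  mutate(
    diff_prev    = score - lag(score),        # NA in the first row (no predecessor)
    next_or_zero = lead(score, default = 0)   # 0 instead of NA in the last row
  )
```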
References:
- https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
- https://sparkbyexamples.com/r-programming/r-dplyr-slice-function/
- https://sparkbyexamples.com/r-programming/dplyr-distinct-function-usage/
- https://rdrr.io/github/tidyverse/dplyr/man/sample_n.html
- A Complete Function Guide to the dplyr Package in R (https://blog.csdn.net/weixin_49238165/article/details/107362676)