Data Processing 2 (in R)

Author: 北欧森林 | Published: 2021-02-16 23:03


Finding rows that have no match in another table

library(dplyr)

a <- anti_join(x, y, by = "ID")  # keep only the rows of x whose ID has no match in y
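
For readers without dplyr installed, the same filter can be sketched in base R. The tables x and y below are made-up examples, and only the ID column is used for matching, as in the call above.

```r
# Hypothetical example tables (not from the original notes)
x <- data.frame(ID = c(1, 2, 3, 4), value = c("a", "b", "c", "d"))
y <- data.frame(ID = c(2, 4, 5))

# Base R counterpart of anti_join(x, y, by = "ID"):
# keep the rows of x whose ID does not occur in y
only_in_x <- x[!(x$ID %in% y$ID), ]
only_in_x  # the rows with ID 1 and 3
```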

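Loading copied (clipboard) content into R

# "clipboard" is available on Windows; on macOS use pipe("pbpaste") instead
df1 <- read.table("clipboard", header = TRUE, sep = "\t")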

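Finding rows or columns that contain a given string

test[grep("aa", test$name), ]  # rows whose name column contains "aa"
test[, grep("aa", names(test)), drop = FALSE]  # columns whose names contain "aa"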
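
A small sketch of the same lookup with grepl(), which returns a logical vector and therefore also behaves sensibly when nothing matches. The data frame and its column names are invented for illustration.

```r
# Hypothetical data frame for illustration
test <- data.frame(
  name = c("aab", "bcd", "xaa"),
  aa_score = c(1, 2, 3),
  other = c(4, 5, 6)
)

# Rows whose name column contains "aa"
rows_with_aa <- test[grepl("aa", test$name), ]

# Columns whose names contain "aa"; drop = FALSE keeps a data frame
cols_with_aa <- test[, grepl("aa", names(test)), drop = FALSE]

rows_with_aa$name    # "aab" "xaa"
names(cols_with_aa)  # "aa_score"
```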

Watch the levels of a factor: levels that never occur in the data are still kept

b <- factor(1:3, levels = 1:5); b
## [1] 1 2 3
## Levels: 1 2 3 4 5
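
Why this matters can be seen in a short sketch: unused levels still show up in tabulations until they are dropped explicitly, for example with droplevels().

```r
b <- factor(1:3, levels = 1:5)

level_counts <- table(b)  # levels 4 and 5 appear with a count of 0
nlevels(b)                # 5

b2 <- droplevels(b)       # remove the levels that never occur
nlevels(b2)               # 3
```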

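Changing the order of factor levels (R factors are either ordered or unordered; by default the levels are the sorted unique values, which for strings means alphabetical order)
For unordered factors:

# Create a factor whose levels are in the wrong order
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small
# Specify the level order explicitly
sizes <- factor(sizes, levels = c("small", "medium", "large"))
sizes
#> [1] small  large  large  small  medium
#> Levels: small medium large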

For ordered factors:

sizes <- ordered(c("small", "large", "large", "small", "medium")) 
sizes <- ordered(sizes, levels = c("small", "medium", "large")) 
sizes 
#> [1] small large large small medium 
#> Levels: small < medium < large

Bonus:

# Quickly reverse the level order
sizes <- factor(sizes, levels=rev(levels(sizes)))

source: https://sr-c.github.io/2018/09/16/Changing-the-order-of-levels-of-a-factor/
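
One pitfall worth a quick sketch (an addition to these notes, not taken from the linked source): reordering levels with factor(..., levels = ) leaves the data untouched, whereas assigning to levels() renames the existing levels and so changes the values.

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))

# Reordering: the values stay the same, only the level order changes
reordered <- factor(sizes, levels = c("small", "medium", "large"))
as.character(reordered)
# "small" "large" "large" "small" "medium"

# Relabelling: levels(sizes) is large, medium, small, so this assignment
# renames large -> small and small -> large, silently altering the data
relabelled <- sizes
levels(relabelled) <- c("small", "medium", "large")
as.character(relabelled)
# "large" "small" "small" "large" "medium"
```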

The difference between row.names and rownames:
There are two functions in the R core library:

row.names: Get and Set Row Names for Data Frames
rownames: Retrieve or set the row names of a matrix-like object.

If you don't want to bother distinguishing the two functions, then it would be logical to just use the generic version row.names() all the time, since it always dispatches the appropriate method. For example, if x is a matrix, then row.names(x) just passes cleanly through to rownames(x) because there is no more specific method for that class of object.
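
A minimal check of the behaviour described above, using a made-up matrix and data frame: both functions return the same row names for both kinds of object.

```r
# Hypothetical objects for illustration
m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b")))
df <- data.frame(a = 1:2, b = 3:4, row.names = c("r1", "r2"))

row.names(m)   # "r1" "r2"  (falls through to rownames for a matrix)
rownames(m)    # "r1" "r2"
row.names(df)  # "r1" "r2"
rownames(df)   # "r1" "r2"
```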

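Renaming columns

library(tidyverse)
plyr::rename(d, c("old2" = "two", "old3" = "three"))

# Note: rename() in plyr and rename() in dplyr take their arguments differently.
## plyr::rename takes a named vector: old name = "new name"
plyr::rename(data, c("old" = "new"))

## dplyr::rename takes new name = old name
dplyr::rename(data, new = old)

# Method 2
library(reshape) # load the required package
dat <- reshape::rename(dat, c(国家 = "country")) # rename the column 国家 to country
head(dat)

# Method 3: set the column names to x1, x2, ..., x10
cnames <- paste0("x", 1:10)
colnames(dat) <- cnames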
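
If none of these packages is at hand, a lookup vector gives the same old-to-new renaming in base R. This is only a sketch; the data frame d and its column names are invented to mirror the plyr example above.

```r
# Hypothetical data frame mirroring the plyr example
d <- data.frame(old1 = 1:2, old2 = 3:4, old3 = 5:6)

# Named vector in the plyr style: old name = "new name"
lookup <- c(old2 = "two", old3 = "three")

# Rename only the columns that appear in the lookup
hit <- names(d) %in% names(lookup)
names(d)[hit] <- unname(lookup[names(d)[hit]])

names(d)  # "old1" "two" "three"
```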

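Replacing certain values in a data set

library(stringr)

# "?" is a regex quantifier, so wrap the pattern in fixed() to match it literally
str_replace_all(a$AFP, fixed("?1250"), "1250") # the second argument is the pattern to be replaced

# The two calls below are equivalent; pattern is the text to be replaced
gsub("?800", "800", a$AFP, fixed = TRUE)
gsub(pattern = "?800", replacement = "800", x = a$AFP, fixed = TRUE)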
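
The two usual ways of matching a literal "?" in base R can be sketched on a made-up vector (afp is a hypothetical stand-in for a$AFP): either switch off regex interpretation with fixed = TRUE, or escape the character.

```r
# Hypothetical values standing in for a$AFP
afp <- c("?1250", "300", "?800")

# Option 1: treat the pattern as a plain string
cleaned_fixed <- gsub("?", "", afp, fixed = TRUE)

# Option 2: keep regex matching but escape the question mark
cleaned_regex <- gsub("\\?", "", afp)

cleaned_fixed              # "1250" "300" "800"
as.numeric(cleaned_regex)  # 1250 300 800
```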

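Removing highly correlated variables

library(caret) # provides findCorrelation()

datTrain1 = datTrain[, -c(1, 6)] # drop columns 1 and 6 before computing correlations
descrCor = cor(datTrain1)
descrCor

# Indices of the columns suggested for removal (absolute pairwise correlation above the cutoff)
highlyCorDescr = findCorrelation(descrCor, cutoff = 0.75, names = FALSE, verbose = TRUE)
filteredTrain = datTrain1[, -highlyCorDescr]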
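
One thing to watch in the last line: if no column exceeds the cutoff, the index vector is empty, and a negative empty index selects no columns at all instead of keeping everything. A base R sketch of the pitfall and a simple guard, using a made-up data frame and hypothetical variable names:

```r
# Hypothetical data frame and an empty "columns to drop" result
df <- data.frame(a = 1:3, b = 4:6)
drop_idx <- integer(0)

# -integer(0) is still integer(0), so this selects zero columns
n_unguarded <- ncol(df[, -drop_idx])  # 0

# Guard: only subset when there is something to drop
kept <- if (length(drop_idx) > 0) df[, -drop_idx, drop = FALSE] else df
ncol(kept)  # 2
```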

Standardizing the test set (with the centering and scaling parameters estimated on the training set):

library(caret)
preProcValues = preProcess(datTrain, method = c("center", "scale"))
trainTransformed = predict(preProcValues, datTrain)
testTransformed = predict(preProcValues,datTest)
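
What the two predict() calls amount to can be sketched in base R: the means and standard deviations come from the training data only and are then applied to both sets. The small data frames here are invented for illustration.

```r
# Hypothetical training and test data
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40))
test  <- data.frame(x1 = c(2, 5), x2 = c(25, 15))

# Estimate the parameters on the training set only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same parameters to both sets
train_std <- scale(train, center = mu, scale = sigma)
test_std  <- scale(test,  center = mu, scale = sigma)

colMeans(train_std)  # approximately 0 for every column
colMeans(test_std)   # generally not 0, since the test set was not used for fitting
```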

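Removing near-zero-variance variables

nzv = nearZeroVar(datTrain) # indices of the near-zero-variance columns (caret)
nzv
if (length(nzv) > 0) datTrain = datTrain[, -nzv] # drop them, if any were found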
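
To give a feel for what such a filter looks at, here is a simplified base R sketch. It is an approximation and not caret's implementation: a column is flagged when it has a single distinct value, or when the most common value heavily outnumbers the second most common one and there are few distinct values overall. The function name, the thresholds as written, and the data are assumptions made for illustration.

```r
# Simplified sketch; near_zero_var() is a hypothetical helper, not a caret function
near_zero_var <- function(v, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)              # constant column
  freq_ratio <- counts[[1]] / counts[[2]]           # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(v)    # share of distinct values
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Hypothetical data: one constant, one nearly constant, one varied column
dat <- data.frame(
  constant = rep(1, 100),
  rare     = c(rep(0, 98), 1, 1),
  varied   = 1:100
)

flags <- vapply(dat, near_zero_var, logical(1))
names(flags)[flags]  # "constant" "rare"
```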

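Data types required for x and y when fitting a lasso regression

class(x)
class(y)
# [1] "data.frame"
# [1] "numeric"

x <- as.matrix(x)           # convert the data frame to a matrix
y <- as.numeric(unlist(y))  # make sure y is a plain numeric vector
class(x)
class(y)
# [1] "matrix" "array"
# [1] "numeric"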
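
A caveat worth sketching: as.matrix() only yields a numeric matrix when every column is numeric. With a factor or character column the result is a character matrix, so one common alternative is to let model.matrix() build dummy variables. The data frame below is made up for illustration.

```r
# Hypothetical predictors with one factor column
x <- data.frame(age = c(30, 40, 50), sex = factor(c("f", "m", "f")))

mode(as.matrix(x))  # "character": the factor column forces everything to text

# Dummy-code the factor and drop the intercept column
x_num <- model.matrix(~ ., data = x)[, -1, drop = FALSE]

mode(x_num)      # "numeric"
colnames(x_num)  # "age" "sexm"
```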

Article link: https://www.haomeiwen.com/subject/vwhlaktx.html
