Learning R Notes, Chapter 13: Cleaning

Author: 天火燎原天 | Published: 2018-02-25 16:33


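Data cleaning is the most tedious and frustrating part of data analysis.

Cleaning strings

Base R functions

grep, grepl, sub, and regexpr are four string matching and substitution functions that ship with base R.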

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

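grep returns the indices of the elements that match pattern, as an integer vector by default; with value = TRUE it returns the matching elements themselves.
grepl returns a logical vector indicating which elements match pattern.
sub returns a character vector of the same length as the input, with the first match of pattern in each element replaced by replacement.
regexpr returns an integer vector of the same length as the input, giving the starting position of the first match of pattern in each element, or -1 where there is no match.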
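To make the four return types concrete, here is a minimal sketch using only base R. The `fruit` vector is borrowed from the stringr examples later in these notes; the patterns are arbitrary choices for illustration.

```r
# Sample data (same vector as the stringr examples in these notes)
fruit <- c("apple", "banana", "pear", "pinapple")

# grep: integer indices of the matching elements
idx <- grep("an", fruit)                  # 2

# grep with value = TRUE: the matching elements themselves
hits <- grep("an", fruit, value = TRUE)   # "banana"

# grepl: one logical per input element
starts_with_p <- grepl("^p", fruit)       # FALSE FALSE TRUE TRUE

# sub: replaces only the first match in each element
first_a_upper <- sub("a", "A", fruit)     # "Apple" "bAnana" "peAr" "pinApple"

# regexpr: start position of the first match, -1 if none
# (as.integer drops the match.length attributes)
pos <- as.integer(regexpr("pp", fruit))   # 2 -1 -1 5
```

Because grepl always returns a vector as long as its input, it is usually the more convenient choice for subsetting, for example `fruit[grepl("^p", fruit)]`.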

The stringr package

stringr provides a consistent set of wrapper functions that make string manipulation more convenient.

modifier functions

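Note that stringr interprets pattern as a regular expression (regex) by default. To change how a pattern is matched, stringr provides four modifier functions. In fixed, coll, and regex, the ignore_case argument switches case-insensitive matching on or off.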

fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.

fixed(pattern, ignore_case = FALSE)

coll: Compare strings respecting standard collation rules.

coll(pattern, ignore_case = FALSE, locale = "en", ...)

regex: The default. Uses ICU regular expressions.

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
  dotall = FALSE, ...)

boundary: Match boundaries between things.

boundary(type = c("character", "line_break", "sentence", "word"),
  skip_word_none = NA, ...)
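The difference between the modifiers is easiest to see on the same input. The sketch below assumes stringr is installed; the vector `x` and the sentence passed to boundary are made-up examples, not from the book.

```r
library(stringr)

# Hypothetical inputs, chosen so that "." behaves differently as regex vs. literal
x <- c("a.b", "A.B", "axb")

# Default (regex): "." matches any single character
as_regex <- str_detect(x, "a.b")                                 # TRUE FALSE TRUE

# fixed(): "." is matched literally
as_fixed <- str_detect(x, fixed("a.b"))                          # TRUE FALSE FALSE

# regex() with ignore_case = TRUE: still a regex, but case-insensitive
as_regex_ci <- str_detect(x, regex("a.b", ignore_case = TRUE))   # TRUE TRUE TRUE

# boundary("word"): count words rather than matching a pattern
n_words <- str_count("one two  three", boundary("word"))         # 3
```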

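str_detect (grepl)

str_detect() is the counterpart of grepl and returns a logical vector. pattern may itself be a vector, in which case string and pattern are paired element by element, with the shorter one recycled.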

str_detect(string, pattern)
> fruit <- c("apple", "banana", "pear", "pinapple")
> str_detect(fruit, "^a")
[1]  TRUE FALSE FALSE FALSE
> str_detect("aecfg", letters[1:6])
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE
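In a cleaning workflow the logical vector is typically used to subset or to count rows. A small sketch, assuming stringr is installed and reusing the `fruit` vector from above:

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pinapple")

# Logical vector: which elements contain "an"?
has_an <- str_detect(fruit, "an")

# Use it to subset the original vector
matched <- fruit[has_an]                       # "banana"

# For a plain regex, the result agrees with base R's grepl
same_as_grepl <- identical(has_an, grepl("an", fruit))

# Summing a logical vector counts the matches
n_starting_with_p <- sum(str_detect(fruit, "^p"))   # 2
```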

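str_split (strsplit)

str_split is the counterpart of base R's strsplit. It takes a character vector and returns a list with the split pieces of each element. If you know that every element splits into the same number of pieces, use str_split_fixed instead, which returns a character matrix.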

str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n) # n is the number of pieces to return
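The notes give no example for splitting, so here is a minimal sketch, assuming stringr is installed. The `dates` vector is invented for illustration.

```r
library(stringr)

# Hypothetical input: dates that always split into three pieces
dates <- c("2018-02-25", "2017-11-03")

# str_split returns a list with one character vector per input element
parts <- str_split(dates, "-")
first_parts <- parts[[1]]                    # "2018" "02" "25"

# str_split_fixed returns a character matrix with n columns
m <- str_split_fixed(dates, "-", n = 3)
years <- m[, 1]                              # "2018" "2017"

# n also caps the number of pieces in str_split; the remainder stays unsplit
capped <- str_split("a-b-c", "-", n = 2)[[1]]   # "a" "b-c"
```

The matrix form is convenient when the pieces are going to become columns of a data frame.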

str_count

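str_count returns the number of matches of pattern in each element, as an integer vector. pattern defaults to the empty string, in which case every character is counted.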

str_count(string, pattern = "")
> str_count(fruit)
[1] 5 6 4 8
> str_count(fruit, c("a", "b", "p", "p"))
[1] 1 1 1 3 # note the vectorised pairing: each string is matched against the pattern in the same position
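The element-by-element pairing in the last call is easy to misread, so the sketch below contrasts a single pattern (recycled across all strings) with a vector of patterns, and spells out the pairing with mapply. It assumes stringr is installed; the `patterns` name is introduced here for illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pinapple")

# One pattern: recycled, so every string is searched for "p"
p_counts <- str_count(fruit, "p")                 # 2 0 1 3

# Four patterns: string i is searched for pattern i only
patterns <- c("a", "b", "p", "p")
paired_counts <- str_count(fruit, patterns)       # 1 1 1 3

# The same pairing written out explicitly
explicit <- mapply(function(s, p) str_count(s, p),
                   fruit, patterns, USE.NAMES = FALSE)
```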

str_replace (sub)

str_replace is the counterpart of base R's sub: it replaces only the first match within each element of string. str_replace_all replaces every match, like gsub.

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
> str_replace(fruit, "[aeiou]", "-")
[1] "-pple"    "b-nana"   "p-ar"     "p-napple"
> str_replace_all(fruit, "[aeiou]", "-")
[1] "-ppl-"    "b-n-n-"   "p--r"     "p-n-ppl-"
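As a quick check of the correspondence with sub and gsub, and to show a typical cleaning use, here is a small sketch. It assumes stringr is installed; the `messy` string is a made-up example.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pinapple")

# For this simple pattern, str_replace agrees with sub, and str_replace_all with gsub
first_same <- identical(str_replace(fruit, "[aeiou]", "-"),
                        sub("[aeiou]", "-", fruit))
all_same <- identical(str_replace_all(fruit, "[aeiou]", "-"),
                      gsub("[aeiou]", "-", fruit))

# Typical cleaning step: collapse runs of whitespace into a single space
messy <- "too   many    spaces"
tidy <- str_replace_all(messy, "\\s+", " ")   # "too many spaces"
```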

str_replace_na is a special-purpose wrapper that turns NA into the string "NA".

str_replace_na(string, replacement = "NA")
> str_replace_na(c(NA, "abc", "def"))
[1] "NA"  "abc" "def"

Article link: https://www.haomeiwen.com/subject/sqjrxftx.html
