Notes on "Learning R": Chapter 13 Clean


Author: 天火燎原天 | Published: 2018-02-25 16:33

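    Data cleaning is the most tedious and frustrating part of data analysis.

    String cleaning

    Base R functions

    grep, grepl, sub and regexpr are base R functions for matching and replacing strings.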

    grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
         fixed = FALSE, useBytes = FALSE, invert = FALSE)
    grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
          fixed = FALSE, useBytes = FALSE)
    sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
    regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
            fixed = FALSE, useBytes = FALSE)
    

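    grep returns the indices of the elements that match pattern, as an integer vector by default (with value = TRUE it returns the matching elements themselves);
    grepl returns a logical vector indicating which elements match pattern;
    sub returns a character vector of the same length as the input, with the first match of pattern in each element replaced by replacement;
    regexpr returns an integer vector of the same length as the input, giving the starting position of the first match of pattern in each element, or -1 if there is no match.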
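
    The four functions above can be compared side by side. This is a minimal sketch using base R only; the vector x is made up for illustration and is not from the book.

```r
# Illustrative input (an assumption, not taken from the original notes)
x <- c("apple pie", "banana", "grape")

idx   <- grep("ap", x)                # indices of matching elements: 1 3
vals  <- grep("ap", x, value = TRUE)  # the matching elements themselves
hits  <- grepl("ap", x)               # TRUE FALSE TRUE
first <- sub("p", "P", x)             # only the first "p" in each element is replaced
pos   <- regexpr("p", x)              # start of the first match, -1 when there is none

print(idx)
print(hits)
print(first)
print(as.integer(pos))                # drop the match.length attribute for display
```

    Note that regexpr also attaches a match.length attribute to its result, which is why the sketch converts it with as.integer before printing.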

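    The stringr package

    stringr provides a set of wrappers that make string manipulation more convenient.

    modifier functions

    Note that a pattern in stringr is interpreted as a regular expression (regex) by default. To change how a pattern is interpreted, stringr offers 4 modifier functions. The ignore_case argument, available in fixed, coll and regex, switches case-insensitive matching on or off.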

    fixed:Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.

    fixed(pattern, ignore_case = FALSE)
    

    coll:Compare strings respecting standard collation rules.

    coll(pattern, ignore_case = FALSE, locale = "en", ...)
    

    regex:The default. Uses ICU regular expressions.

    regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
      dotall = FALSE, ...)
    

    boundary:Match boundaries between things.

    boundary(type = c("character", "line_break", "sentence", "word"),
      skip_word_none = NA, ...)
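
    As a rough illustration of how the modifiers change the meaning of a pattern, the sketch below wraps patterns in fixed, regex and boundary. It assumes stringr is installed; the input strings are invented for the example.

```r
library(stringr)

# Illustrative input (an assumption, not taken from the original notes)
s <- c("a.b", "axb", "A.B")

as_regex   <- str_detect(s, ".")                                  # "." matches any character
as_literal <- str_detect(s, fixed("."))                           # only a literal dot matches
no_case    <- str_detect(s, regex("a\\.b", ignore_case = TRUE))   # escaped dot, case ignored
n_words    <- str_count("Data cleaning is tedious.", boundary("word"))

print(as_regex)
print(as_literal)
print(no_case)
print(n_words)
```

    Wrapping the pattern in fixed is the usual way to search for characters such as "." or "(" without escaping them.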
    

    str_detect (grepl)

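    str_detect() is the counterpart of grepl and returns a logical vector. pattern can itself be a vector, in which case it is matched against string element by element, with recycling.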

    str_detect(string, pattern)
    > fruit <- c("apple", "banana", "pear", "pinapple")
    > str_detect(fruit, "^a")
    [1]  TRUE FALSE FALSE FALSE
    > str_detect("aecfg", letters[1:6])
    [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE
    

    str_split(strsplit)

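    str_split is the counterpart of base R's strsplit. It takes a character vector and returns a list of the split pieces. If every input should be split into the same number of pieces, use str_split_fixed instead, which returns a character matrix with n columns.

    str_split(string, pattern, n = Inf, simplify = FALSE)
    str_split_fixed(string, pattern, n) # n is the number of pieces to return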
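
    The difference between the list and matrix return types can be sketched as follows. This assumes stringr is installed; the date-like strings are invented for the example.

```r
library(stringr)

# Illustrative input (an assumption, not taken from the original notes)
dates <- c("2018-02-25", "2018-03")

pieces <- str_split(dates, "-")            # list: one character vector per input element
mat    <- str_split_fixed(dates, "-", 3)   # 2 x 3 character matrix, short rows padded with ""
capped <- str_split("a-b-c", "-", n = 2)   # at most 2 pieces: "a" and "b-c"

print(pieces)
print(mat)
print(capped)
```

    Because str_split_fixed pads short results with empty strings, it is worth checking for "" in the last columns when the inputs are not perfectly regular.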
    

    str_count

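    str_count returns the number of matches of pattern in each element, as an integer vector. pattern defaults to the empty string, in which case str_count counts the characters in each element.

    str_count(string, pattern = "")
    > str_count(fruit)
    [1] 5 6 4 8
    > str_count(fruit, c("a", "b", "p", "p"))
    [1] 1 1 1 3 # note: string and pattern are paired element by element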
    

    str_replace(sub)

    str_replace is the counterpart of base R's sub: it replaces only the first match within each element of string. str_replace_all replaces every match.

    str_replace(string, pattern, replacement)
    str_replace_all(string, pattern, replacement)
    > str_replace(fruit, "[aeiou]", "-")
    [1] "-pple"    "b-nana"   "p-ar"     "p-napple"
    > str_replace_all(fruit, "[aeiou]", "-")
    [1] "-ppl-"    "b-n-n-"   "p--r"     "p-n-ppl-"
    

    str_replace_na is a special-purpose wrapper that converts missing values (NA) into the string "NA".

    str_replace_na(string, replacement = "NA")
    > str_replace_na(c(NA, "abc", "def"))
    [1] "NA"  "abc" "def"
