Processing Strings in R with the stringr Package

Author: RSP小白之路 | Published: 2021-10-17 21:07

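    The stringr package is part of the Tidyverse, a widely used collection of R packages for data processing. It is a convenient tool for working with strings, and it becomes especially powerful when combined with regular expressions.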

    String length

    stringr functions operate on vectors. The str_length() function returns the length of each string in a vector.

    > library(stringr)
    > x <- c("why", "video", "cross", "extra", "deal", "authority")
    > str_length(x)
    #> [1] 3 5 5 5 4 9
    

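    Written as below, however, the call fails. This is not a syntax error: str_length() accepts a single string argument, so the strings after the first are passed as separate, unused arguments rather than as one vector.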

    > str_length("why", "video", "cross", "extra", "deal", "authority")
    Error in str_length("why", "video", "cross", "extra", "deal", "authority") : 
      unused arguments ("video", "cross", "extra", "deal", "authority")
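
    As a minimal sketch of the vectorized behavior (assuming stringr is installed; the variable names words, lens, and edge are illustrative only), wrapping the strings in c() gives str_length() the single vector it expects. Missing values and empty strings are handled element by element:

```r
library(stringr)

# Wrap the strings in c() so they reach str_length() as one vector
words <- c("why", "video", "authority")
lens <- str_length(words)

# A missing value stays missing; the empty string has length 0
edge <- str_length(c("abc", NA, ""))

print(lens)  # 3 5 9
print(edge)  # 3 NA 0
```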
    

    String concatenation

    The str_c() function concatenates strings. Its main arguments are the string vectors to join, sep = "", and collapse = NULL.

    > x <- c('apple', 'banana', 'peach')
    > y <- c('one', 'two', 'three')
    > str_c(x, y)
    # [1] "appleone"   "bananatwo"  "peachthree"
    > str_c(x, y, sep = '_')
    # [1] "apple_one"   "banana_two"  "peach_three"
    > str_c(x, y, collapse = "_")
    # [1] "appleone_bananatwo_peachthree"
    

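    Note the difference between sep and collapse in the examples above: with sep the result is still a vector of several strings, whereas with collapse the result is a single string.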

    > case1 <- str_c(x, y, sep = '_')
    > str_length(case1)
    # [1]  9 10 11
    > case2 <- str_c(x, y, collapse = "_")
    > str_length(case2)
    # [1] 29
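
    The two arguments can also be combined. The sketch below (illustrative variable names, assuming stringr is installed) first joins the vectors element by element with sep, then merges the resulting pieces with collapse. It also shows that a length-one argument is recycled across the longer vector:

```r
library(stringr)

x <- c("apple", "banana", "peach")
y <- c("one", "two", "three")

# sep joins element-wise, then collapse joins the resulting pieces
joined <- str_c(x, y, sep = "_", collapse = ", ")

# A length-one argument is recycled to match the longer vector
prefixed <- str_c("fruit-", x)

print(joined)    # "apple_one, banana_two, peach_three"
print(prefixed)  # "fruit-apple" "fruit-banana" "fruit-peach"
```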
    

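    String splitting

    str_split() is the stringr function for splitting strings. It splits each string on a given pattern, optionally limiting the number of pieces, so that specific pieces can then be selected.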

    # Build a character vector whose elements are delimited by '_'
    > x <- c('aajs_123_dkks', 'ahda_236_akdk', 'ahdj_178_ajdj', 'agsh_109_auqyr', 'qwp_2635_qnjx')
    > str_split(x, pattern = '_')
    [[1]]
    [1] "aajs" "123"  "dkks"
    
    [[2]]
    [1] "ahda" "236"  "akdk"
    
    [[3]]
    [1] "ahdj" "178"  "ajdj"
    
    [[4]]
    [1] "agsh"  "109"   "auqyr"
    
    [[5]]
    [1] "qwp"  "2635" "qnjx"
    

    The main arguments are pattern, n = Inf, and simplify = FALSE. By default the return value is a list; with simplify = TRUE it is a character matrix (class "matrix" "array").

    > class(str_split(x, pattern = '_'))
    [1] "list"
    > str_split(x, pattern = '_', simplify = TRUE)
         [,1]   [,2]   [,3]   
    [1,] "aajs" "123"  "dkks" 
    [2,] "ahda" "236"  "akdk" 
    [3,] "ahdj" "178"  "ajdj" 
    [4,] "agsh" "109"  "auqyr"
    [5,] "qwp"  "2635" "qnjx" 
    > class(str_split(x, pattern = '_', simplify = TRUE))
    # [1] "matrix" "array" 
    

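    After splitting a character vector, you can select a specific column for later use. In this example, the numeric column can be picked out with ordinary matrix subsetting.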

    > str_split(x, pattern = '_', simplify = TRUE)[,2]
    [1] "123"  "236"  "178"  "109"  "2635"
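
    The n argument mentioned above is not exercised in the examples, so here is a small sketch of it (illustrative variable names, assuming stringr is installed). It also uses str_split_fixed(), a related stringr function that always returns a character matrix with exactly n columns:

```r
library(stringr)

x <- c("aajs_123_dkks", "qwp_2635_qnjx")

# n = 2 stops after the first split, so the remainder stays intact
two_parts <- str_split(x, pattern = "_", n = 2)

# str_split_fixed() returns a character matrix with n columns
m <- str_split_fixed(x, pattern = "_", n = 3)
ids <- m[, 2]

print(two_parts[[1]])  # "aajs" "123_dkks"
print(ids)             # "123" "2635"
```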
    

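    String subsetting

    str_subset() selects the strings in a vector that match a given pattern, which can be a regular expression. Its arguments include pattern and negate. negate defaults to FALSE; setting it to TRUE returns the strings that do not match.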

    > x <- c("why", "video", "cross", "extra", "deal", "authority")
    > str_subset(x, pattern = 'o')
    [1] "video"     "cross"     "authority"
    > str_subset(x, pattern = 'o', negate = TRUE)
    [1] "why"   "extra" "deal" 
    

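    The pattern is interpreted as a regular expression, so a character class behaves differently from a plain sequence of characters: 'oi' matches only the literal sequence "oi", while '[oi]' matches either "o" or "i".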

    > x <- c("why", "video", "cross", "extra", "deal", "authority")
    > str_subset(x, pattern = 'oi')
    # character(0)
    > str_subset(x, pattern = '[oi]')
    [1] "video"     "cross"     "authority"
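
    As a further sketch of regular expressions in str_subset() (illustrative variable names, assuming stringr is installed), the anchor ^ restricts the match to the start of each string. The example also shows how str_subset() relates to str_detect(), which returns a logical vector instead of the matching strings:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# "^" anchors the character class to the start of the string
starts_with_vowel <- str_subset(x, pattern = "^[aeiou]")

# str_detect() returns TRUE/FALSE per element; indexing with it
# gives the same strings that str_subset() returns
has_o <- str_detect(x, "o")
same <- identical(x[has_o], str_subset(x, "o"))

print(starts_with_vowel)  # "extra" "authority"
print(same)               # TRUE
```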
    

    String replacement

    str_replace() replaces matched text. Its arguments include the pattern to match, pattern, and the text to substitute, replacement.

    > x <- c("why", "video", "cross", "extra", "deal", "authority")
    > str_replace(x, 'i', '@')
    [1] "why"       "v@deo"     "cross"     "extra"     "deal"      "author@ty"
    

    With a regular-expression character class the result differs from a plain character sequence:

    > str_replace(x, 'ie', '@@')
    [1] "why"       "video"     "cross"     "extra"     "deal"      "authority"
    > str_replace(x, '[ie]', '@@')
    [1] "why"        "v@@deo"     "cross"      "@@xtra"     "d@@al"      "author@@ty"
    

    Note that str_replace() replaces only the first match in each string; use str_replace_all() to replace every match:

    > x <- c('apple', 'happy')
    > x
    [1] "apple" "happy"
    > str_replace(string = x, pattern = 'p', replacement = '%')
    [1] "a%ple" "ha%py"
    > str_replace_all(string = x, pattern = 'p', replacement = '%')
    [1] "a%%le" "ha%%y"
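
    Because the pattern is a regular expression, characters such as "." have a special meaning. The sketch below (illustrative variable names, assuming stringr is installed) contrasts the default regex interpretation with fixed(), the stringr helper that treats the pattern as literal text:

```r
library(stringr)

v <- "a.b.c"

# As a regex, "." matches any character, so every character is replaced
as_regex <- str_replace_all(v, pattern = ".", replacement = "-")

# fixed() makes the pattern literal, so only the dots are replaced
as_literal <- str_replace_all(v, pattern = fixed("."), replacement = "-")

print(as_regex)    # "-----"
print(as_literal)  # "a-b-c"
```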
    

    In addition, str_replace_na() replaces missing values with a string:

    > x <- c('one',  NA,'ten', NA, 'eleven',NA)
    > x
    [1] "one"    NA       "ten"    NA       "eleven" NA   
    > str_replace_na(string = x, replacement = '%')
    [1] "one"    "%"      "ten"    "%"      "eleven" "%" 
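
    One situation where this matters is concatenation, because str_c() propagates missing values. A minimal sketch (illustrative variable names, assuming stringr is installed):

```r
library(stringr)

x <- c("one", NA, "ten")

# str_c() returns NA wherever any input element is NA
with_na <- str_c(x, "!")

# Replacing NA first keeps every element in the result
filled <- str_c(str_replace_na(x, replacement = "missing"), "!")

# Without a replacement argument, NA becomes the string "NA"
default_fill <- str_replace_na(x)

print(with_na)       # "one!" NA "ten!"
print(filled)        # "one!" "missing!" "ten!"
print(default_fill)  # "one" "NA" "ten"
```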
    

    String padding

    str_pad() pads strings to a given width. Its arguments include string, width, side = c("left", "right", "both"), and pad = " ". For example:

    > str_pad(string = letters[1:7], width = 5, side = 'left', pad = '#')
    [1] "####a" "####b" "####c" "####d" "####e" "####f" "####g"
    
    > str_pad(string = letters[1:7], width = 5, side = 'both', pad = '#')
    [1] "##a##" "##b##" "##c##" "##d##" "##e##" "##f##" "##g##"
    
    > str_pad(string = letters[1:7], width = 5, side = 'right', pad = '#')
    [1] "a####" "b####" "c####" "d####" "e####" "f####" "g####"
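
    A common practical use is zero-padding identifiers to a fixed width. The sketch below (illustrative variable names, assuming stringr is installed) relies on the default side = "left" and also shows that strings already longer than width are returned unchanged:

```r
library(stringr)

ids <- c("7", "42", "123")

# Default side is "left", so zeros are added in front
padded <- str_pad(ids, width = 3, pad = "0")

# Strings longer than width are not truncated
too_long <- str_pad("12345", width = 3, pad = "0")

print(padded)    # "007" "042" "123"
print(too_long)  # "12345"
```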
    


    Article link: https://www.haomeiwen.com/subject/ewjwoltx.html