Preparation: install the R package and load the data
rm(list = ls())
if(!require(stringr))install.packages('stringr')
library(stringr)
x <- "The birch canoe slid on the smooth planks."
1. Getting the length of a string
length(x)
str_length(x)
str_length(" ")
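Note that length(x) returns 1, the number of elements in the vector, whereas str_length(x) returns the number of characters in the string. The last line of code shows that a space also counts as one character.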
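The difference between length() and str_length() is easier to see on a vector with more than one element. A minimal sketch, using a made-up vector `words` that is not part of the original post:

```r
library(stringr)

# `words` is a hypothetical example vector, not taken from the original post
words <- c("canoe", "on", " ")

n_elements <- length(words)      # number of elements in the vector: 3
n_chars    <- str_length(words)  # number of characters per element: 5 2 1

n_elements
n_chars
```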
2. Splitting and combining strings
str_split(x," ")
class(str_split(x," "))
As you can see, splitting turns the vector into a list. Use list subsetting to get a character vector back.
x2 = str_split(x," ")[[1]]
class(x2)
x2
With simplify = TRUE, the split returns a character matrix instead
str_split(x," ",simplify = TRUE)
class(str_split(x," ",simplify = TRUE))
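One way to see why str_split() returns a list by default is to pass it more than one string: each input can yield a different number of pieces. The sketch below uses a hypothetical two-element vector `sentences` that does not appear in the original post.

```r
library(stringr)

# `sentences` is a hypothetical input with two strings of different word counts
sentences <- c("slid on", "the smooth planks")

# Default: a list with one character vector per input string
parts <- str_split(sentences, " ")
class(parts)
lengths(parts)   # 2 3

# simplify = TRUE: a character matrix, one row per input string;
# shorter rows are padded with empty strings
m <- str_split(sentences, " ", simplify = TRUE)
dim(m)           # 2 3
m
```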
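Now let's join the split pieces back together. collapse merges all elements into a single string, while sep joins the arguments element by element.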
x2
str_c(x2,collapse = " ")
str_c(x2,1234,sep = "+")
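The roles of sep and collapse can be compared side by side. This is a small sketch on a shortened, three-word version of x2, so the results are easy to read:

```r
library(stringr)

# Shortened stand-in for x2, used here only to keep the output small
x2 <- c("The", "birch", "canoe")

# collapse: merge all elements of one vector into a single string
joined <- str_c(x2, collapse = " ")
joined    # "The birch canoe"

# sep: paste the arguments together element by element (1234 is recycled)
paired <- str_c(x2, 1234, sep = "+")
paired    # "The+1234" "birch+1234" "canoe+1234"

# Both together: pair first, then collapse the pairs into one string
both <- str_c(x2, 1234, sep = "+", collapse = " ")
both      # "The+1234 birch+1234 canoe+1234"
```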
3. Extracting part of a string
x
str_sub(x,5,9)
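Characters 5 through 9 are "birch": "The" fills positions 1 to 3 and the space takes position 4, so a space clearly counts as one character.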
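str_sub() also accepts negative positions, which count from the end of the string. A short sketch on the same sentence:

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

first_word <- str_sub(x, 1, 3)     # positions 1 to 3
second_word <- str_sub(x, 5, 9)    # position 4 is the space
last_seven <- str_sub(x, -7, -1)   # negative positions count from the end

first_word    # "The"
second_word   # "birch"
last_seven    # "planks."
```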
4. Case conversion
# Convert everything to uppercase
str_to_upper(x2)
# Convert everything to lowercase
str_to_lower(x2)
# Capitalize the first letter of each word
str_to_title(x2)
5. Sorting strings
x2
str_sort(x2)
The elements are sorted alphabetically (str_sort uses the English locale by default).
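The sort direction can be reversed with the decreasing argument. A minimal sketch on a hypothetical all-lowercase vector `words`, chosen so that letter case does not affect the order:

```r
library(stringr)

# `words` is a hypothetical example vector, all lowercase
words <- c("slid", "birch", "canoe")

ascending  <- str_sort(words)                     # alphabetical order
descending <- str_sort(words, decreasing = TRUE)  # reverse alphabetical order

ascending    # "birch" "canoe" "slid"
descending   # "slid"  "canoe" "birch"
```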
6. Detecting patterns
str_detect(x2,"h")
str_starts(x2,"T")
str_ends(x2,"e")
Combined with sum and mean, str_detect gives the number and the proportion of matches
str_detect(x2,"h")
sum(str_detect(x2,"h"))
mean(str_detect(x2,"h"))
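Why does mean(str_detect(x2,"h")) return 0.5? The logical vector returned by str_detect(x2,"h") is first converted to a numeric vector (TRUE becomes 1, FALSE becomes 0). Four of the eight values are 1, and 4/8 = 0.5, so TRUE makes up 50% of the values; in other words, half of the elements of x2 contain an h.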
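The arithmetic behind that 0.5 can be spelled out step by step. This sketch rebuilds x2 from the original sentence and makes the logical-to-numeric conversion explicit:

```r
library(stringr)

x2 <- str_split("The birch canoe slid on the smooth planks.", " ")[[1]]

hits <- str_detect(x2, "h")   # TRUE where the word contains an "h"
as.numeric(hits)              # 1 1 0 0 0 1 1 0

n_hits  <- sum(hits)          # number of TRUE values: 4
n_total <- length(hits)       # number of words: 8
n_hits / n_total              # 0.5, the same value mean(hits) returns
```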
7. Extracting the matching strings
x2
# Method 1
str_subset(x2,"h")
# Method 2
x2[str_detect(x2,"h")]
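The two methods should return the same vector, which can be checked with identical(). The sketch also shows the negate argument for keeping the non-matching elements; this assumes stringr 1.4.0 or later, where that argument is available.

```r
library(stringr)

x2 <- str_split("The birch canoe slid on the smooth planks.", " ")[[1]]

by_subset <- str_subset(x2, "h")       # method 1
by_index  <- x2[str_detect(x2, "h")]   # method 2

identical(by_subset, by_index)         # TRUE
by_subset                              # "The" "birch" "the" "smooth"

# Assumes stringr >= 1.4.0: keep the elements that do NOT match
without_h <- str_subset(x2, "h", negate = TRUE)
without_h                              # "canoe" "slid" "on" "planks."
```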
8. Counting characters
x
str_count(x," ")
This counts the spaces in x: there are 7.
x2
str_count(x2,"o")
This gives the number of o's in each element of x2.
str_count(x)
length(x)
x
str_count(x2)
length(x2)
x2
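When str_count() is called without a pattern, it falls back on its default pattern, the empty string, which counts every character. The result therefore matches str_length(), while length() still counts vector elements. A short sketch of that comparison:

```r
library(stringr)

x  <- "The birch canoe slid on the smooth planks."
x2 <- str_split(x, " ")[[1]]

chars_in_x  <- str_count(x)    # characters in the whole sentence: 42
chars_in_x2 <- str_count(x2)   # characters per word: 3 5 5 4 2 3 6 7

chars_in_x == str_length(x)    # TRUE: same count as str_length()
length(x)                      # 1: x is a vector with a single element
length(x2)                     # 8: x2 holds eight words
```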
9. Replacing strings
x2
str_replace(x2,"o","A")
str_replace_all(x2,"o","A")
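The difference between the two functions only shows up in elements with more than one match, such as "smooth". A minimal sketch on that single word:

```r
library(stringr)

word <- "smooth"

first_only <- str_replace(word, "o", "A")      # replaces only the first match
every_one  <- str_replace_all(word, "o", "A")  # replaces every match

first_only   # "smAoth"
every_one    # "smAAth"
```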
------------------------------------------Exercise----------------------------------------
#Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community.
#1. Assign the sentence above, as one long string, to tmp
tmp = "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."
#2. Split it into a vector of words and assign it to tmp2 (mind the punctuation)
library(stringr)
tmp2 = tmp %>%
  str_replace(","," ") %>%
  str_remove("[.]") %>%
  str_split(" ")
tmp2 = tmp2[[1]]
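As an alternative sketch (not the original author's solution), the comma and the space can be handled in one step by splitting on the character class "[ ,]". The name `tmp2_alt` is introduced here only for illustration.

```r
library(stringr)

tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."

# Remove the final period, then split on either a space or a comma
tmp2_alt <- str_split(str_remove(tmp, "[.]"), "[ ,]")[[1]]

length(tmp2_alt)   # 16 words
tmp2_alt
```

Splitting on a character class avoids the separate replacement step, at the cost of a slightly less obvious pattern.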
Reference: 生信技能树, teacher 小洁