Computing String Similarity in R

Author: 可能性之兽 | Published: 2022-11-14 08:57

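String similarity measures are all built on some notion of distance, and implementing them is not very complicated. A long time ago I wrote a post on Jianshu about how to compute distances by hand, but when a package already exists, why reinvent the wheel?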
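To make "not very complicated" concrete, here is a minimal base R sketch of the Levenshtein distance using the standard dynamic-programming table. The helper name `lev_dist` is hypothetical and is not part of any package; it is an illustration, not how the packages below implement it.

```r
# Levenshtein distance via dynamic programming, base R only.
# lev_dist is a hypothetical helper name used for illustration.
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else 1
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1,  # deletion
        d[i + 1, j] + 1,  # insertion
        d[i, j] + cost    # substitution, or a free match
      )
    }
  }
  d[n + 1, m + 1]
}

lev_dist("kitten", "sitting")  # 3
lev_dist("ca", "abc")          # 3
```

The double loop is slow for long strings or large vectors, which is a practical reason to prefer the compiled implementations in existing packages.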

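To compute string similarity you can use the adist function from the utils package or the stringDist function from the MKmisc package. The RecordLinkage package also provides string comparison functions such as jarowinkler.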
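As a quick illustration of the base R option, the sketch below calls `adist` on a small vector and turns the resulting distances into similarities by dividing by the length of the longer string. That normalization is one common convention and an assumption of this example, not something `adist` does itself. The `words` vector is made up for the demonstration.

```r
# adist() ships with base R (package utils) and computes the generalized
# Levenshtein distance. With a single vector it returns a pairwise matrix.
words <- c("kitten", "sitting", "mitten")
d <- adist(words)

# Convert distances to similarities in [0, 1] by dividing each distance
# by the length of the longer string of the pair (an assumed convention).
len <- nchar(words)
sim <- 1 - d / outer(len, len, pmax)

d    # pairwise edit distances
sim  # 1 on the diagonal, smaller values for less similar pairs
```

The stringdist package covered below offers a `stringsim` function for the same purpose, so the manual normalization is only needed when staying in base R.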

This post focuses on the stringdist and stringdistmatrix functions from the stringdist package.

stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))

stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))

# Load the package first (install once with install.packages("stringdist"))
library(stringdist)

# Simple example using optimal string alignment
stringdist("ca","abc")

# computing a 'dist' object
d <- stringdistmatrix(c('foo','bar','boo','baz'))
# try plot(hclust(d))

# The following gives a matrix
stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))

# An example using Damerau-Levenshtein distance (multiple editing of substrings allowed)
stringdist("ca","abc",method="dl")

# string distance matching is case sensitive:
stringdist("ABC","abc")

# so you may want to normalize a bit:
stringdist(tolower("ABC"),"abc")

# stringdist recycles the shortest argument:
stringdist(c('a','b','c'),c('a','c'))

# stringdistmatrix gives the distance matrix (by default for optimal string alignment):
stringdistmatrix(c('a','b','c'),c('a','c'))

# different edit operations may be weighted (order: d, i, s, t); e.g. weighted transposition:
stringdist('ab','ba',weight=c(1,1,1,0.5))

# Non-unit weights for insertion and deletion make the distance metric asymmetric
stringdist('ca','abc')
stringdist('abc','ca')
stringdist('ca','abc',weight=c(0.5,1,1,1))
stringdist('abc','ca',weight=c(0.5,1,1,1))

# Hamming distance is undefined for
# strings of unequal lengths so stringdist returns Inf
stringdist("ab","abc",method="h")
# For strings of equal length it counts the number of unequal characters as they occur
# in the strings from beginning to end
stringdist("hello","HeLl0",method="h")

# The lcs distance returns the number of characters that are not part of the lcs.
# stringdist calls the lcs the "longest common substring"; it is the longest common
# subsequence, since the paired characters need not be adjacent.
#
# Here, the lcs is either 'a' or 'b' and one character in each string cannot be paired:
stringdist('ab','ba',method="lcs")
# Here the lcs is 'surey' and 'v', 'g' and one 'r' of 'surgery' are not paired
stringdist('survey','surgery',method="lcs")


# q-grams are based on the difference between occurrences of q consecutive characters
# in string a and string b.
# Since each of the characters a, b and c occurs once in both 'abc' and 'cba', the q=1 distance equals 0:
stringdist('abc','cba',method='qgram',q=1)

# since the first string consists of 'ab','bc' and the second 
# of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
stringdist('abc','cba',method='qgram',q=2)

# Wikipedia has the following example of the Jaro-distance. 
stringdist('MARTHA','MATHRA',method='jw')
# Note that stringdist gives a _distance_ where Wikipedia gives the corresponding
# _similarity measure_. To convert the distance to a similarity:
1 - stringdist('MARTHA','MATHRA',method='jw')

# The corresponding Jaro-Winkler distance can be computed by setting p=0.1
stringdist('MARTHA','MATHRA',method='jw',p=0.1)
# or, as a similarity measure
1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)

# This gives distance 1 since Euler and Gauss translate to different soundex codes.
stringdist('Euler','Gauss',method='soundex')
# Euler and Ellery translate to the same code and have distance 0
stringdist('Euler','Ellery',method='soundex')
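The q-gram examples above can be reproduced by hand. The following base R sketch counts the q-grams of each string and sums the absolute differences of the counts, which is the definition given in the comments. The helpers `qgrams` and `qgram_dist` are hypothetical names for illustration and are not the stringdist implementation.

```r
# Extract all substrings of q consecutive characters from s.
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  substring(s, 1:(n - q + 1), q:n)
}

# q-gram distance: sum of absolute differences between q-gram counts.
qgram_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  keys <- union(ga, gb)
  # Count each q-gram over a shared set of levels so the counts line up.
  ca <- as.vector(table(factor(ga, levels = keys)))
  cb <- as.vector(table(factor(gb, levels = keys)))
  sum(abs(ca - cb))
}

qgram_dist("abc", "cba", q = 1)  # 0: same characters, same counts
qgram_dist("abc", "cba", q = 2)  # 4: 'ab','bc' versus 'cb','ba'
```

Because only counts are compared, the q-gram distance ignores where in the string each q-gram occurs, which is why two different strings can have distance 0.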




markvanderloo/stringdist: String distance functions for R (github.com)
