前缀树

作者: 尽情的嘲笑我吧 | 来源:发表于2018-12-15 17:28 被阅读0次

    最近看代码,发现了一个敏感词检测是用前缀树写的,看起来速度蛮快,毕竟是拿空间换时间,LOG倍速。但是缺点也很明显,待检测文本需要与敏感词词库中的值完全匹配。所以对于简短的词法比较合适。

    原理:

    1. 每一个节点可以有多个子节点
    2. 节点“存储”字符, 节点与节点之间的连线自动形成单词。 如a节点与d节点,之间的连线就是单词 ad
    3. 节点可能是叶子节点,此时也是一个单词的“终点”,否则是其他拥有相同前缀的节点的“过客”, wordcount要加一。
    4. 删除一个单词,则对应节点上的“过客”都要减一,直至减至叶子节点。
    # coding: utf8
    MAX_TREE_WIDTH = 26
    INIT_CHAR = 'a'
    forbiddenwords = """
    fuck
    fucker
    damn
    silly
    """
    class TrieNode(object):
        def __init__(self):
            self.nodes = [None] * MAX_TREE_WIDTH
            self.wordcount = 0
            self.isend = 0
    
    class TrieTree(object):
        def __init__(self):
            self.root = TrieNode()
    
        def add(self, word):
            word = word.lower()
            curnode = self.root
            for char in word:
                index = ord(char) - ord(INIT_CHAR)
                if curnode.nodes[index] is None:
                    curnode.nodes[index] = TrieNode()
                curnode = curnode.nodes[index]
                curnode.wordcount += 1
            curnode.isend = 1
        
        def search(self, word):
            word = word.lower()
            curnode = self.root
            for char in word:
                index = ord(char) - ord(INIT_CHAR)
                if curnode.nodes[index] is None:
                    return -1
                curnode = curnode.nodes[index]
            return curnode.wordcount
        
        def countstartswith(self, prefix):
            curnode = self.root
            for char in prefix:
                index = ord(char) - ord(INIT_CHAR)
                if curnode.nodes[index] is None:
                    return -1
                curnode = curnode.nodes[index]
            return curnode.wordcount
        
        def delete(self, word):
            if self.countstartswith(word) > 0:
                curnode = self.root
                for char in word:
                    index = ord(char) - ord(INIT_CHAR)
                    curnode.nodes[index].wordcount -= 1
                    if curnode.nodes[index].wordcount == 0:
                        curnode.nodes[index] = None
                        return
                    curnode = curnode.nodes[index]
                curnode.isend = 0
    
    if __name__ == "__main__":
        print("hello trie tree.")
        tree = TrieTree()
        tree.add("h")
        tree.add("He")
        tree.add("hEl")
        tree.add("helL")
        tree.add("hello")
         print(tree.search('he'))
         print(tree.countstartswith("h"))
        print(tree.countstartswith("hel"))
        tree.delete("hel")
        print(tree.countstartswith("hel"))
        print(tree.countstartswith("guo"))
         words = [item for item in forbiddenwords.strip("").split("\n") if item != '']
         for word in words:
             print("adding word: {}".format(word))
             tree.add(word)
         # test sequence
         teststr = "you a silly mother fucker"
         tests = teststr.split(" ")
         for test in tests:
             print("{} --> {}".format(test, tree.search(test)))
    

    运行结果:

    hello trie tree.
    4
    5
    3
    2
    -1
    adding word: fuck
    adding word: fucker
    adding word: damn
    adding word: silly
    you --> -1
    a --> -1
    silly --> 1
    mother --> -1
    fucker --> 1
    

    相较于之前 基于朴素贝叶斯分类算法 的golang 与PHP结合的例子,效率,准确性,适用性都很逊。这也跟场景息息相关,选对了场景,前缀树会给出一个令人满意的答复的。

    相关文章

      网友评论

          本文标题:前缀树

          本文链接:https://www.haomeiwen.com/subject/favahqtx.html