
Domain-Specific English Phrase Segmentation: Borrowing Chinese Forward, Backward, and Bidirectional Maximum Matching

Author: IntoTheVoid | Published on 2020-01-18 15:37

    Many domains have their own specialized terms, for example IUPAC chemical nomenclature in chemistry or product names in e-commerce. If such text is simply split on whitespace, as is usual for English, the resulting tokens often cause trouble for downstream tasks, because a phrase that has been split incorrectly no longer carries its intended meaning. It is therefore common to rely on a domain dictionary to group the words of unstructured text into phrases.

    For Chinese word segmentation, maximum matching is a standard way to split text into phrases, and the idea carries over directly: split the English text on whitespace, treat every word as the equivalent of a single Chinese character, and use a domain dictionary to perform maximum-matching segmentation. The concrete implementation follows below.

    • Dictionary format
    machine learning
    is
    the
    science 
    of
    getting
    computers
    to
    act
    without
    being
    explicitly programmed
    a
    method 
    data analysis
    that
    automates analytical model building
    an application
    the ability
    
    
    • Forward maximum matching; replace ./words.dic with your own dictionary, a plain-text file with one entry per line
    class ForwardMaxMatch:
        """Forward (left-to-right) maximum matching over whitespace-separated English words."""
        def __init__(self, dic='./words.dic', max_len=5):
            self.max_len = max_len      # longest phrase length to try, counted in words
            self.dic_file = dic
            self.dic = self.load_dic()
            self.tokens = []
            
        def load_dic(self):
            # One phrase per line; a set makes the membership tests below O(1).
            with open(self.dic_file, 'r', encoding='utf-8') as f:
                return {line.strip() for line in f if line.strip()}
        
        def segment(self, word_obj):
            if isinstance(word_obj, str):
                # Separate periods and commas so punctuation becomes its own token.
                character_map = {".": " . ",
                                 ",": " , "}
                for origin, new in character_map.items():
                    word_obj = word_obj.replace(origin, new)
                self.words_list = word_obj.lower().split()
            elif isinstance(word_obj, list):
                self.words_list = word_obj
            else:
                raise TypeError("Unsupported object type for segmentation!")
            i = 0
            tokens = []
            while i < len(self.words_list):
                maxWords = []
                # Try the longest window starting at position i first, then shrink it.
                reverse = self.max_len + i
                while reverse > i:
                    grams = self.words_list[i: reverse]
                    reverse -= 1
                    tempWords = ' '.join(grams)
                    if tempWords in self.dic:
                        maxWords = grams
                        break
                if maxWords:
                    i += len(maxWords)
                    tokens.append(' '.join(maxWords))
                else:
                    # No dictionary phrase starts here: keep the single word as a token.
                    tokens.append(self.words_list[i])
                    i += 1
            self.tokens = tokens
            return tokens
        
        def __call__(self, word_obj):
            return self.segment(word_obj)
        
        def __repr__(self):
            # __repr__ must return a string, not a list.
            return str(self.tokens)
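
    • Usage sketch (forward matching): a minimal example, assuming the sample entries above are stored in ./words.dic; the snippet writes that dictionary first so it is self-contained, and the expected tokens appear in the trailing comment.
    entries = ["machine learning", "is", "the", "science", "of", "getting",
               "computers", "to", "act", "without", "being",
               "explicitly programmed", "a", "method", "data analysis",
               "that", "automates analytical model building",
               "an application", "the ability"]
    with open('./words.dic', 'w', encoding='utf-8') as f:
        f.write('\n'.join(entries))

    fmm = ForwardMaxMatch(dic='./words.dic', max_len=5)
    print(fmm("Machine learning is the science of getting computers to act "
              "without being explicitly programmed."))
    # Expected: ['machine learning', 'is', 'the', 'science', 'of', 'getting',
    #            'computers', 'to', 'act', 'without', 'being',
    #            'explicitly programmed', '.']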
    
    • Backward maximum matching; replace ./words.dic with your own dictionary, a plain-text file with one entry per line
    class BackwardMaxMatch:
        """Backward (right-to-left) maximum matching over whitespace-separated English words."""
        def __init__(self, dic='./words.dic', max_len=5):
            self.max_len = max_len      # longest phrase length to try, counted in words
            self.dic_file = dic
            self.dic = self.load_dic()
            self.tokens = []
            
        def load_dic(self):
            # One phrase per line; a set makes the membership tests below O(1).
            with open(self.dic_file, 'r', encoding='utf-8') as f:
                return {line.strip() for line in f if line.strip()}
        
        def segment(self, word_obj):
            if isinstance(word_obj, str):
                # Separate periods and commas so punctuation becomes its own token.
                character_map = {".": " . ",
                                 ",": " , "}
                for origin, new in character_map.items():
                    word_obj = word_obj.replace(origin, new)
                self.words_list = word_obj.lower().split()
            elif isinstance(word_obj, list):
                self.words_list = word_obj
            else:
                raise TypeError("Unsupported object type for segmentation!")
            # i is the exclusive right edge of the current window; scan right to left.
            i = len(self.words_list)
            tokens = []
            while i > 0:
                maxWords = []
                start = max(0, i - self.max_len)
                # Try the longest window ending at position i first, then shrink it from the left.
                while start < i:
                    grams = self.words_list[start: i]
                    start += 1
                    tempWords = ' '.join(grams)
                    if tempWords in self.dic:
                        maxWords = grams
                        break
                if maxWords:
                    i -= len(maxWords)
                    tokens.append(' '.join(maxWords))
                else:
                    # No dictionary phrase ends here: keep the single word as a token.
                    tokens.append(self.words_list[i - 1])
                    i -= 1
            # Tokens were collected right to left; restore reading order.
            self.tokens = list(reversed(tokens))
            return self.tokens
        
        def __call__(self, word_obj):
            return self.segment(word_obj)
        
        def __repr__(self):
            # __repr__ must return a string, not a list.
            return str(self.tokens)
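
    • Usage sketch (backward matching): a minimal example, assuming ./words.dic already contains the sample entries above; with this small dictionary the backward result happens to agree with the forward one on this sentence.
    bmm = BackwardMaxMatch(dic='./words.dic', max_len=5)
    print(bmm("Machine learning is a method of data analysis that "
              "automates analytical model building."))
    # Expected: ['machine learning', 'is', 'a', 'method', 'of', 'data analysis',
    #            'that', 'automates analytical model building', '.']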
    
    • Bidirectional maximum matching

    ————————————————
    Segmentation goal:

    Compare the result of the forward maximum matching algorithm with that of the backward maximum matching algorithm to decide which segmentation is correct.

    Algorithm flow:

    • Compare the forward and backward maximum matching results.
    • If the two results contain different numbers of tokens, return the one with fewer tokens.
    • If the two results contain the same number of tokens:
        • If the segmentations are identical, return either one.
        • If the segmentations differ, return the one with fewer single-word tokens.
    Original link: https://blog.csdn.net/selinda001/article/details/79345072
    ————————————————

    def BidirectMaxMatch(string):
        back_tokenizer = BackwardMaxMatch()
        back_tokens = back_tokenizer.segment(string)
        forward_tokenizer = ForwardMaxMatch()
        forward_tokens = forward_tokenizer.segment(string)

        if len(back_tokens) == len(forward_tokens):
            if back_tokens == forward_tokens:
                return back_tokens
            else:
                # Same token count but different tokens: prefer the result with fewer single-word tokens.
                back_single_w_cnt = sum(1 for bt in back_tokens if len(bt.split(' ')) == 1)
                forward_single_w_cnt = sum(1 for ft in forward_tokens if len(ft.split(' ')) == 1)
                if back_single_w_cnt < forward_single_w_cnt:
                    return back_tokens
                else:
                    return forward_tokens
        else:
            # Different token counts: the segmentation with fewer tokens wins.
            if len(back_tokens) < len(forward_tokens):
                return back_tokens
            else:
                return forward_tokens
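
    • Usage sketch (bidirectional): a minimal example, assuming ./words.dic exists as above; when both directions agree, either result is returned unchanged, otherwise the rules quoted above decide between them.
    tokens = BidirectMaxMatch("Machine learning is the science of getting "
                              "computers to act without being explicitly programmed.")
    print(tokens)
    # Expected: ['machine learning', 'is', 'the', 'science', 'of', 'getting',
    #            'computers', 'to', 'act', 'without', 'being',
    #            'explicitly programmed', '.']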
    
