n-gram (n元语法)

Author: 乐猿 | Published 2017-04-18 16:19

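    If you are new to NLP, or have not even started yet, one concept you will keep running into while searching for material is the n-gram, and the bigram in particular comes up most often. Understanding it will save you a lot of trouble.
    Wikipedia's definition: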

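    An n-gram (Chinese: n元语法) is a run of n consecutive words in a text. An n-gram model is a probabilistic language model based on a Markov chain of order (n-1); it uses the probability of n words occurring together to infer the structure of a sentence.
    For n = 1, 2 and 3, the n-gram is called a unigram, a bigram and a trigram, respectively.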
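
    To make "based on a Markov chain of order (n-1)" concrete, the standard approximation can be written as follows. The symbols w_1 ... w_m (the words of a sentence of length m) are notation introduced here for illustration and do not appear in the quoted definition. Each word is assumed to depend only on the n-1 words before it:

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

    For a bigram model (n = 2) each factor reduces to P(w_i | w_{i-1}), so the sentence probability is a product over adjacent word pairs.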

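    So the concept itself is very simple: list every run of n consecutive words in the text.
    Example:
    Text: 我是一个好人 ("I am a good person")
    Segment it into words first: 我 是 一个 好人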
    unigram:
    我
    是
    一个
    好人

    bigram:
    我 是
    是 一个
    一个 好人

    trigram:
    我 是 一个
    是 一个 好人
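
    The lists above can be reproduced with a few lines of Python. This is only a minimal sketch of the sliding-window idea without padding; the helper name `sliding_ngrams` is made up for this illustration and is not part of any library.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, so a list of m tokens yields m - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["我", "是", "一个", "好人"]  # the segmented example sentence

unigrams = sliding_ngrams(tokens, 1)
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)

for name, grams in [("unigram", unigrams), ("bigram", bigrams), ("trigram", trigrams)]:
    print(name, [" ".join(g) for g in grams])
```

    A sequence of m words yields m - n + 1 n-grams, which is why the four-word sentence gives three bigrams and two trigrams.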

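    You may ask: what happens at the end of the text, where fewer than n words remain? In that case it is up to you to decide whether to pad the sequence on the left or on the right.
    nltk's implementation is quite good and worth reading; the relevant code is excerpted here.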

    from itertools import chain  # required by pad_sequence; the import was missing from the excerpt

    # This function performs the padding
    def pad_sequence(sequence, n, pad_left=False, pad_right=False,
                     left_pad_symbol=None, right_pad_symbol=None):
        """
        Returns a padded sequence of items before ngram extraction.
            >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
            ['<s>', 1, 2, 3, 4, 5, '</s>']
            >>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
            ['<s>', 1, 2, 3, 4, 5]
            >>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
            [1, 2, 3, 4, 5, '</s>']

        :param sequence: the source data to be padded
        :type sequence: sequence or iter
        :param n: the degree of the ngrams
        :type n: int
        :param pad_left: whether the ngrams should be left-padded
        :type pad_left: bool
        :param pad_right: whether the ngrams should be right-padded
        :type pad_right: bool
        :param left_pad_symbol: the symbol to use for left padding (default is None)
        :type left_pad_symbol: any
        :param right_pad_symbol: the symbol to use for right padding (default is None)
        :type right_pad_symbol: any
        :rtype: sequence or iter
        """
        sequence = iter(sequence)
        # Each side receives n-1 padding symbols
        if pad_left:
            sequence = chain((left_pad_symbol,) * (n-1), sequence)
        if pad_right:
            sequence = chain(sequence, (right_pad_symbol,) * (n-1))
        return sequence
    
    
    
    def ngrams(sequence, n, pad_left=False, pad_right=False,
           left_pad_symbol=None, right_pad_symbol=None):
        """
        Return the ngrams generated from a sequence of items, as an iterator.
        For example:
            >>> from nltk.util import ngrams
            >>> list(ngrams([1,2,3,4,5], 3))
            [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
        Wrap with list for a list version of this function.  Set pad_left
        or pad_right to true in order to get additional ngrams:
            >>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
            [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
            >>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
            [(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
            >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
            [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
            >>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
            [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
        :param sequence: the source data to be converted into ngrams
        :type sequence: sequence or iter
        :param n: the degree of the ngrams
        :type n: int
        :param pad_left: whether the ngrams should be left-padded
        :type pad_left: bool
        :param pad_right: whether the ngrams should be right-padded
        :type pad_right: bool
        :param left_pad_symbol: the symbol to use for left padding (default is None)
        :type left_pad_symbol: any
        :param right_pad_symbol: the symbol to use for right padding (default is None)
        :type right_pad_symbol: any
        :rtype: sequence or iter
        """
        sequence = pad_sequence(sequence, n, pad_left, pad_right,
                            left_pad_symbol, right_pad_symbol)
    
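        history = []
        while n > 1:
            # Under PEP 479 (Python 3.7+), a StopIteration escaping from next()
            # inside a generator becomes a RuntimeError, so catch it explicitly.
            try:
                next_item = next(sequence)
            except StopIteration:
                # The sequence is shorter than n-1 items: there are no ngrams to yield
                return
            history.append(next_item)
            n -= 1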
        for item in sequence:
            history.append(item)
            yield tuple(history)
            del history[0]
    
    
    def bigrams(sequence, **kwargs):
        """
        Return the bigrams generated from a sequence of items, as an iterator.
        For example:
            >>> from nltk.util import bigrams
            >>> list(bigrams([1,2,3,4,5]))
            [(1, 2), (2, 3), (3, 4), (4, 5)]
        Use bigrams for a list version of this function.
        :param sequence: the source data to be converted into bigrams
        :type sequence: sequence or iter
        :rtype: iter(tuple)
        """
    
        for item in ngrams(sequence, 2, **kwargs):
            yield item
    
    def trigrams(sequence, **kwargs):
        """
        Return the trigrams generated from a sequence of items, as an iterator.
        For example:
            >>> from nltk.util import trigrams
            >>> list(trigrams([1,2,3,4,5]))
            [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
        Use trigrams for a list version of this function.
        :param sequence: the source data to be converted into trigrams
        :type sequence: sequence or iter
        :rtype: iter(tuple)
        """
    
        for item in ngrams(sequence, 3, **kwargs):
            yield item
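
    To see what padding does to the example sentence from earlier, here is a small self-contained sketch. It does not call nltk; `padded_ngrams` is a hypothetical helper written for this illustration, and its default symbols `<s>` and `</s>` are an assumption (the nltk code above defaults to None). Like the nltk code, it adds n-1 symbols on each padded side.

```python
def padded_ngrams(tokens, n, pad_left=False, pad_right=False,
                  left_pad_symbol="<s>", right_pad_symbol="</s>"):
    """Hypothetical helper: pad with n-1 symbols per side, then slide a window of size n."""
    seq = list(tokens)
    if pad_left:
        seq = [left_pad_symbol] * (n - 1) + seq
    if pad_right:
        seq = seq + [right_pad_symbol] * (n - 1)
    # One n-gram starts at every position that still has n items after it
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


tokens = ["我", "是", "一个", "好人"]

right_padded_bigrams = padded_ngrams(tokens, 2, pad_right=True)
both_padded_trigrams = padded_ngrams(tokens, 3, pad_left=True, pad_right=True)

for gram in right_padded_bigrams:
    print(" ".join(gram))
```

    With right padding the last word gets a bigram of its own (好人 </s>), and padding both sides of a trigram extraction turns the original two trigrams into six.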

