Getting Started with Python: NLTK (2) POS Tag, Stemmin

Author: 不务正业的Yuez | Published: 2016-05-17 08:34 | Views: 3127

    Common operations

    1. Part-Of-Speech Tagging and POS Tagger
      POS tagging labels each word in a text with its part of speech. In NLTK it is used as follows:
    >>> import nltk
    >>> text = nltk.word_tokenize("Dive into NLTK: Part-of-speech tagging and POS Tagger")
    >>> text
    ['Dive', 'into', 'NLTK', ':', 'Part-of-speech', 'tagging', 'and', 'POS', 'Tagger']
    >>> nltk.pos_tag(text)
    [('Dive', 'JJ'), ('into', 'IN'), ('NLTK', 'NNP'), (':', ':'), ('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('and', 'CC'), ('POS', 'NNP'), ('Tagger', 'NNP')]
    

    Note that the text is tokenized into words first, and only then POS-tagged. NLTK provides documentation for every tag, which can be displayed as follows:

    >>> nltk.help.upenn_tagset('JJ')
    >>> nltk.help.upenn_tagset('IN')
    >>> nltk.help.upenn_tagset('NNP')
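Once a sentence is tagged, the (word, tag) pairs are ordinary Python tuples, so selecting words by tag is a list comprehension. A minimal sketch, reusing the tagged output from the `pos_tag` example as literal data; the helper name `words_with_tag_prefix` is an illustrative assumption, not part of NLTK.

```python
# Tagged output copied from the pos_tag example (not recomputed here).
tagged = [
    ("Dive", "JJ"), ("into", "IN"), ("NLTK", "NNP"), (":", ":"),
    ("Part-of-speech", "JJ"), ("tagging", "NN"), ("and", "CC"),
    ("POS", "NNP"), ("Tagger", "NNP"),
]


def words_with_tag_prefix(pairs, prefix):
    """Return the words whose Penn Treebank tag starts with `prefix`."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


# The prefix "NN" matches NN, NNS, NNP and NNPS alike.
nouns = words_with_tag_prefix(tagged, "NN")
print(nouns)
```

Matching on a prefix rather than an exact tag keeps singular, plural and proper nouns together, which is usually what a keyword-extraction step wants.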
    

    Beyond that, NLTK can tag several sentences in one call. Older releases exposed this as nltk.batch_pos_tag; in NLTK 3 the function is named nltk.pos_tag_sents:

    >>> nltk.pos_tag_sents([['this', 'is', 'batch', 'tag', 'test'], ['nltk', 'is', 'text', 'analysis', 'tool']])
    [[('this', 'DT'), ('is', 'VBZ'), ('batch', 'NN'), ('tag', 'NN'), ('test', 'NN')], [('nltk', 'NN'), ('is', 'VBZ'), ('text', 'JJ'), ('analysis', 'NN'), ('tool', 'NN')]]
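Tagger objects expose a batch interface as well. A minimal sketch, assuming NLTK 3 or later, where the method is named `tag_sents`; a `DefaultTagger` is used only because it runs without downloading a model, so the tags it produces are not meaningful.

```python
from nltk.tag import DefaultTagger

# DefaultTagger assigns the same tag to every token and needs no model files.
baseline = DefaultTagger("NN")

sentences = [
    ["this", "is", "batch", "tag", "test"],
    ["nltk", "is", "text", "analysis", "tool"],
]

# tag_sents takes a list of tokenized sentences and tags each one.
tagged_sents = baseline.tag_sents(sentences)
for sent in tagged_sents:
    print(sent)
```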
    

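    NLTK also ships pre-trained POS tagging models under nltk_data/taggers. In the version used here the default model is maxent_treebank_pos_tagger (newer releases default to an averaged perceptron tagger), and the code that loads it lives in nltk-master/nltk/tag/__init__.py. Other models can be trained as well, such as CRF, HMM, Brill and TnT, and NLTK provides interfaces to the Stanford POS tagger, the HunPos POS tagger and the Senna POS taggers. Training a model looks like this:
    Split the training data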

    >>> from nltk.corpus import treebank
    >>> len(treebank.tagged_sents())
    3914
    >>> train_data = treebank.tagged_sents()[:3000]
    >>> test_data = treebank.tagged_sents()[3000:]
    

    Train the model

    >>> from nltk.tag import tnt
    >>> tnt_pos_tagger = tnt.TnT()
    >>> tnt_pos_tagger.train(train_data)
    >>> tnt_pos_tagger.evaluate(test_data)
    0.8755881718109216
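The score returned by `evaluate` is, as far as I understand it, token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. Recent NLTK releases expose the same measure as `accuracy` and deprecate `evaluate`. A sketch of the computation in plain Python; `tagging_accuracy` and the toy tagger are hypothetical names for illustration, not NLTK functions.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Toy tagger (hypothetical): labels every word as a noun.
def always_noun(words):
    return [(word, "NN") for word in words]


gold = [
    [("this", "DT"), ("is", "VBZ"), ("test", "NN")],
    [("tool", "NN")],
]
score = tagging_accuracy(always_noun, gold)
print(score)
```

Two of the four gold tokens are nouns, so the toy tagger gets half of them right.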
    

    Save the model

    >>> import pickle
    >>> f = open('tnt_treebank_pos_tagger.pickle', 'wb')  # pickle writes bytes, so open in binary mode
    >>> pickle.dump(tnt_pos_tagger, f)
    >>> f.close()
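Pickle files are binary, so the file is best opened in binary mode, and a `with` block closes it even if dumping fails. A sketch of a save and reload round trip; a small `UnigramTagger` trained on two hand-written sentences stands in for the TnT tagger, and the file goes to a temporary directory. Both are assumptions made so the example runs without corpus downloads.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Tiny hand-written training set (illustrative, not taken from the treebank).
train = [
    [("this", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN")],
    [("nltk", "NN"), ("is", "VBZ"), ("a", "DT"), ("tool", "NN")],
]
tagger = UnigramTagger(train)

path = os.path.join(tempfile.mkdtemp(), "toy_pos_tagger.pickle")

# Save: "wb" because pickle writes bytes.
with open(path, "wb") as f:
    pickle.dump(tagger, f)

# Reload: "rb" for the same reason.
with open(path, "rb") as f:
    restored = pickle.load(f)

result = restored.tag(["this", "is", "a", "tool"])
print(result)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.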
    

    Apply the model

    >>> tnt_pos_tagger.tag(nltk.word_tokenize("this is a tnt treebank tnt tagger"))
    [('this', u'DT'), ('is', u'VBZ'), ('a', u'DT'), ('tnt', 'Unk'), ('treebank', 'Unk'), ('tnt', 'Unk'), ('tagger', 'Unk')]
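The `Unk` tags mark words the tagger never saw during training. One common NLTK idiom for handling them, shown here with a `UnigramTagger` rather than TnT, is to chain a backoff tagger that supplies a default tag. This is a sketch on hand-written data; as far as I know the `TnT` constructor accepts a similar fallback through its `unk` argument, but that is not exercised here.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hand-written training sentence (illustrative only).
train = [
    [("this", "DT"), ("is", "VBZ"), ("a", "DT"), ("tagger", "NN")],
]

# Words missing from the training data fall through to the backoff tagger.
fallback = DefaultTagger("NN")
tagger = UnigramTagger(train, backoff=fallback)

tagged = tagger.tag(["this", "is", "a", "tnt", "treebank", "tagger"])
print(tagged)
```

Guessing "noun" for every unknown word is crude, but it usually scores better than leaving the token untagged.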
    
    2. Stemming
      As I understand it, stemming extracts the stem of a word, while lemmatization reduces a word to its dictionary form (its lemma).
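The difference can be made concrete with a toy version of each idea. The sketch below is deliberately simplistic and is not how NLTK implements either one: the suffix list and the lookup table are hypothetical, chosen only to show that a stem is produced by rules and need not be a real word, while a lemma comes from a vocabulary.

```python
# Toy illustration only: real stemmers and lemmatizers are far more complete.
SUFFIXES = ("ing", "ly", "es", "s")                      # hypothetical rule list
LEMMAS = {"are": "be", "is": "be", "abaci": "abacus"}    # hypothetical lookup


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is in the lookup table."""
    return LEMMAS.get(word, word)


print(toy_stem("presumably"), toy_lemmatize("abaci"))
```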
      2.1 Porter Stemmer
    >>> from nltk.stem.porter import PorterStemmer
    >>> porter_stemmer = PorterStemmer()
    >>> porter_stemmer.stem('maximum')
    u'maximum'
    >>> porter_stemmer.stem('presumably')
    u'presum'
    >>> porter_stemmer.stem('multiply')
    u'multipli'
    >>> porter_stemmer.stem('provision')
    u'provis'
    >>> porter_stemmer.stem('owed')
    u'owe'
    >>> porter_stemmer.stem('ear')
    u'ear'
    >>> porter_stemmer.stem('saying')
    u'say'
    >>> porter_stemmer.stem('crying')
    u'cri'
    >>> porter_stemmer.stem('string')
    u'string'
    >>> porter_stemmer.stem('meant')
    u'meant'
    >>> porter_stemmer.stem('cement')
    u'cement'
    

    2.2 Lancaster Stemmer

    >>> from nltk.stem.lancaster import LancasterStemmer
    >>> lancaster_stemmer = LancasterStemmer()
    >>> lancaster_stemmer.stem('maximum')
    'maxim'
    >>> lancaster_stemmer.stem('presumably')
    'presum'
    >>> lancaster_stemmer.stem('multiply')
    'multiply'
    >>> lancaster_stemmer.stem('provision')
    u'provid'
    >>> lancaster_stemmer.stem('owed')
    'ow'
    >>> lancaster_stemmer.stem('ear')
    'ear'
    >>> lancaster_stemmer.stem('saying')
    'say'
    >>> lancaster_stemmer.stem('crying')
    'cry'
    >>> lancaster_stemmer.stem('string')
    'string'
    >>> lancaster_stemmer.stem('meant')
    'meant'
    >>> lancaster_stemmer.stem('cement')
    'cem'
    

    2.3 Snowball Stemmer

    >>> from nltk.stem import SnowballStemmer
    >>> snowball_stemmer = SnowballStemmer("english")
    >>> snowball_stemmer.stem('maximum')
    u'maximum'
    >>> snowball_stemmer.stem('presumably')
    u'presum'
    >>> snowball_stemmer.stem('multiply')
    u'multipli'
    >>> snowball_stemmer.stem('provision')
    u'provis'
    >>> snowball_stemmer.stem('owed')
    u'owe'
    >>> snowball_stemmer.stem('ear')
    u'ear'
    >>> snowball_stemmer.stem('saying')
    u'say'
    >>> snowball_stemmer.stem('crying')
    u'cri'
    >>> snowball_stemmer.stem('string')
    u'string'
    >>> snowball_stemmer.stem('meant')
    u'meant'
    >>> snowball_stemmer.stem('cement')
    u'cement'
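The three stemmers are easier to compare side by side. A sketch that runs the same words through all of them; the dictionary layout and variable names are my own choices, and as far as I know none of these stemmers needs a corpus download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["maximum", "presumably", "crying", "cement"]

# Map each word to its stem under every stemmer.
table = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in table.items():
    print(f"{word:12} {stems['porter']:10} {stems['lancaster']:10} {stems['snowball']}")
```

In the sessions shown in this article Lancaster is the most aggressive of the three (for example `cement` becomes `cem`), while Porter and Snowball agree on every word tried.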
    
    3. Lemmatization
      Lemmatization in NLTK is implemented on top of WordNet.
    >>> from nltk.stem import WordNetLemmatizer
    >>> wordnet_lemmatizer = WordNetLemmatizer()
    >>> wordnet_lemmatizer.lemmatize('dogs')
    u'dog'
    >>> wordnet_lemmatizer.lemmatize('churches')
    u'church'
    >>> wordnet_lemmatizer.lemmatize('aardwolves')
    u'aardwolf'
    >>> wordnet_lemmatizer.lemmatize('abaci')
    u'abacus'
    >>> wordnet_lemmatizer.lemmatize('hardrock')
    'hardrock'
    >>> wordnet_lemmatizer.lemmatize('are')
    'are'
    >>> wordnet_lemmatizer.lemmatize('is')
    'is'
    

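    The lemmatizer treats every word as a noun by default, which is why 'are' and 'is' come back unchanged. The pos argument can be overridden, for example from the default pos='n' to pos='v':

    >>> wordnet_lemmatizer.lemmatize('is', pos='v')
    u'be'
    >>> wordnet_lemmatizer.lemmatize('are', pos='v')
    u'be'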
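In practice the `pos` argument is usually derived from the POS tagger's output rather than typed by hand. A sketch of a common mapping from Penn Treebank tags to the single-letter WordNet values; `penn_to_wordnet_pos` is a hypothetical helper, not an NLTK function, and each resulting pair would then be passed to `lemmatize(word, pos=pos)`.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet pos letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, which is also the lemmatizer's default


# Tagged tokens written by hand for illustration.
tagged = [("this", "DT"), ("is", "VBZ"), ("text", "JJ"), ("analysis", "NN")]
pos_args = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_args)
```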
    
