NLP Fundamentals: Part-of-Speech Tagging in Practice

Author: yuquanle | Published 2018-11-26 18:16

    Note: please credit the source when reposting, thanks: https://www.jianshu.com/p/904c1833c2fe
    For more of my regularly updated study notes, follow me on:


    Zhihu: https://www.zhihu.com/people/yuquanle/columns
    WeChat official account: StudyForAI
    CSDN: http://blog.csdn.net/m0_37306360


    jieba POS tagging (part of speech)

    Install: pip install jieba

    Faster install from a domestic mirror: pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

    Import the module first; jieba.posseg.dt is the default POS tagging tokenizer.

    It tags the POS of each word after the sentence is segmented, using a tag set compatible with ICTCLAS.

    jieba does not appear to handle English; tools that can are covered later.

    import jieba.posseg as pseg
    words = pseg.cut("我爱自然语言处理技术!")
    for word, pos in words:
        print(word, pos)
    
    我 r
    爱 v
    自然语言 l
    处理 v
    技术 n
    ! x
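
    As an aside, jieba also lets you register custom words together with a POS tag, which affects both segmentation and tagging. A minimal sketch using jieba.add_word (the word and the nz tag here are illustrative):

    import jieba
    import jieba.posseg as pseg

    # Register "自然语言处理" as a single token tagged nz (other proper noun)
    jieba.add_word('自然语言处理', tag='nz')

    for word, pos in pseg.cut('我爱自然语言处理技术!'):
        print(word, pos)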
    

    SnowNLP POS tagging

    Install: pip install snownlp

    Install from a domestic mirror: pip install snownlp -i https://pypi.tuna.tsinghua.edu.cn/simple

    POS tagging with snownlp:

    from snownlp import SnowNLP
    model = SnowNLP(u'我爱自然语言处理技术!')
    for word, pos in model.tags:
        print(word, pos)
    
    我 r
    爱 v
    自然 n
    语言 n
    处理 vn
    技术 n
    ! w
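
    s.tags is built on SnowNLP's own segmenter, so the tokens line up with s.words; note that tags is a generator and can only be iterated once. A small sketch showing both attributes:

    from snownlp import SnowNLP

    s = SnowNLP(u'我爱自然语言处理技术!')
    print(s.words)       # segmentation only
    print(list(s.tags))  # (word, tag) pairs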
    

    THULAC POS tagging

    Install: pip install thulac

    Install from a domestic mirror: pip install thulac -i https://pypi.tuna.tsinghua.edu.cn/simple

    POS tagging with thulac:

    import thulac
    thulac_model = thulac.thulac()
    wordseg = thulac_model.cut("我爱自然语言处理技术!")
    print(wordseg)
    
    Model loaded succeed
    [['我', 'r'], ['爱', 'v'], ['自然', 'n'], ['语言', 'n'], ['处理', 'v'], ['技术', 'n'], ['!', 'w']]
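
    thulac.thulac() also takes options: for example, seg_only=True skips POS tagging, and cut(..., text=True) returns a plain string instead of a nested list. A sketch based on the parameters documented in the THULAC-Python README:

    import thulac

    # Segment only, no POS tags; text=True returns a space-separated string
    seg_model = thulac.thulac(seg_only=True)
    print(seg_model.cut('我爱自然语言处理技术!', text=True))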
    

    Stanford CoreNLP POS tagging

    Install: pip install stanfordcorenlp

    Install from a domestic mirror: pip install stanfordcorenlp -i https://pypi.tuna.tsinghua.edu.cn/simple

    POS tagging with stanfordcorenlp

    Supports POS tagging for both English and Chinese; the constructor takes the path to an unpacked CoreNLP release.

    from stanfordcorenlp import StanfordCoreNLP
    zh_model = StanfordCoreNLP(r'stanford-corenlp-full-2018-02-27', lang='zh')
    s_zh = '我爱自然语言处理技术!'
    word_pos_zh = zh_model.pos_tag(s_zh)
    print(word_pos_zh)
    
    [('我爱', 'NN'), ('自然', 'AD'), ('语言', 'NN'), ('处理', 'VV'), ('技术', 'NN'), ('!', 'PU')]
    
    eng_model = StanfordCoreNLP(r'stanford-corenlp-full-2018-02-27')
    s_eng = 'I love natural language processing technology!'
    word_pos_eng = eng_model.pos_tag(s_eng)
    print(word_pos_eng)
    
    [('I', 'PRP'), ('love', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('technology', 'NN'), ('!', '.')]
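
    Each StanfordCoreNLP object launches a Java server in the background, so it is worth shutting it down when finished; the package provides a close() method for this. A minimal sketch:

    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(r'stanford-corenlp-full-2018-02-27')
    try:
        print(nlp.pos_tag('I love natural language processing technology!'))
    finally:
        nlp.close()  # stop the underlying Java server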
    

    HanLP POS tagging

    Install: pip install pyhanlp

    Install from a domestic mirror: pip install pyhanlp -i https://pypi.tuna.tsinghua.edu.cn/simple

    POS tagging with pyhanlp:

    from pyhanlp import *
    s = '我爱自然语言处理技术!'
    word_seg = HanLP.segment(s)
    for term in word_seg:
        print(term.word, term.nature)
    
    我 rr
    爱 v
    自然语言处理 nz
    技术 n
    ! w
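
    Because HanLP.segment returns Term objects, it is easy to filter on the tag, e.g. keeping only noun-like words (tags starting with 'n' in HanLP's tag set). A sketch:

    from pyhanlp import HanLP

    s = '我爱自然语言处理技术!'
    # term.nature is a tag object, so convert it to str before comparing
    nouns = [t.word for t in HanLP.segment(s) if str(t.nature).startswith('n')]
    print(nouns)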
    

    NLTK POS tagging

    Install: pip install nltk

    Install from a domestic mirror: pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple

    nltk only handles English. On first use it also needs its tokenizer and tagger data, downloaded below.

    import nltk
    # First run only: download the tokenizer and tagger models
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    s = 'I love natural language processing technology!'
    s = nltk.word_tokenize(s)
    s_pos = nltk.pos_tag(s)
    print(s_pos)
    
    [('I', 'PRP'), ('love', 'VBP'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('technology', 'NN'), ('!', '.')]
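
    By default pos_tag returns Penn Treebank tags; passing tagset='universal' maps them to coarse universal tags instead (this needs the extra 'universal_tagset' data package). A sketch:

    import nltk
    nltk.download('universal_tagset')  # first run only

    tokens = nltk.word_tokenize('I love natural language processing technology!')
    print(nltk.pos_tag(tokens, tagset='universal'))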
    

    spaCy POS tagging

    Install: pip install spaCy

    Install from a domestic mirror: pip install spaCy -i https://pypi.tuna.tsinghua.edu.cn/simple

    The English model is a separate download: python -m spacy download en.
    If the download fails with a permissions error, the easiest fix is to re-run the command as admin (i.e., open the terminal with administrator privileges).

    import spacy 
    
    eng_model = spacy.load('en')
    s = 'I love natural language processing technology!'
    

    POS tagging:

    s_token = eng_model(s)
    for token in s_token:
        print(token, token.pos_, token.pos)
    
    I PRON 94
    love VERB 99
    natural ADJ 83
    language NOUN 91
    processing NOUN 91
    technology NOUN 91
    ! PUNCT 96
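
    token.pos_ is the coarse universal tag (token.pos is its integer id); spaCy also exposes a fine-grained, Penn Treebank style tag as token.tag_. Note that newer spaCy (3.x) drops the 'en' shortcut, so the model is loaded by its full name. A sketch under that assumption:

    import spacy

    # spaCy 3.x: load the small English model by its full package name
    nlp = spacy.load('en_core_web_sm')
    for token in nlp('I love natural language processing technology!'):
        print(token.text, token.pos_, token.tag_)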
    
