
NLTK Usage Notes

Author: 三方斜阳 | Published 2021-02-22 15:47

    Some notes on implementing simple named entity recognition (NER) with nltk.

    1. Installation and download

    First, a problem that came up when installing nltk:

    !pip install nltk
    >>
    Requirement already satisfied: nltk in d:\anaconda3\lib\site-packages (3.5)
    Requirement already satisfied: tqdm in d:\anaconda3\lib\site-packages (from nltk) (4.50.2)
    Requirement already satisfied: regex in d:\anaconda3\lib\site-packages (from nltk) (2020.10.15)
    Requirement already satisfied: joblib in d:\anaconda3\lib\site-packages (from nltk) (0.17.0)
    Requirement already satisfied: click in d:\anaconda3\lib\site-packages (from nltk) (7.1.2)
    >>
    

    The package is already installed under Anaconda, yet using it in code raises:

    Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:
      >>> import nltk
      >>> nltk.download('stopwords')
      
      For more information see: https://www.nltk.org/data.html
    
      Attempted to load corpora/stopwords
    
      Searched in:
        - 'C:\\Users\\xk17z/nltk_data'
        - 'D:\\anaconda3\\nltk_data'
        - 'D:\\anaconda3\\share\\nltk_data'
        - 'D:\\anaconda3\\lib\\nltk_data'
        - 'C:\\Users\\xk17z\\AppData\\Roaming\\nltk_data'
        - 'C:\\nltk_data'
        - 'D:\\nltk_data'
        - 'E:\\nltk_data'
    **********************************************************************
    

    Solution: open a Python interpreter from the console and run:

    >>> import nltk
    >>> nltk.download()
    

    The NLTK downloader window appears (screenshot omitted):



    Select "all", click Download, and wait. The data can be downloaded into any of the paths listed in the error above; here I put it directly on the C drive.
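Downloading "all" works but is large. As a lighter alternative, you can inspect the same search path shown in the "Searched in:" error above and check for a single resource programmatically. A minimal sketch (the appended `C:\nltk_data` directory is just an example location):

```python
import nltk

# nltk looks through these directories, in order, for downloaded resources;
# this is the same list shown in the "Searched in:" part of the error above
print(nltk.data.path)

# a custom directory (e.g. the C:\nltk_data used above) can be appended manually
nltk.data.path.append(r'C:\nltk_data')

def has_resource(path):
    """Return True if an NLTK data package is already available locally."""
    try:
        nltk.data.find(path)
        return True
    except LookupError:
        return False

print(has_resource('corpora/stopwords'))
```

If `has_resource` returns False, `nltk.download('stopwords')` fetches just that corpus instead of everything.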

    2. Using NLTK

    • nltk.sent_tokenize(text) # split text into sentences
    • nltk.word_tokenize(sent) # split a sentence into word tokens
    import nltk
    import re
    string = """FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
    Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
    membership now comprises 211 national associations. """
    string = re.sub(r"\n", ' ', string)
    sentences = nltk.sent_tokenize(string)  # split the text into sentences
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenize each sentence
    print(tokenized_sentences)
    >>
    [['FIFA', 'was', 'founded', 'in', '1904', 'to', 'oversee', 'international', 'competition', 'among', 'the', 'national', 'associations', 'of', 'Belgium', ',', 'Denmark', ',', 'France', ',', 'Germany', ',', 'the', 'Netherlands', ',', 'Spain', ',', 'Sweden', ',', 'and', 'Switzerland', '.'], ['Headquartered', 'in', 'Zürich', ',', 'its', 'membership', 'now', 'comprises', '211', 'national', 'associations', '.']]
    >>
    
    • nltk.pos_tag(sentence) # part-of-speech tagging
    • nltk.ne_chunk(tagged) # named entity chunking
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    print(tagged_sentences)
    ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
    print(ne_chunked_sents[0])
    >>
    [[('FIFA', 'NNP'), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), ('Belgium', 'NNP'), (',', ','), ('Denmark', 'NNP'), (',', ','), ('France', 'NNP'), (',', ','), ('Germany', 'NNP'), (',', ','), ('the', 'DT'), ('Netherlands', 'NNP'), (',', ','), ('Spain', 'NNP'), (',', ','), ('Sweden', 'NNP'), (',', ','), ('and', 'CC'), ('Switzerland', 'NNP'), ('.', '.')], [('Headquartered', 'VBN'), ('in', 'IN'), ('Zürich', 'NNP'), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')]]
    >>
    [Tree('S', [Tree('ORGANIZATION', [('FIFA', 'NNP')]), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), Tree('GPE', [('Belgium', 'NNP')]), (',', ','), Tree('GPE', [('Denmark', 'NNP')]), (',', ','), Tree('GPE', [('France', 'NNP')]), (',', ','), Tree('GPE', [('Germany', 'NNP')]), (',', ','), ('the', 'DT'), Tree('GPE', [('Netherlands', 'NNP')]), (',', ','), Tree('GPE', [('Spain', 'NNP')]), (',', ','), Tree('GPE', [('Sweden', 'NNP')]), (',', ','), ('and', 'CC'), Tree('GPE', [('Switzerland', 'NNP')]), ('.', '.')]), Tree('S', [('Headquartered', 'VBN'), ('in', 'IN'), Tree('GPE', [('Zürich', 'NNP')]), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')])]
    >>
    import pandas as pd

    named_entities = []
    for ne_tagged_sentence in ne_chunked_sents:
        for tagged_tree in ne_tagged_sentence:
            if hasattr(tagged_tree, 'label'):  # entity subtrees have a label; plain tokens do not
                # join all tokens of the entity, not just the first leaf
                entity_name = ' '.join(word for word, tag in tagged_tree.leaves())
                named_entities.append((entity_name, tagged_tree.label()))
    named_entities = list(set(named_entities))  # deduplicate once, after the loops
    print(named_entities)
    entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
    print(entity_frame)
    >>
    [('Zürich', 'GPE'), ('FIFA', 'ORGANIZATION'), ('Switzerland', 'GPE'), ('Denmark', 'GPE'), ('Sweden', 'GPE'), ('Netherlands', 'GPE'), ('Germany', 'GPE'), ('Belgium', 'GPE'), ('Spain', 'GPE'), ('France', 'GPE')]
    >>
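Note that entities can span several tokens (e.g. "New York"), so extraction should join all leaves of an entity subtree rather than take only the first word. A small self-contained sketch using a hand-built nltk `Tree` (so no corpora downloads are needed; the sentence is invented for illustration):

```python
from nltk.tree import Tree

def extract_entities(chunked_sentence):
    """Collect (entity text, entity label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in chunked_sentence:
        if isinstance(node, Tree):  # entity subtrees are Tree objects; plain tokens are tuples
            name = ' '.join(word for word, tag in node.leaves())
            entities.append((name, node.label()))
    return entities

# hand-built tree mimicking ne_chunk output
sent = Tree('S', [
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('is', 'VBZ'), ('large', 'JJ'), ('.', '.'),
])
print(extract_entities(sent))  # → [('New York', 'GPE')]
```

Using `isinstance(node, Tree)` is a slightly more explicit test than `hasattr(node, 'label')` and behaves the same on `ne_chunk` output.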
    
    Tokenization in other languages
    from nltk import tokenize
    content = "O menino tem pirulito."  # sample Portuguese sentence
    token = tokenize.word_tokenize(content, language='portuguese')
    
    nltk.pos_tag(['tem', 'pirulito'], lang='portuguese')
    >>
    NotImplementedError: Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')
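Since `pos_tag` only ships pretrained taggers for English and Russian, one workaround is to train your own tagger. A minimal sketch with `UnigramTagger` on a tiny hand-tagged toy corpus (the sentences and tagset below are invented for illustration; in practice you would train on a real treebank such as nltk's `floresta` corpus):

```python
from nltk.tag import UnigramTagger, DefaultTagger

# toy hand-tagged Portuguese sentences, just for illustration (not a real corpus)
train = [
    [('ele', 'PRON'), ('tem', 'V'), ('pirulito', 'N')],
    [('ela', 'PRON'), ('tem', 'V'), ('bala', 'N')],
]
# back off to a default tag for words never seen in training
tagger = UnigramTagger(train, backoff=DefaultTagger('N'))
print(tagger.tag(['tem', 'pirulito']))  # → [('tem', 'V'), ('pirulito', 'N')]
```

A `UnigramTagger` simply assigns each word its most frequent tag from the training data, so accuracy depends entirely on corpus size and coverage.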
    
    

    References:

    Completing basic NLP tasks with NLTK's built-in methods
    https://www.jianshu.com/p/16e1f6a7aaef

    Downloading NLTK data from GitHub on Linux: https://www.jianshu.com/p/fd501b927b72
