Notes on implementing simple named entity recognition with NLTK
1. Installation and download
First, a problem that came up with the nltk installation:
!pip install nltk
>>
Requirement already satisfied: nltk in d:\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: tqdm in d:\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: joblib in d:\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: click in d:\anaconda3\lib\site-packages (from nltk) (7.1.2)
>>
The package is already installed in Anaconda, yet using it in code produces:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
For more information see: https://www.nltk.org/data.html
Attempted to load corpora/stopwords
Searched in:
- 'C:\\Users\\xk17z/nltk_data'
- 'D:\\anaconda3\\nltk_data'
- 'D:\\anaconda3\\share\\nltk_data'
- 'D:\\anaconda3\\lib\\nltk_data'
- 'C:\\Users\\xk17z\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
**********************************************************************
Solution: open a Python interpreter from the console and run:
>>> import nltk
>>> nltk.download()
The NLTK downloader window appears:
Select all, click Download, and wait. The download path can be any of the paths listed in the error above; I simply put it on the C drive.
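Downloading all packages is slow. A lighter option (a sketch of mine, not from the original post) is to grab only the resources this walkthrough actually needs, and to register the data directory manually if it is not on the search path above:
import nltk

# fetch only what the steps below use, instead of 'all'
for resource in ['stopwords', 'punkt', 'averaged_perceptron_tagger',
                 'maxent_ne_chunker', 'words']:
    nltk.download(resource)

# if the data lives outside the default search paths, add it by hand
# ('C:/nltk_data' is just an example path)
nltk.data.path.append('C:/nltk_data')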
2. Using NLTK
- nltk.sent_tokenize(text) # split the text into sentences
- nltk.word_tokenize(sent) # split a sentence into words
import nltk
import re
import pandas as pd  # used later to build a DataFrame of entities

string = """FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. """
string = re.sub(r"\n", ' ', string)
sentences = nltk.sent_tokenize(string)  # split the text into sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into words
print(tokenized_sentences)
>>
[['FIFA', 'was', 'founded', 'in', '1904', 'to', 'oversee', 'international', 'competition', 'among', 'the', 'national', 'associations', 'of', 'Belgium', ',', 'Denmark', ',', 'France', ',', 'Germany', ',', 'the', 'Netherlands', ',', 'Spain', ',', 'Sweden', ',', 'and', 'Switzerland', '.'], ['Headquartered', 'in', 'Zürich', ',', 'its', 'membership', 'now', 'comprises', '211', 'national', 'associations', '.']]
>>
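As a side note, the stopwords resource from the error in section 1 can now be put to use. A small sketch of mine (not in the original post) that drops English stopwords from the tokenized sentences:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# keep only content-bearing tokens
filtered = [[w for w in sent if w.lower() not in stop_words]
            for sent in tokenized_sentences]
print(filtered[0][:8])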
- nltk.pos_tag(sentence) # part-of-speech tagging
- nltk.ne_chunk(tagged) # named entity recognition
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
print(tagged_sentences)
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
print(ne_chunked_sents[0])
>>
[[('FIFA', 'NNP'), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), ('Belgium', 'NNP'), (',', ','), ('Denmark', 'NNP'), (',', ','), ('France', 'NNP'), (',', ','), ('Germany', 'NNP'), (',', ','), ('the', 'DT'), ('Netherlands', 'NNP'), (',', ','), ('Spain', 'NNP'), (',', ','), ('Sweden', 'NNP'), (',', ','), ('and', 'CC'), ('Switzerland', 'NNP'), ('.', '.')], [('Headquartered', 'VBN'), ('in', 'IN'), ('Zürich', 'NNP'), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')]]
>>
[Tree('S', [Tree('ORGANIZATION', [('FIFA', 'NNP')]), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), Tree('GPE', [('Belgium', 'NNP')]), (',', ','), Tree('GPE', [('Denmark', 'NNP')]), (',', ','), Tree('GPE', [('France', 'NNP')]), (',', ','), Tree('GPE', [('Germany', 'NNP')]), (',', ','), ('the', 'DT'), Tree('GPE', [('Netherlands', 'NNP')]), (',', ','), Tree('GPE', [('Spain', 'NNP')]), (',', ','), Tree('GPE', [('Sweden', 'NNP')]), (',', ','), ('and', 'CC'), Tree('GPE', [('Switzerland', 'NNP')]), ('.', '.')]), Tree('S', [('Headquartered', 'VBN'), ('in', 'IN'), Tree('GPE', [('Zürich', 'NNP')]), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')])]
>>
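If any of the Penn Treebank tags above are unfamiliar, NLTK ships a lookup helper; this snippet is my addition and needs the 'tagsets' resource (nltk.download('tagsets')):
nltk.help.upenn_tagset('NNP')  # prints: noun, proper, singular ...
nltk.help.upenn_tagset('VBD')  # prints: verb, past tense ...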
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):  # entity chunks are subtrees with a label; plain (word, tag) tuples are not
            # join all leaves so multi-word entities stay intact
            entity_name = ' '.join(word for word, tag in tagged_tree.leaves())
            named_entities.append((entity_name, tagged_tree.label()))
named_entities = list(set(named_entities))  # de-duplicate
print(named_entities)
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
print(entity_frame)
>>
[('Zürich', 'GPE'), ('FIFA', 'ORGANIZATION'), ('Switzerland', 'GPE'), ('Denmark', 'GPE'), ('Sweden', 'GPE'), ('Netherlands', 'GPE'), ('Germany', 'GPE'), ('Belgium', 'GPE'), ('Spain', 'GPE'), ('France', 'GPE')]
>>
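An alternative to walking the trees by hand (my sketch, not part of the original post) is to flatten each chunk tree into (word, POS, IOB) triples with nltk.chunk.tree2conlltags; the B-/I- prefixes also make multi-word entities explicit:
from nltk.chunk import tree2conlltags

iob_tagged = tree2conlltags(ne_chunked_sents[0])
print(iob_tagged[:3])
# from the tree above this yields:
# [('FIFA', 'NNP', 'B-ORGANIZATION'), ('was', 'VBD', 'O'), ('founded', 'VBN', 'O')]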
3. Tokenization in other languages
from nltk import tokenize

content = 'tem pirulito'  # sample Portuguese text
tokens = tokenize.word_tokenize(content, language='portuguese')  # tokenization does support Portuguese
nltk.pos_tag(tokens, lang='portuguese')  # tagging does not; this raises NotImplementedError:
>>
Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')
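A possible workaround (a sketch of mine, not from the original post): train a simple unigram tagger on NLTK's Portuguese Floresta treebank, which requires nltk.download('floresta') once:
from nltk.corpus import floresta
from nltk.tag import UnigramTagger

def simplify_tag(t):
    # Floresta tags look like 'H+n' or '>N+art'; keep the part after '+'
    return t.split('+')[-1] if '+' in t else t

train_sents = [[(w, simplify_tag(t)) for (w, t) in sent]
               for sent in floresta.tagged_sents()]
pt_tagger = UnigramTagger(train_sents)
print(pt_tagger.tag(tokens))  # 'tokens' is the Portuguese token list from above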
References:
Basic NLP tasks with NLTK's built-in methods: https://www.jianshu.com/p/16e1f6a7aaef
Downloading nltk_data from GitHub on Linux: https://www.jianshu.com/p/fd501b927b72