
Notes on Using NLTK

Author: 三方斜阳 | Published 2021-02-22 15:47

A quick record of doing simple named entity recognition with nltk.

1. Installation and Data Download

First, an issue that came up when setting up nltk:

!pip install nltk
>>
Requirement already satisfied: nltk in d:\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: tqdm in d:\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: joblib in d:\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: click in d:\anaconda3\lib\site-packages (from nltk) (7.1.2)
>>

The package is already installed under Anaconda, yet using it in code raises:

Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('stopwords')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - 'C:\\Users\\xk17z/nltk_data'
    - 'D:\\anaconda3\\nltk_data'
    - 'D:\\anaconda3\\share\\nltk_data'
    - 'D:\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\xk17z\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************

Solution: open a Python interpreter from the console and run:

>>> import nltk
>>> nltk.download()

The NLTK Downloader window appears:

(screenshot of the NLTK Downloader window omitted)

Select "all" and click Download, then wait; it is slow. The download path can be any of the paths listed in the error message above. I put the data directly on the C drive.

2. Using NLTK

  • nltk.sent_tokenize(text) # split text into sentences
  • nltk.word_tokenize(sent) # tokenize a sentence
import nltk
import re

string = """FIFA was founded in 1904 to oversee international competition among the national associations of Belgium, 
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its 
membership now comprises 211 national associations. """
string = re.sub("\n", ' ', string)  # collapse line breaks into spaces
sentences = nltk.sent_tokenize(string)  # split the text into sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenize each sentence
print(tokenized_sentences)
>>
[['FIFA', 'was', 'founded', 'in', '1904', 'to', 'oversee', 'international', 'competition', 'among', 'the', 'national', 'associations', 'of', 'Belgium', ',', 'Denmark', ',', 'France', ',', 'Germany', ',', 'the', 'Netherlands', ',', 'Spain', ',', 'Sweden', ',', 'and', 'Switzerland', '.'], ['Headquartered', 'in', 'Zürich', ',', 'its', 'membership', 'now', 'comprises', '211', 'national', 'associations', '.']]
>>
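For intuition, the behavior of sent_tokenize (a trained Punkt model) and word_tokenize can be loosely approximated with regular expressions. This naive sketch is not equivalent to NLTK's tokenizers, but it shows the shape of the output:

```python
import re

text = ("FIFA was founded in 1904 to oversee international competition. "
        "Headquartered in Zürich, its membership now comprises 211 national associations.")

# Naive approximation: split sentences on whitespace following .!? and
# tokenize into runs of word characters or single punctuation marks.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
print(tokens)
```

Unlike this regex, the real Punkt model handles abbreviations, decimal numbers, and similar cases where a period does not end a sentence.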
  • nltk.pos_tag(sentence) # part-of-speech tagging
  • nltk.ne_chunk(tagged) # named entity recognition
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
print(tagged_sentences)
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
print(ne_chunked_sents[0])
>>
[[('FIFA', 'NNP'), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), ('Belgium', 'NNP'), (',', ','), ('Denmark', 'NNP'), (',', ','), ('France', 'NNP'), (',', ','), ('Germany', 'NNP'), (',', ','), ('the', 'DT'), ('Netherlands', 'NNP'), (',', ','), ('Spain', 'NNP'), (',', ','), ('Sweden', 'NNP'), (',', ','), ('and', 'CC'), ('Switzerland', 'NNP'), ('.', '.')], [('Headquartered', 'VBN'), ('in', 'IN'), ('Zürich', 'NNP'), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')]]
>>
[Tree('S', [Tree('ORGANIZATION', [('FIFA', 'NNP')]), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), Tree('GPE', [('Belgium', 'NNP')]), (',', ','), Tree('GPE', [('Denmark', 'NNP')]), (',', ','), Tree('GPE', [('France', 'NNP')]), (',', ','), Tree('GPE', [('Germany', 'NNP')]), (',', ','), ('the', 'DT'), Tree('GPE', [('Netherlands', 'NNP')]), (',', ','), Tree('GPE', [('Spain', 'NNP')]), (',', ','), Tree('GPE', [('Sweden', 'NNP')]), (',', ','), ('and', 'CC'), Tree('GPE', [('Switzerland', 'NNP')]), ('.', '.')]), Tree('S', [('Headquartered', 'VBN'), ('in', 'IN'), Tree('GPE', [('Zürich', 'NNP')]), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')])]
>>
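ne_chunk returns nltk.Tree objects in which each entity span is a subtree labeled with the entity type. A subtree's leaves can cover several tokens, so joining all leaves (rather than taking only the first) keeps multi-word entities whole. A small sketch with a hand-built tree (the sentence here is made up for illustration, not from the example above):

```python
from nltk.tree import Tree

# Hand-built chunk tree mimicking ne_chunk output (illustrative only).
sent = Tree('S', [
    Tree('ORGANIZATION', [('FIFA', 'NNP')]),
    ('met', 'VBD'), ('in', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),  # multi-word entity
])

# Iterating a Tree yields its children: plain (token, tag) tuples for
# ordinary words, Tree subtrees for recognized entities.
entities = [(' '.join(tok for tok, pos in sub.leaves()), sub.label())
            for sub in sent if isinstance(sub, Tree)]
print(entities)  # [('FIFA', 'ORGANIZATION'), ('New York', 'GPE')]
```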
import pandas as pd

named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):  # only entity subtrees carry a label
            # join all leaves so multi-word entities are kept whole
            entity_name = ' '.join(token for token, pos in tagged_tree.leaves())
            named_entities.append((entity_name, tagged_tree.label()))
named_entities = list(set(named_entities))  # deduplicate once, after the loops
print(named_entities)
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
print(entity_frame)
>>
[('Zürich', 'GPE'), ('FIFA', 'ORGANIZATION'), ('Switzerland', 'GPE'), ('Denmark', 'GPE'), ('Sweden', 'GPE'), ('Netherlands', 'GPE'), ('Germany', 'GPE'), ('Belgium', 'GPE'), ('Spain', 'GPE'), ('France', 'GPE')]
>>
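The extracted pairs can then be summarized with plain Python, for example counting how many entities of each type were found:

```python
from collections import Counter

# Entity pairs as produced above.
named_entities = [('Zürich', 'GPE'), ('FIFA', 'ORGANIZATION'), ('Switzerland', 'GPE'),
                  ('Denmark', 'GPE'), ('Sweden', 'GPE'), ('Netherlands', 'GPE'),
                  ('Germany', 'GPE'), ('Belgium', 'GPE'), ('Spain', 'GPE'),
                  ('France', 'GPE')]

type_counts = Counter(etype for name, etype in named_entities)
print(type_counts)  # Counter({'GPE': 9, 'ORGANIZATION': 1})
```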
Tokenization

NLTK's word tokenizer also supports other languages, e.g. Portuguese:

from nltk import tokenize
content = "tem pirulito"  # sample Portuguese text
tokens = tokenize.word_tokenize(content, language='portuguese')

POS tagging, however, raises an error for Portuguese:

nltk.pos_tag(tokens, lang='portuguese')
>>
Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')

References:

Completing basic NLP tasks with NLTK's built-in methods: https://www.jianshu.com/p/16e1f6a7aaef

Downloading NLTK data from GitHub on Linux: https://www.jianshu.com/p/fd501b927b72

Original post: https://www.haomeiwen.com/subject/amtkfltx.html