Notes on implementing simple named entity recognition with NLTK
1. Installation and download
First, a problem that came up with the nltk installation:
!pip install nltk
>>
Requirement already satisfied: nltk in d:\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: tqdm in d:\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: joblib in d:\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: click in d:\anaconda3\lib\site-packages (from nltk) (7.1.2)
>>
The package is already installed in Anaconda, yet using it in code produces:
Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
For more information see: https://www.nltk.org/data.html
Attempted to load corpora/stopwords
Searched in:
- 'C:\\Users\\xk17z/nltk_data'
- 'D:\\anaconda3\\nltk_data'
- 'D:\\anaconda3\\share\\nltk_data'
- 'D:\\anaconda3\\lib\\nltk_data'
- 'C:\\Users\\xk17z\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
**********************************************************************
Solution: open a Python interpreter from the console and run:
>>> import nltk
>>> nltk.download()
The NLTK downloader window appears:
Select all, click Download, and wait. The download path can be any of the paths listed in the error above; I simply put it on the C drive.
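Downloading all packages is slow. A lighter option (a sketch of mine, not from the original post) is to grab only the resources this walkthrough actually needs, and to register the data directory manually if it is not on the search path above:
import nltk

# fetch only what the steps below use, instead of 'all'
for resource in ['stopwords', 'punkt', 'averaged_perceptron_tagger',
                 'maxent_ne_chunker', 'words']:
    nltk.download(resource)

# if the data lives outside the default search paths, add it by hand
# ('C:/nltk_data' is just an example path)
nltk.data.path.append('C:/nltk_data')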
2. Using NLTK
- nltk.sent_tokenize(text) # split the text into sentences
- nltk.word_tokenize(sent) # split a sentence into words
import nltk
import re
import pandas as pd  # used later to build a DataFrame of entities

string = """FIFA was founded in 1904 to oversee international competition among the national associations of Belgium,
Denmark, France, Germany, the Netherlands, Spain, Sweden, and Switzerland. Headquartered in Zürich, its
membership now comprises 211 national associations. """
string = re.sub(r"\n", ' ', string)
sentences = nltk.sent_tokenize(string)  # split the text into sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into words
print(tokenized_sentences)
>>
[['FIFA', 'was', 'founded', 'in', '1904', 'to', 'oversee', 'international', 'competition', 'among', 'the', 'national', 'associations', 'of', 'Belgium', ',', 'Denmark', ',', 'France', ',', 'Germany', ',', 'the', 'Netherlands', ',', 'Spain', ',', 'Sweden', ',', 'and', 'Switzerland', '.'], ['Headquartered', 'in', 'Zürich', ',', 'its', 'membership', 'now', 'comprises', '211', 'national', 'associations', '.']]
>>
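As a side note, the stopwords resource from the error in section 1 can now be put to use. A small sketch of mine (not in the original post) that drops English stopwords from the tokenized sentences:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# keep only content-bearing tokens
filtered = [[w for w in sent if w.lower() not in stop_words]
            for sent in tokenized_sentences]
print(filtered[0][:8])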
- nltk.pos_tag(sentence) # part-of-speech tagging
- nltk.ne_chunk(tagged) # named entity recognition
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
print(tagged_sentences)
ne_chunked_sents = [nltk.ne_chunk(tagged) for tagged in tagged_sentences]
print(ne_chunked_sents[0])
>>
[[('FIFA', 'NNP'), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), ('Belgium', 'NNP'), (',', ','), ('Denmark', 'NNP'), (',', ','), ('France', 'NNP'), (',', ','), ('Germany', 'NNP'), (',', ','), ('the', 'DT'), ('Netherlands', 'NNP'), (',', ','), ('Spain', 'NNP'), (',', ','), ('Sweden', 'NNP'), (',', ','), ('and', 'CC'), ('Switzerland', 'NNP'), ('.', '.')], [('Headquartered', 'VBN'), ('in', 'IN'), ('Zürich', 'NNP'), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')]]
>>
[Tree('S', [Tree('ORGANIZATION', [('FIFA', 'NNP')]), ('was', 'VBD'), ('founded', 'VBN'), ('in', 'IN'), ('1904', 'CD'), ('to', 'TO'), ('oversee', 'VB'), ('international', 'JJ'), ('competition', 'NN'), ('among', 'IN'), ('the', 'DT'), ('national', 'JJ'), ('associations', 'NNS'), ('of', 'IN'), Tree('GPE', [('Belgium', 'NNP')]), (',', ','), Tree('GPE', [('Denmark', 'NNP')]), (',', ','), Tree('GPE', [('France', 'NNP')]), (',', ','), Tree('GPE', [('Germany', 'NNP')]), (',', ','), ('the', 'DT'), Tree('GPE', [('Netherlands', 'NNP')]), (',', ','), Tree('GPE', [('Spain', 'NNP')]), (',', ','), Tree('GPE', [('Sweden', 'NNP')]), (',', ','), ('and', 'CC'), Tree('GPE', [('Switzerland', 'NNP')]), ('.', '.')]), Tree('S', [('Headquartered', 'VBN'), ('in', 'IN'), Tree('GPE', [('Zürich', 'NNP')]), (',', ','), ('its', 'PRP$'), ('membership', 'NN'), ('now', 'RB'), ('comprises', 'VBZ'), ('211', 'CD'), ('national', 'JJ'), ('associations', 'NNS'), ('.', '.')])]
>>
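If any of the Penn Treebank tags above are unfamiliar, NLTK ships a lookup helper; this snippet is my addition and needs the 'tagsets' resource (nltk.download('tagsets')):
nltk.help.upenn_tagset('NNP')  # prints: noun, proper, singular ...
nltk.help.upenn_tagset('VBD')  # prints: verb, past tense ...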
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):  # entity chunks are subtrees with a label; plain (word, tag) tuples are not
            # join all leaves so multi-word entities stay intact
            entity_name = ' '.join(word for word, tag in tagged_tree.leaves())
            named_entities.append((entity_name, tagged_tree.label()))
named_entities = list(set(named_entities))  # de-duplicate
print(named_entities)
entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
print(entity_frame)
>>
[('Zürich', 'GPE'), ('FIFA', 'ORGANIZATION'), ('Switzerland', 'GPE'), ('Denmark', 'GPE'), ('Sweden', 'GPE'), ('Netherlands', 'GPE'), ('Germany', 'GPE'), ('Belgium', 'GPE'), ('Spain', 'GPE'), ('France', 'GPE')]
>>
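An alternative to walking the trees by hand (my sketch, not part of the original post) is to flatten each chunk tree into (word, POS, IOB) triples with nltk.chunk.tree2conlltags; the B-/I- prefixes also make multi-word entities explicit:
from nltk.chunk import tree2conlltags

iob_tagged = tree2conlltags(ne_chunked_sents[0])
print(iob_tagged[:3])
# from the tree above this yields:
# [('FIFA', 'NNP', 'B-ORGANIZATION'), ('was', 'VBD', 'O'), ('founded', 'VBN', 'O')]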
3. Tokenization in other languages
from nltk import tokenize

content = 'tem pirulito'  # sample Portuguese text
tokens = tokenize.word_tokenize(content, language='portuguese')  # tokenization does support Portuguese
nltk.pos_tag(tokens, lang='portuguese')  # tagging does not; this raises NotImplementedError:
>>
Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')
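A possible workaround (a sketch of mine, not from the original post): train a simple unigram tagger on NLTK's Portuguese Floresta treebank, which requires nltk.download('floresta') once:
from nltk.corpus import floresta
from nltk.tag import UnigramTagger

def simplify_tag(t):
    # Floresta tags look like 'H+n' or '>N+art'; keep the part after '+'
    return t.split('+')[-1] if '+' in t else t

train_sents = [[(w, simplify_tag(t)) for (w, t) in sent]
               for sent in floresta.tagged_sents()]
pt_tagger = UnigramTagger(train_sents)
print(pt_tagger.tag(tokens))  # 'tokens' is the Portuguese token list from above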
References:
Basic NLP tasks with NLTK's built-in methods: https://www.jianshu.com/p/16e1f6a7aaef
Downloading nltk_data from GitHub on Linux: https://www.jianshu.com/p/fd501b927b72