Chinese word segmentation library jieba: https://github.com/fxsjy/jieba
CoNLL2003 corpus homepage: https://www.clips.uantwerpen.be/conll2003/ner/
CoNLL2003 is the most widely used public dataset for named entity recognition; see the site above for details.
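Each line of the CoNLL2003 data files carries four whitespace-separated columns (token, POS tag, chunk tag, NER tag), with blank lines separating sentences. A minimal sketch of a reader for this format; the sample sentence is the example from the task description, and `parse_conll` is a hypothetical helper, not part of any library:

```python
sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""

def parse_conll(text):
    """Parse CoNLL2003-style lines into sentences of (token, ner) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences

print(parse_conll(sample)[0])
# [('U.N.', 'I-ORG'), ('official', 'O'), ('Ekeus', 'I-PER'), ...]
```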
Basic usage of the flair framework:
from flair.data import Sentence
from flair.models import SequenceTagger
sentence = Sentence('I love Berlin,the capital of Germany.')
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)
print(sentence)
print('The following NER tags are found:')
for entity in sentence.get_spans('ner'):
    print(entity)

Reflection: why was 'Berlin,the' recognized as a single location entity?
I see it now: there is no space after the comma, so a whitespace-based split cannot separate the two words; 'Berlin,the' becomes a single token and gets tagged as a location.
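The underlying issue can be reproduced without flair: splitting on whitespace leaves punctuation glued to words, while even a simple regex tokenizer separates them. A minimal pure-Python sketch of the difference (this is not flair's actual tokenizer, just an illustration):

```python
import re

text = 'I love Berlin,the capital of Germany.'

# Naive whitespace split: 'Berlin,the' stays glued together.
whitespace_tokens = text.split()

# A simple regex tokenizer that separates punctuation from words.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['I', 'love', 'Berlin,the', 'capital', 'of', 'Germany.']
print(regex_tokens)       # ['I', 'love', 'Berlin', ',', 'the', 'capital', 'of', 'Germany', '.']
```

With the words properly separated, the tagger sees 'Berlin' as its own token and can label it as a location on its own.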
Training and testing with the flair framework on the CoNLL2003 dataset reaches an F1 score of 93.16. The training code is below.
(Note: the CoNLL2003 corpus files must be downloaded in advance and placed in the resources/tasks folder:)
resources/tasks/conll_03/eng.testa
resources/tasks/conll_03/eng.testb
resources/tasks/conll_03/eng.train
from flair.data import Corpus
from flair.datasets import CONLL_03
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings
from typing import List
# 1. get the corpus
corpus: Corpus = CONLL_03(base_path='resources/tasks')
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    # GloVe embeddings
    WordEmbeddings('glove'),
    # contextual string embeddings, forward
    PooledFlairEmbeddings('news-forward', pooling='min'),
    # contextual string embeddings, backward
    PooledFlairEmbeddings('news-backward', pooling='min'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
# initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner',
              train_with_dev=True,
              max_epochs=150)
Paper reading
1. Pooled Contextualized Embeddings for Named Entity Recognition


This paper introduces pooled contextualized embeddings. As shown in Figure 1 above, Indra is a rare word, so the contextual string embedding tags it as ORG when it should actually be PER. To address this, the authors propose dynamically aggregating contextualized embeddings: for each distinct word encountered (e.g. Indra), a pooling operation extracts a global word representation from all of that word's contextual instances; the procedure is shown in Figure 2 above.
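The pooling step can be sketched in a few lines of plain Python. The `PooledEmbedding` class below is a hypothetical helper illustrating the idea with element-wise min pooling, not flair's actual implementation:

```python
class PooledEmbedding:
    """Sketch of the paper's pooling idea: remember every contextual
    embedding seen for a word and pool them into one global vector."""

    def __init__(self):
        self.memory = {}  # word -> list of contextual embedding vectors

    def embed(self, word, contextual_vec):
        # Remember this occurrence of the word.
        self.memory.setdefault(word, []).append(contextual_vec)
        # Min-pool element-wise over all occurrences seen so far.
        pooled = [min(dims) for dims in zip(*self.memory[word])]
        # Final representation: the current context vector concatenated
        # with the pooled global memory.
        return contextual_vec + pooled

emb = PooledEmbedding()
# First sighting of 'Indra' in a misleading context...
v1 = emb.embed('Indra', [0.9, 0.1])
# ...second sighting in a clearer context; the pooled half now mixes
# evidence from both occurrences.
v2 = emb.embed('Indra', [0.2, 0.8])
print(v2)  # [0.2, 0.8, 0.2, 0.1]
```

Because the pooled half of the vector is updated every time the word reappears, later occurrences of a rare word like Indra benefit from the contexts of all earlier occurrences.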
To be continued...