Chinese word segmentation library jieba: https://github.com/fxsjy/jieba
CoNLL2003 corpus homepage: https://www.clips.uantwerpen.be/conll2003/ner/
CoNLL2003 is the most widely used public dataset for named entity recognition; see the site above for details.
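Each line of the CoNLL2003 data files carries four whitespace-separated columns (token, POS tag, chunk tag, NER tag), with blank lines separating sentences. A minimal sketch of a reader for this format; the sample sentence is the example from the task description, and `parse_conll` is a hypothetical helper, not part of any library:

```python
sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""

def parse_conll(text):
    """Parse CoNLL2003-style lines into sentences of (token, ner) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences

print(parse_conll(sample)[0])
# [('U.N.', 'I-ORG'), ('official', 'O'), ('Ekeus', 'I-PER'), ...]
```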
Basic usage of the flair framework:
from flair.data import Sentence
from flair.models import SequenceTagger
sentence = Sentence('I love Berlin,the capital of Germany.')
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)
print(sentence)
print('The following NER tags are found:')
for entity in sentence.get_spans('ner'):
    print(entity)

Reflection: why was 'Berlin,the' recognized as a single location entity?
I see it now: there is no space after the comma, so a whitespace-based split cannot separate the two words; 'Berlin,the' becomes a single token and gets tagged as a location.
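The underlying issue can be reproduced without flair: splitting on whitespace leaves punctuation glued to words, while even a simple regex tokenizer separates them. A minimal pure-Python sketch of the difference (this is not flair's actual tokenizer, just an illustration):

```python
import re

text = 'I love Berlin,the capital of Germany.'

# Naive whitespace split: 'Berlin,the' stays glued together.
whitespace_tokens = text.split()

# A simple regex tokenizer that separates punctuation from words.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['I', 'love', 'Berlin,the', 'capital', 'of', 'Germany.']
print(regex_tokens)       # ['I', 'love', 'Berlin', ',', 'the', 'capital', 'of', 'Germany', '.']
```

With the words properly separated, the tagger sees 'Berlin' as its own token and can label it as a location on its own.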
Training and testing with the flair framework on the CoNLL2003 dataset reaches an F1 score of 93.16. The training code is below.
(Note: the CoNLL2003 corpus files must be downloaded in advance and placed in the resources/tasks folder:)
resources/tasks/conll_03/eng.testa
resources/tasks/conll_03/eng.testb
resources/tasks/conll_03/eng.train
from flair.data import Corpus
from flair.datasets import CONLL_03
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings
from typing import List
# 1. get the corpus
corpus: Corpus = CONLL_03(base_path='resources/tasks')
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    # GloVe embeddings
    WordEmbeddings('glove'),
    # contextual string embeddings, forward
    PooledFlairEmbeddings('news-forward', pooling='min'),
    # contextual string embeddings, backward
    PooledFlairEmbeddings('news-backward', pooling='min'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
# initialize sequence tagger
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
# initialize trainer
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner',
              train_with_dev=True,
              max_epochs=150)
Paper reading
1. Pooled Contextualized Embeddings for Named Entity Recognition


This paper introduces pooled contextualized embeddings. As shown in Figure 1 above, Indra is a rare word, so the contextual string embedding tags it as ORG when it should actually be PER. To address this, the authors propose dynamically aggregating contextualized embeddings: for each distinct word encountered (e.g. Indra), a pooling operation extracts a global word representation from all of that word's contextual instances; the procedure is shown in Figure 2 above.
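The pooling step can be sketched in a few lines of plain Python. The `PooledEmbedding` class below is a hypothetical helper illustrating the idea with element-wise min pooling, not flair's actual implementation:

```python
class PooledEmbedding:
    """Sketch of the paper's pooling idea: remember every contextual
    embedding seen for a word and pool them into one global vector."""

    def __init__(self):
        self.memory = {}  # word -> list of contextual embedding vectors

    def embed(self, word, contextual_vec):
        # Remember this occurrence of the word.
        self.memory.setdefault(word, []).append(contextual_vec)
        # Min-pool element-wise over all occurrences seen so far.
        pooled = [min(dims) for dims in zip(*self.memory[word])]
        # Final representation: the current context vector concatenated
        # with the pooled global memory.
        return contextual_vec + pooled

emb = PooledEmbedding()
# First sighting of 'Indra' in a misleading context...
v1 = emb.embed('Indra', [0.9, 0.1])
# ...second sighting in a clearer context; the pooled half now mixes
# evidence from both occurrences.
v2 = emb.embed('Indra', [0.2, 0.8])
print(v2)  # [0.2, 0.8, 0.2, 0.1]
```

Because the pooled half of the vector is updated every time the word reappears, later occurrences of a rare word like Indra benefit from the contexts of all earlier occurrences.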
To be continued...