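Several tagging schemes are in common use:
1. BIO
2. BIOES
3. IOB, among others
Using named entity recognition (NER) as the example, the sections below compare these schemes, focusing on how the choice of tagging scheme affects final model performance.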
1. BIO
B stands for 'beginning' (signifies the beginning of a named entity, i.e. NE)
I stands for 'inside' (signifies that the word is inside an NE)
O stands for 'outside' (signifies that the word is just a regular word outside of an NE)
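
As a rough illustration of the BIO definitions above, the sketch below converts entity spans into BIO tags. The function name `spans_to_bio`, the `(start, end, type)` span format with an exclusive end index, and the sample sentence are assumptions made for this example; they do not come from any particular library.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end_exclusive, type) spans into one BIO tag per token."""
    tags = ["O"] * n_tokens              # every token starts outside any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # remaining tokens inside the entity
    return tags


# Hypothetical sample sentence and spans, for illustration only.
tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The sketch assumes the spans do not overlap; overlapping or nested entities cannot be expressed in a flat BIO sequence.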
2. BIOES
B stands for 'beginning' (signifies beginning of an NE)
I stands for 'inside' (signifies that the word is inside an NE)
O stands for 'outside' (signifies that the word is just a regular word outside of an NE)
E stands for 'end' (signifies that the word is the end of an NE)
S stands for 'singleton' (signifies that the single word is an NE)
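
The relationship between BIO and BIOES can be sketched as a conversion: the last token of a multi-token entity becomes E, and a single-token entity becomes S. The helper name `bio_to_bioes` is hypothetical, and the sketch assumes the input is a well-formed BIO sequence.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence into BIOES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{etype}"   # does the entity go on?
        if prefix == "B":
            # A B tag with no continuation is a single-token entity.
            out.append(tag if continues else f"S-{etype}")
        else:
            # An I tag with no continuation is the end of the entity.
            out.append(tag if continues else f"E-{etype}")
    return out


bioes_tags = bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC", "O"])
print(bioes_tags)
```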
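3. IOB (i.e. IOB-1)
The letters in IOB mean the same as in BIO. The difference is that in IOB the B tag is used only to mark the boundary between two adjacent named entities of the same type; it is not used at the start of every named entity. For example:
Token sequence: (word)(word)(word)(word)(word)(word)
IOB tags: (I-loc)(I-loc)(B-loc)(I-loc)(O)(O)
BIO tags: (B-loc)(I-loc)(B-loc)(I-loc)(O)(O)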
The IOB scheme is similar to the BIO scheme; however, here the B- tag is used to start a segment only if the previous token is of the same class but is not part of that segment.
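Because IOB performed poorly overall, IOB-2 was introduced, which requires every named entity to begin with a B tag. IOB-2 is therefore equivalent to the BIO scheme.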
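
The difference between IOB-1 and IOB-2 (BIO) can be made concrete with a small converter. This is a minimal sketch; the function name `iob1_to_iob2` is an assumption for this example. An I- tag is rewritten to B- whenever it opens a new entity, that is, when the previous tag is O or belongs to a different entity type.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Rewrite IOB-1 tags so that every entity starts with a B- tag."""
    out = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            out.append(tag)
        else:
            prefix, etype = tag.split("-", 1)
            prev_type = prev.split("-", 1)[1] if prev != "O" else None
            if prefix == "I" and prev_type != etype:
                out.append(f"B-{etype}")   # I- tag that actually opens an entity
            else:
                out.append(tag)            # existing B- tags and true I- tags
        prev = tag
    return out


iob1_tags = ["I-loc", "I-loc", "B-loc", "I-loc", "O", "O"]
iob2_tags = iob1_to_iob2(iob1_tags)
print(iob2_tags)
```

Run on the IOB sequence from the example above, this yields the BIO sequence shown there.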
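In summary:
IOB does not use a B tag to mark the start of every entity, so it loses some annotation information, which leads to poor results on many tasks.
BIO fixes this problem, so it generally outperforms IOB.
BIOES additionally marks the end of an entity (E) and gives single-token entities their own S tag. This extra information may improve results, but the model also has more labels to predict (E and S are added), which can hurt performance.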
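
To give a sense of how much larger the label space becomes, the sketch below enumerates the labels each scheme produces for a given set of entity types. The helper `label_set` and the three entity types are assumptions chosen for illustration.

```python
from typing import List


def label_set(entity_types: List[str], scheme: str) -> List[str]:
    """List every label a tagger must predict under the given scheme."""
    prefixes = {"BIO": "BI", "BIOES": "BIES"}[scheme]
    labels = ["O"]                        # O is shared by all entity types
    for etype in entity_types:
        for prefix in prefixes:
            labels.append(f"{prefix}-{etype}")
    return labels


entity_types = ["PER", "ORG", "LOC"]     # hypothetical entity types
bio_labels = label_set(entity_types, "BIO")
bioes_labels = label_set(entity_types, "BIOES")
print(len(bio_labels), len(bioes_labels))
```

With K entity types, BIO needs 2K + 1 labels and BIOES needs 4K + 1, so the output layer nearly doubles in size.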
Paper source:
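What named entity recognition does:
Generally, the NER task is to identify the named entities in the input text, which fall into three major classes (entity, time, and number) and seven subclasses (person name, organization name, location name, time, date, currency, and percentage).
Stages of named entity recognition:
1. Identify entity boundaries
2. Determine entity types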
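
Both stages show up when decoding a tagged sequence: the B/I prefixes give the boundaries and the suffix gives the type. The following is a minimal sketch, assuming BIO-tagged input; the function name `bio_to_entities` and the sample sentence are hypothetical. Stray I- tags that do not follow a matching B- tag are simply ignored here, which is one of several possible policies.

```python
from typing import List, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Return (text, type, start, end_exclusive) for each entity in a BIO sequence."""
    entities = []
    start, etype = None, None
    # The trailing "O" is a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Boundary: the open entity ends when the tag does not continue it.
        if start is not None and tag != f"I-{etype}":
            entities.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        # Type: a B- tag opens a new entity and carries its type.
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC"]
entities = bio_to_entities(tokens, tags)
print(entities)
```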