02 - Text Annotation Tool brat

Author: 米米不多 | Published 2018-12-25 14:15


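This article covers two parts:

A. Installing and deploying brat

B. Configuring brat for Chinese-language tasks

Most NLP work relies on supervised learning, which needs large amounts of manually annotated text, and more is generally better. Annotating text is tedious, so a good tool helps a great deal.

brat is one such annotation tool. It was designed for entity recognition and relation extraction, but it can be used for many kinds of NLP tasks.

The brat server is a Python (version 2.5+) program that runs as a CGI application by default, and its installation script assumes a UNIX-like environment. If you set up the brat server on an existing CGI-capable web server in a compatible environment, the CGI quick-start instructions should work.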

Part A

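On Ubuntu 18.04, proceed as follows:

1. First, install apache2

sudo apt install apache2

2. Next, download brat. Note: the master branch on GitHub is missing the file filelock_brat.py, so it is best avoided; download the latest version from the releases instead.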

tar xzf brat-v1.3_Crunchy_Frog.tar.gz

sudo mv brat-v1.3_Crunchy_Frog /var/www/html/brat

sudo chmod 777 -R /var/www/html/brat

cd /var/www/html/brat

./install.sh

3. Finally, edit the Apache (httpd) configuration file

sudo vim /etc/apache2/apache2.conf

Add the following:

<Directory /var/www/html/brat>

AllowOverride Options Indexes FileInfo Limit

Require all granted

AddType application/xhtml+xml .xhtml

AddType font/ttf .ttf

# Enable ExecCGI (Apache does not accept a comment on the same line as a directive)

Options +ExecCGI

# Handle .cgi files as CGI scripts

AddHandler cgi-script .cgi

</Directory>

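Then restart the apache2 service (if the CGI module is not yet enabled in Apache, run sudo a2enmod cgi first):

sudo service apache2 restart

Then open http://localhost/brat. If the brat page loads, the installation succeeded.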

Part B

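brat uses four configuration files:

annotation.conf: annotation types

visual.conf: how annotations are displayed

tools.conf: annotation tool settings

kb_shortcuts.conf: keyboard shortcuts

annotation.conf is divided into four sections:

[entities]

The basic structure is: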

[entities]

Person

Location

Organization

A more complex, hierarchical structure indents subtypes under their parent type:

[entities]

Living-thing

                     Person

                     Animal

                     Plant

Nonliving-thing

                     Building

                     Vehicle
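To make the indentation rule concrete, here is a minimal Python sketch that reads an [entities] block like the one above and records the parent of each type. The function name parse_entity_hierarchy is hypothetical and this is not brat's own parser; it only assumes that a subtype is indented more deeply than its parent.

```python
def parse_entity_hierarchy(lines):
    """Map each entity type to its parent type (None for top-level types)."""
    parents = {}
    stack = []  # (indent_width, type_name) for the current chain of ancestors
    for raw in lines:
        stripped = raw.strip()
        if not stripped or stripped.startswith(("#", "[")):
            continue  # skip blank lines, comments, and the section header
        indent = len(raw) - len(raw.lstrip())
        # Drop entries that are not ancestors of the current line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents[stripped] = stack[-1][1] if stack else None
        stack.append((indent, stripped))
    return parents


config = """[entities]
Living-thing
\tPerson
\tAnimal
\tPlant
Nonliving-thing
\tBuilding
\tVehicle
""".splitlines()

hierarchy = parse_entity_hierarchy(config)
print(hierarchy["Person"])    # Living-thing
print(hierarchy["Building"])  # Nonliving-thing
```

A flat [entities] list passes through the same function unchanged, with every type mapped to None.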

[relations]

Relations can only be binary, as follows:

[relations]

Family Arg1:Person, Arg2:Person

Employment Arg1:Person, Arg2:Organization

Each argument of a binary relation can accept several entity types, as follows:

[relations]

Located Arg1:Person, Arg2:Building|City|Country

Located Arg1:Building, Arg2:City|Country

Located Arg1:City, Arg2:Country
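The way several Located lines combine can be sketched in Python: each line contributes one allowed (Arg1, Arg2) pattern, and a pair of entities is acceptable if any line matches. The helper names parse_relations and is_allowed are hypothetical, and this is an illustration of the rule rather than brat's actual validation code.

```python
def parse_relations(lines):
    """Return a list of (relation, arg1_types, arg2_types) tuples."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "[")):
            continue  # skip blank lines, comments, and the section header
        name, args = line.split(None, 1)  # relation name, then the argument list
        slots = {}
        for part in args.split(","):
            role, types = part.strip().split(":", 1)
            slots[role] = set(types.split("|"))
        rules.append((name, slots["Arg1"], slots["Arg2"]))
    return rules


def is_allowed(rules, relation, arg1_type, arg2_type):
    """True if any definition of `relation` accepts this pair of entity types."""
    return any(
        name == relation and arg1_type in a1 and arg2_type in a2
        for name, a1, a2 in rules
    )


rules = parse_relations("""[relations]
Located Arg1:Person, Arg2:Building|City|Country
Located Arg1:Building, Arg2:City|Country
Located Arg1:City, Arg2:Country
""".splitlines())

print(is_allowed(rules, "Located", "Person", "City"))   # True
print(is_allowed(rules, "Located", "Country", "City"))  # False
```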

[events]

Events resemble relations, but they can take anywhere from one to many arguments:

[events]

Marriage Participant1:Person, Participant2:Person

Bankruptcy Org:Company

[attributes]

Attributes are used to mark other annotations, for example an event annotation. An attribute can be binary (true/false) or multi-valued, as follows:

[attributes]

Negated Arg:<EVENT>

Confidence Arg:<EVENT>, Value:L1|L2|L3
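The difference between the two attribute lines above is simply whether a Value field is present. A small Python sketch of that distinction follows; parse_attribute is a hypothetical helper written for this article, not part of brat.

```python
def parse_attribute(line):
    """Split an [attributes] line into (name, target, values).

    `values` is None for a binary (true/false) attribute and a list of
    allowed values for a multi-valued one.
    """
    name, spec = line.strip().split(None, 1)
    fields = dict(part.strip().split(":", 1) for part in spec.split(","))
    values = fields["Value"].split("|") if "Value" in fields else None
    return name, fields["Arg"], values


negated = parse_attribute("Negated Arg:<EVENT>")
confidence = parse_attribute("Confidence Arg:<EVENT>, Value:L1|L2|L3")
print(negated)     # ('Negated', '<EVENT>', None)
print(confidence)  # ('Confidence', '<EVENT>', ['L1', 'L2', 'L3'])
```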

Visual configuration (visual.conf)

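visual.conf has two sections:

[labels]

Labels control the text shown for each type. The first field is the type name and the remaining fields are display forms; to save space, several progressively shorter forms can be listed: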

[labels]

Organization | Organization | Org

Immaterial-thing | Immaterial thing | Immaterial | Immat
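One way to picture how the shorter forms are used is the Python sketch below: it reads the [labels] lines and returns the first listed form that fits a given budget. The helper names are hypothetical, and counting characters is an assumption made for simplicity; the real interface works with rendered width in the browser.

```python
def parse_labels(lines):
    """Map each type name to its display forms, in the order they are listed."""
    labels = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "[")):
            continue  # skip blank lines, comments, and the section header
        type_name, *forms = [field.strip() for field in line.split("|")]
        labels[type_name] = forms
    return labels


def pick_label(forms, max_chars):
    """Return the first listed form that fits, else the last (shortest) one."""
    for form in forms:
        if len(form) <= max_chars:
            return form
    return forms[-1]


labels = parse_labels("""[labels]
Organization | Organization | Org
Immaterial-thing | Immaterial thing | Immaterial | Immat
""".splitlines())

print(pick_label(labels["Immaterial-thing"], 20))  # Immaterial thing
print(pick_label(labels["Immaterial-thing"], 10))  # Immaterial
print(pick_label(labels["Organization"], 5))       # Org
```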

[drawing]

This section controls the colors used for display; anything not set here falls back to the system defaults:

[drawing]

Person bgColor:#ffccaa

Family fgColor:darkgreen, arrowHead:triangle-5

Tool configuration (tools.conf)

tools.conf has five sections:

[options]

The following options are available:

Tokens tokenizer:VALUE, where VALUE=

        whitespace: split by whitespace characters in source text (only)

        ptblike: emulate Penn Treebank tokenization

        mecab: perform Japanese tokenization using MeCab

Sentences splitter:VALUE, where VALUE=

        regex: regular expression-based sentence splitting

        newline: split by newline characters in source text (only)

Validation validate:VALUE, where VALUE=

        all: perform full validation

        none: don't perform any validation

Annotation-log logfile:VALUE, where VALUE=

        <NONE>: no annotation logging

        NAME: log into file NAME (e.g. "/home/brat/work/annotation.log")

Here is an example:

[options]

Tokens tokenizer:mecab

Sentences splitter:newline

Validation validate:all

Annotation-log logfile:/home/brat/work/annotation.log
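Each [options] line is an option name followed by a single key:value pair. As a rough illustration, the Python sketch below turns the example above into a nested dictionary; parse_options is a hypothetical helper, not brat's own configuration reader.

```python
def parse_options(lines):
    """Return {option_name: {key: value}} for an [options] block."""
    options = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "[")):
            continue  # skip blank lines, comments, and the section header
        name, setting = line.split(None, 1)
        # Split on the first colon only, so values may contain ':' themselves.
        key, value = setting.split(":", 1)
        options[name] = {key: value}
    return options


opts = parse_options("""[options]
Tokens tokenizer:mecab
Sentences splitter:newline
Validation validate:all
Annotation-log logfile:/home/brat/work/annotation.log
""".splitlines())

print(opts["Tokens"])                     # {'tokenizer': 'mecab'}
print(opts["Annotation-log"]["logfile"])  # /home/brat/work/annotation.log
```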

[search]

[search]

Google <URL>:http://www.google.com/search?q=%s

Wikipedia <URL>:http://en.wikipedia.org/wiki/%s
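In these [search] entries, %s marks where the selected text goes in the query URL. The Python sketch below shows the substitution with percent-encoding, which matters for Chinese text; build_search_url is a hypothetical helper, and whether brat encodes the text in exactly this way is an assumption.

```python
from urllib.parse import quote


def build_search_url(template, text):
    """Fill the %s placeholder with the URL-encoded annotated text."""
    return template.replace("%s", quote(text))


wiki_url = build_search_url("http://en.wikipedia.org/wiki/%s", "Tokyo Tower")
google_url = build_search_url("http://www.google.com/search?q=%s", "北京")
print(wiki_url)    # http://en.wikipedia.org/wiki/Tokyo%20Tower
print(google_url)  # http://www.google.com/search?q=%E5%8C%97%E4%BA%AC
```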

[normalization]

[normalization]

Wiki DB:dbs/wiki, <URL>:http://en.wikipedia.org, <URLBASE>:http://en.wikipedia.org/?curid=%s

UniProt <URL>:http://www.uniprot.org/, <URLBASE>:http://www.uniprot.org/uniprot/%s

[annotators]

[annotators]

SNER-CoNLL tool:Stanford_NER, model:CoNLL, <URL>:http://example.com:80/tagger/

[disambiguators]

[disambiguators]

simsem-MUC tool:simsem, model:MUC, <URL>:http://example.com:80/simsem/%s

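Here is a small example:

Create a directory named stock under the data folder, containing three files:

stock

--1.txt text to annotate

--1.ann empty file

--annotation.conf configuration file

The configuration file is as follows: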

[entities]

OTH

LOC

NAME

ORG

TIME

TIL

NUM

[relations]

[events]

[attributes]
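The three files can also be created by a script. The Python sketch below builds the same layout; create_collection is a hypothetical helper, the sample text is a placeholder, and a temporary directory stands in for brat's data folder so the example can run anywhere.

```python
import tempfile
from pathlib import Path

ENTITY_TYPES = ["OTH", "LOC", "NAME", "ORG", "TIME", "TIL", "NUM"]


def create_collection(data_dir, name, text):
    """Create <data_dir>/<name> with 1.txt, an empty 1.ann and annotation.conf."""
    collection = Path(data_dir) / name
    collection.mkdir(parents=True, exist_ok=True)
    (collection / "1.txt").write_text(text, encoding="utf-8")
    (collection / "1.ann").touch()  # empty annotation file next to the text
    conf = "[entities]\n" + "\n".join(ENTITY_TYPES) + "\n"
    conf += "[relations]\n[events]\n[attributes]\n"
    (collection / "annotation.conf").write_text(conf, encoding="utf-8")
    return collection


# A temporary directory is used here; in a real setup, pass brat's data folder.
with tempfile.TemporaryDirectory() as tmp:
    stock = create_collection(tmp, "stock", "示例文本")  # placeholder text
    names = sorted(p.name for p in stock.iterdir())
    conf_text = (stock / "annotation.conf").read_text(encoding="utf-8")

print(names)  # ['1.ann', '1.txt', 'annotation.conf']
```

When pointing this at the real data folder, the created files also need to be writable by the web server, as with the chmod step in Part A.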

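The annotation process is as follows: open the stock collection in the brat web interface, log in, select a span of text with the mouse, and choose an entity type in the dialog that appears.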
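Once entities have been annotated, brat stores them in 1.ann in its standoff format, one annotation per line. For entity (text-bound) annotations the line holds an ID, the type with start and end character offsets, and the covered text, separated by tabs. The Python sketch below reads such lines; parse_text_bound is a hypothetical helper, the sample sentence and annotations are made up for illustration, and the sketch only handles contiguous spans.

```python
def parse_text_bound(line):
    """Parse one tab-separated text-bound annotation line into a dict."""
    ann_id, type_and_span, surface = line.rstrip("\n").split("\t")
    ann_type, start, end = type_and_span.split(" ")
    return {
        "id": ann_id,
        "type": ann_type,
        "start": int(start),
        "end": int(end),
        "text": surface,
    }


# Illustrative data only: a made-up sentence and two made-up annotations.
text = "示例公司位于北京"
ann_lines = ["T1\tORG 0 4\t示例公司", "T2\tLOC 6 8\t北京"]

annotations = [parse_text_bound(line) for line in ann_lines]
for ann in annotations:
    # Offsets count characters, so slicing the text recovers the annotated span.
    print(ann["type"], text[ann["start"]:ann["end"]] == ann["text"])
```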

References:

https://blog.csdn.net/tcx1992/article/details/80580089

http://brat.nlplab.org
