This article covers two parts:
A. Installing and deploying brat
B. Configuring brat for Chinese tasks
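Most NLP methods are supervised, and supervised learning needs a large amount of manually annotated text; the more, the better. Annotating text is tedious work, so a good tool helps a great deal.
One such tool is brat. It was designed for entity recognition and relation extraction, but it can be used for many other NLP tasks.
The brat server is a Python (version 2.5+) program that runs as a CGI application by default, and its installation script assumes a UNIX-like environment. If you are setting up the brat server on an existing web server with CGI support in a compatible environment, the CGI quick-start instructions should work.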
Part A
On Ubuntu 18.04, the installation goes as follows:
1. First, install apache2
sudo apt install apache2
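2. Next, download brat. Note: the master branch on GitHub is missing the file filelock_brat.py, so it is best avoided; download the latest version from the releases instead.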
tar xzf brat-v1.3_Crunchy_Frog.tar.gz
sudo mv brat-v1.3_Crunchy_Frog /var/www/html/brat
sudo chmod 777 -R /var/www/html/brat
cd /var/www/html/brat
./install.sh
3. Finally, edit the Apache (httpd) configuration file
sudo vim /etc/apache2/apache2.conf
Add the following:
<Directory /var/www/html/brat>
AllowOverride Options Indexes FileInfo Limit
Require all granted
AddType application/xhtml+xml .xhtml
AddType font/ttf .ttf
# Enable ExecCGI (Apache comments must be on their own line)
Options +ExecCGI
# Handle .cgi files as CGI scripts
AddHandler cgi-script .cgi
</Directory>
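Then restart the apache2 service. If brat's .cgi files are later served as plain text instead of being executed, the Apache CGI module probably needs enabling first (sudo a2enmod cgi).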
sudo service apache2 restart
Then open http://localhost/brat. If the brat start page loads, the installation succeeded.
Part B
brat uses four configuration files:
annotation.conf: annotation types
visual.conf: annotation display
tools.conf: annotation tools
kb_shortcuts.conf: keyboard shortcuts
annotation.conf is divided into four sections:
[entities]
The basic structure is:
[entities]
Person
Location
Organization
A more complex, hierarchical structure is written by indenting each subtype with a tab under its parent:
[entities]
Living-thing
	Person
	Animal
	Plant
Nonliving-thing
	Building
	Vehicle
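To make the hierarchy concrete, here is a minimal Python sketch that reads an [entities] section and records each type's parent. It assumes subtypes are indented with tab characters, one tab per level; the function name parse_entity_hierarchy is made up for this illustration and is not part of brat.

```python
def parse_entity_hierarchy(section_text):
    """Map each entity type to its parent type (None for top-level types)."""
    parents = {}
    stack = []  # stack[d] holds the most recent type seen at depth d
    for raw in section_text.splitlines():
        name = raw.strip()
        # Skip blank lines, comments and the [entities] header itself.
        if not name or name.startswith("#") or name.startswith("["):
            continue
        depth = len(raw) - len(raw.lstrip("\t"))  # number of leading tabs
        del stack[depth:]  # forget types at this depth or deeper
        parents[name] = stack[-1] if stack else None
        stack.append(name)
    return parents


EXAMPLE = (
    "[entities]\n"
    "Living-thing\n"
    "\tPerson\n"
    "\tAnimal\n"
    "\tPlant\n"
    "Nonliving-thing\n"
    "\tBuilding\n"
    "\tVehicle\n"
)

if __name__ == "__main__":
    for child, parent in parse_entity_hierarchy(EXAMPLE).items():
        print(child, "<-", parent)
```

With the example above, Person, Animal and Plant all map to Living-thing, while Living-thing and Nonliving-thing have no parent.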
[relations]
Relations can only be binary, as follows:
[relations]
Family Arg1:Person, Arg2:Person
Employment Arg1:Person, Arg2:Organization
Each argument of a binary relation can accept several entity types, as follows:
[relations]
Located Arg1:Person, Arg2:Building|City|Country
Located Arg1:Building, Arg2:City|Country
Located Arg1:City, Arg2:Country
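The way these definitions constrain a relation can be sketched in a few lines of Python. This is only an illustration of the line format shown above, not brat's own parser; parse_relation and is_allowed are hypothetical helper names, and the sketch ignores subtype inheritance from the entity hierarchy.

```python
def parse_relation(line):
    """Split 'Name Arg1:A|B, Arg2:C' into (name, {role: [allowed types]})."""
    name, rest = line.strip().split(None, 1)
    args = {}
    for part in rest.split(","):
        role, _, types = part.strip().partition(":")
        args[role] = types.split("|")
    return name, args


def is_allowed(definitions, name, arg1_type, arg2_type):
    """Return True if any definition of `name` accepts this pair of types."""
    for def_name, args in definitions:
        if (def_name == name
                and arg1_type in args.get("Arg1", [])
                and arg2_type in args.get("Arg2", [])):
            return True
    return False


LOCATED = [
    parse_relation("Located Arg1:Person, Arg2:Building|City|Country"),
    parse_relation("Located Arg1:Building, Arg2:City|Country"),
    parse_relation("Located Arg1:City, Arg2:Country"),
]

if __name__ == "__main__":
    print(is_allowed(LOCATED, "Located", "Person", "City"))
    print(is_allowed(LOCATED, "Located", "City", "Person"))
```

Under these three definitions a Person can be Located in a City, but a City cannot be Located in a Person, because no line lists City as Arg1 together with Person as Arg2.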
[events]
Events are similar to relations, but can take anywhere from one to many arguments:
[events]
Marriage Participant1:Person, Participant2:Person
Bankruptcy Org:Company
[attributes]
Attributes are used to mark other annotations, for example event annotations. An attribute can be binary (true/false) or multi-valued, as follows:
[attributes]
Negated Arg:<EVENT>
Confidence Arg:<EVENT>, Value:L1|L2|L3
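The difference between a binary and a multi-valued attribute comes down to whether a Value: field is present. A small Python sketch of that distinction, assuming the line format shown above (parse_attribute is a hypothetical helper, not a brat function):

```python
def parse_attribute(line):
    """Return (name, target, values); values is None for a binary attribute."""
    name, rest = line.strip().split(None, 1)
    # Each comma-separated field has the form 'Key:Value'.
    fields = dict(part.strip().split(":", 1) for part in rest.split(","))
    values = fields["Value"].split("|") if "Value" in fields else None
    return name, fields["Arg"], values


if __name__ == "__main__":
    print(parse_attribute("Negated Arg:<EVENT>"))
    print(parse_attribute("Confidence Arg:<EVENT>, Value:L1|L2|L3"))
```

Negated has no value list, so it is simply present or absent on an event; Confidence must take one of L1, L2 or L3.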
Visual configuration (visual.conf)
The visual configuration has two sections:
[labels]
Labels set the text displayed for each type. To save space, several progressively shorter forms can be given:
[labels]
Organization | Organization | Org
Immaterial-thing | Immaterial thing | Immaterial | Immat
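The idea behind the abbreviated forms can be sketched as follows: given the forms listed from longest to shortest, use the first one that fits the available space. This Python sketch uses a character count as a stand-in for the space available, which is an assumption for illustration; brat itself works with the rendered width. The names parse_label_line and pick_label are hypothetical.

```python
def parse_label_line(line):
    """Split 'Type | Long form | Short form' into (type, [display forms])."""
    parts = [p.strip() for p in line.split("|")]
    return parts[0], parts[1:]


def pick_label(forms, max_chars):
    """Return the first (longest) form that fits, else the shortest one."""
    for form in forms:
        if len(form) <= max_chars:
            return form
    return forms[-1]


if __name__ == "__main__":
    type_name, forms = parse_label_line(
        "Immaterial-thing | Immaterial thing | Immaterial | Immat")
    for width in (20, 10, 6):
        print(width, pick_label(forms, width))
```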
[drawing]
Controls display colors and other drawing settings; types without an entry use the system defaults.
[drawing]
Person bgColor:#ffccaa
Family fgColor:darkgreen, arrowHead:triangle-5
Tool configuration (tools.conf)
The tool configuration has five sections:
[options]
The following options are available:
Tokens tokenizer:VALUE, where VALUE=
    whitespace: split by whitespace characters in source text (only)
    ptblike: emulate Penn Treebank tokenization
    mecab: perform Japanese tokenization using MeCab
Sentences splitter:VALUE, where VALUE=
    regex: regular expression-based sentence splitting
    newline: split by newline characters in source text (only)
Validation validate:VALUE, where VALUE=
    all: perform full validation
    none: don't perform any validation
Annotation-log logfile:VALUE, where VALUE=
    <NONE>: no annotation logging
    NAME: log into file NAME (e.g. "/home/brat/work/annotation.log")
Here is an example:
[options]
Tokens tokenizer:mecab
Sentences splitter:newline
Validation validate:all
Annotation-log logfile:/home/brat/work/annotation.log
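A typo in one of these values is easy to miss, so a quick check of an [options] section against the values listed above can help. The Python sketch below is a hypothetical helper written for this article (ALLOWED and check_options are not part of brat), and it only knows the values documented in the list above.

```python
# Allowed values per option, copied from the list above.
ALLOWED = {
    "Tokens": ("tokenizer", {"whitespace", "ptblike", "mecab"}),
    "Sentences": ("splitter", {"regex", "newline"}),
    "Validation": ("validate", {"all", "none"}),
}


def check_options(lines):
    """Return the option lines whose key or value is not in ALLOWED."""
    problems = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("[") or line.startswith("#"):
            continue
        option, setting = line.split(None, 1)
        if option not in ALLOWED:
            continue  # e.g. Annotation-log takes a free-form file name
        key, _, value = setting.partition(":")
        expected_key, values = ALLOWED[option]
        if key != expected_key or value not in values:
            problems.append(line)
    return problems


if __name__ == "__main__":
    print(check_options([
        "[options]",
        "Tokens tokenizer:mecab",
        "Sentences splitter:newline",
        "Validation validate:all",
        "Annotation-log logfile:/home/brat/work/annotation.log",
    ]))
```

The example section passes the check, while a line such as "Tokens tokenizer:jieba" would be reported because jieba is not among the listed tokenizer values.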
[search]
[search]
Google <URL>:http://www.google.com/search?q=%s
Wikipedia <URL>:http://en.wikipedia.org/wiki/%s
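In these entries, %s marks where the selected text goes. A rough Python sketch of that substitution is shown below; percent-encoding the text with urllib.parse.quote is an assumption about what a browser-safe URL needs, not a description of brat's internals, and build_search_url is a hypothetical name.

```python
from urllib.parse import quote


def build_search_url(template, text):
    """Insert the percent-encoded annotated text at the %s placeholder."""
    return template.replace("%s", quote(text))


if __name__ == "__main__":
    print(build_search_url("http://en.wikipedia.org/wiki/%s", "New York"))
    print(build_search_url("http://www.google.com/search?q=%s", "北京"))
```

For Chinese text the encoding step matters, since the characters are sent as UTF-8 bytes in percent-encoded form.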
[normalization]
[normalization]
Wiki DB:dbs/wiki, <URL>:http://en.wikipedia.org, <URLBASE>:http://en.wikipedia.org/?curid=%s
UniProt <URL>:http://www.uniprot.org/, <URLBASE>:http://www.uniprot.org/uniprot/%s
[annotators]
[annotators]
SNER-CoNLL tool:Stanford_NER, model:CoNLL, <URL>:http://example.com:80/tagger/
[disambiguators]
[disambiguators]
simsem-MUC tool:simsem, model:MUC, <URL>:http://example.com:80/simsem/%s
Here is a small example:
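Create a new directory stock under the data folder, containing three files:
stock
--1.txt text to annotate
--1.ann empty file
--annotation.conf configuration file
The configuration file is as follows: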
[entities]
OTH
LOC
NAME
ORG
TIME
TIL
NUM
[relations]
[events]
[attributes]
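With more than a handful of documents, creating the empty .ann files by hand gets tedious. The Python sketch below prepares a collection directory laid out as described above: it creates an empty .ann file next to every .txt file that lacks one and writes annotation.conf if it does not exist yet. The function name prepare_collection and the path in the usage example are assumptions for illustration.

```python
from pathlib import Path

# Same content as the annotation.conf shown above.
ANNOTATION_CONF = """[entities]
OTH
LOC
NAME
ORG
TIME
TIL
NUM

[relations]

[events]

[attributes]
"""


def prepare_collection(collection_dir):
    """Create missing .ann files and annotation.conf; return new .ann names."""
    collection = Path(collection_dir)
    collection.mkdir(parents=True, exist_ok=True)
    created = []
    for txt in sorted(collection.glob("*.txt")):
        ann = txt.with_suffix(".ann")
        if not ann.exists():  # never overwrite existing annotations
            ann.touch()
            created.append(ann.name)
    conf = collection / "annotation.conf"
    if not conf.exists():
        conf.write_text(ANNOTATION_CONF, encoding="utf-8")
    return created
```

For the setup in this article it could be called as prepare_collection("/var/www/html/brat/data/stock"), assuming brat was installed under /var/www/html/brat as in Part A.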
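To annotate, open the stock collection in brat, log in, select a span of text in the document, and choose an entity type in the dialog that appears.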
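Once annotations have been saved, they end up in the .ann file next to the text. The Python sketch below reads the entity lines back, assuming brat's standoff format for text-bound annotations: an ID starting with T, a tab, the type and character offsets, a tab, and the covered text. The function name read_entities and the sample lines are made up for illustration.

```python
def read_entities(ann_text):
    """Return (id, type, [(start, end), ...], text) for each entity line."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        ann_id, type_and_offsets, text = line.split("\t", 2)
        entity_type, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations list several spans separated by ';'.
        spans = [tuple(int(n) for n in span.split())
                 for span in offsets.split(";")]
        entities.append((ann_id, entity_type, spans, text))
    return entities


if __name__ == "__main__":
    sample = "T1\tORG 0 4\t中国银行\nT2\tTIME 5 10\t2018年\n"
    for entity in read_entities(sample):
        print(entity)
```

Offsets count characters in the .txt file, so Chinese text needs no special handling as long as both files are read as UTF-8.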
References:
https://blog.csdn.net/tcx1992/article/details/80580089
http://brat.nlplab.org