Setting up a PostgreSQL Chinese full-text search environment with SCWS and zhparser
1. Install PostgreSQL
yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install postgresql-server postgresql-devel postgresql-contrib
postgresql-setup initdb
systemctl enable postgresql.service
systemctl start postgresql.service
1.1. Add PostgreSQL to the PATH environment variable
export PATH=$PATH:/usr/pgsql-9.6/bin
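The export above only affects the current shell, and running it repeatedly appends the same directory again. A minimal sketch of an idempotent version follows; the helper name add_to_path is hypothetical, and the directory is assumed to be the one used in the step above.

```shell
# Assumed install location from the step above; adjust if your version differs.
PG_BIN=/usr/pgsql-9.6/bin

# Append a directory to PATH only if it is not already present.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path "$PG_BIN"
export PATH

# pg_config is needed later to build zhparser; report whether it is reachable.
if command -v pg_config >/dev/null 2>&1; then
    echo "pg_config found: $(command -v pg_config)"
else
    echo "pg_config not found; check that $PG_BIN exists" >&2
fi
```

Placing these lines in a shell profile such as ~/.bash_profile would make the setting survive new login sessions.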
2. Install SCWS
wget http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
tar -jxvf scws-1.2.3.tar.bz2
cd scws-1.2.3
./configure --prefix=/usr/local/scws && make && make install
2.1. Verify the installation
ls -al /usr/local/scws/lib/libscws.la
/usr/local/scws/bin/scws -h
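The two manual checks above can be wrapped in a small script that reports what is missing. This is only a sketch: the function name check_scws is hypothetical, and it tests just the two files this guide already inspects under the assumed prefix /usr/local/scws.

```shell
# Return 0 if an SCWS install under the given prefix looks complete.
check_scws() {
    prefix=$1
    status=0
    # libscws.la is the libtool archive checked in the step above.
    if [ ! -f "$prefix/lib/libscws.la" ]; then
        echo "missing: $prefix/lib/libscws.la" >&2
        status=1
    fi
    # The scws command-line tool must exist and be executable.
    if [ ! -x "$prefix/bin/scws" ]; then
        echo "missing or not executable: $prefix/bin/scws" >&2
        status=1
    fi
    return $status
}

# Default to the prefix passed to ./configure above.
if check_scws "${SCWS_HOME:-/usr/local/scws}"; then
    echo "SCWS looks installed"
else
    echo "SCWS check failed" >&2
fi
```

Running the check before building zhparser helps catch a wrong SCWS_HOME early, since the zhparser build in the next step points at the same prefix.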
3. Install zhparser
git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws make && make install
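4. Configure PostgreSQL
-- Run the following in psql. Install the zhparser extension:
CREATE EXTENSION zhparser;
-- List the installed text search parsers
\dFp
-- List the 26 token types that zhparser assigns to Chinese text
select ts_token_type('zhparser');
-- Create a text search configuration that uses zhparser as its parser
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);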
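-- Add token mappings to the configuration: nouns (n), verbs (v), adjectives (a), idioms (i), exclamations (e), temporary idioms (l) and abbreviations (j)
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l,j WITH simple;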
-- Also emit the short words contained in longer compound words
set zhparser.multi_short=on;
4.1. Test word segmentation
select to_tsvector('testzhcfg','中国人民大学');
select to_tsvector('testzhcfg','南京大学');
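5. The abbreviation problem and how to solve it
Testing the segmentation of university abbreviations shows that "西大" is not recognized as a word: only "大" appears in the result.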
select to_tsvector('testzhcfg','南大 北大 东大 西大') ;
to_tsvector
-----------------------------------
'东大':3 '北大':2 '南大':1 '大':4
(1 row)
Check which token type the parser assigns to each abbreviation:
select ts_debug('testzhcfg','南大 北大 东大 西大') ;
ts_debug
-----------------------------------------------------
(j,"abbreviation,简称",南大,{simple},simple,{南大})
(n,"noun,名词",北大,{simple},simple,{北大})
(n,"noun,名词",东大,{simple},simple,{东大})
(f,"position,方位词",西,{},,)
(a,"adjective,形容词",大,{simple},simple,{大})
(5 rows)
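The output shows that "南大" is recognized as an abbreviation (type j) and "北大" and "东大" as nouns (type n), while "西大" is split into the position word "西" (type f, which has no dictionary mapping and is therefore dropped) and the adjective "大". Adding "西大" to the abbreviation dictionary fixes this and improves the accuracy of the segmenter.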
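One way to add the entry is a custom plain-text dictionary. The sketch below assumes the SCWS text dictionary layout of one entry per line with four whitespace-separated fields (word, TF, IDF, attribute) and uses attribute j for abbreviations. The file name dict_extra.txt, the helper add_abbreviation and the 1.0 weights are placeholders, not values taken from this guide.

```shell
# Hypothetical file name; the dictionary is assumed to be a UTF-8 .txt file.
DICT_FILE=${DICT_FILE:-./dict_extra.txt}

# Append a word to the dictionary unless it is already listed.
# Fields: word, TF, IDF, attribute (j = abbreviation). The 1.0 weights are placeholders.
add_abbreviation() {
    word=$1
    if [ -f "$DICT_FILE" ] && grep -q -- "^$word[[:space:]]" "$DICT_FILE"; then
        return 0
    fi
    printf '%s\t1.0\t1.0\tj\n' "$word" >> "$DICT_FILE"
}

add_abbreviation "西大"
cat "$DICT_FILE"
```

As far as the zhparser README describes it, the file would then be copied into the tsearch_data directory under the path printed by pg_config --sharedir and enabled with zhparser.extra_dicts = 'dict_extra.txt' in postgresql.conf, followed by a reload and a new connection. Treat these steps as assumptions and confirm them against the zhparser version you installed.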