Postgres full-text search is good enough

Author: 阿森纳不可战胜 | Published: 2017-07-02 20:29


This translation on OSChina is well done; it covers the features and usage of Postgres full-text search in detail.
https://www.oschina.net/translate/postgres-full-text-search-is-good-enough

Original English article:
http://rachbelaid.com/postgres-full-text-search-is-good-enough/#1

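When you build a web application, you are often asked to add search. Sometimes the magnifying glass is drawn on the mockup before anyone knows what will actually be searched.
Search is an important feature, which is why Lucene-based tools such as elasticsearch and SOLR have become so popular. They are great tools, but before reaching for these search weapons of mass destruction, you may want something lighter that is still good enough.
By "good enough" I mean a search engine with the following features:

Stemming
Ranking / Boost
Support for multiple languages
Fuzzy search for misspellings
Accent support

Fortunately, PostgreSQL supports all of these features.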

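This article is aimed at readers who:

use PostgreSQL and do not want to install a separate search engine.
use another database (such as MySQL) and need better full-text search.

Throughout the article we use the following tables and data to illustrate PostgreSQL full-text search.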

CREATE TABLE author(
   id SERIAL PRIMARY KEY,
   name TEXT NOT NULL);
CREATE TABLE post(
   id SERIAL PRIMARY KEY,
   title TEXT NOT NULL,
   content TEXT NOT NULL,
   author_id INT NOT NULL references author(id) );
CREATE TABLE tag(
   id SERIAL PRIMARY KEY,
   name TEXT NOT NULL );
CREATE TABLE posts_tags(
   post_id INT NOT NULL references post(id),
   tag_id INT NOT NULL references tag(id)
 );
INSERT INTO author (id, name) 
VALUES (1, 'Pete Graham'), 
       (2, 'Rachid Belaid'), 
       (3, 'Robert Berry');

INSERT INTO tag (id, name) 
VALUES (1, 'scifi'), 
       (2, 'politics'), 
       (3, 'science');

INSERT INTO post (id, title, content, author_id) 
VALUES (1, 'Endangered species', 'Pandas are an endangered species', 1 ), 
       (2, 'Freedom of Speech', 'Freedom of speech is a necessary right missing in many countries', 2), 
       (3, 'Star Wars vs Star Trek', 'Few words from a big fan', 3);

INSERT INTO posts_tags (post_id, tag_id) 
VALUES (1, 3), 
       (2, 2), 
       (3, 1);

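This is a typical blog-like application. It has a post table with title and content fields. Each post is linked to an author through a foreign key, and a post can have several tags.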

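What is full-text search

First, let's look at the definition:

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection of documents in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in a database.
-- Wikipedia

A document is the unit of searching in a full-text search system; for example, a magazine article or an email message.
-- Postgres documentation

A document here can span several tables and represents the logical entity we want to search.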

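Building our document

The previous section introduced the concept of a document. A document is not tied to the table schema but to the data: it combines fields into one meaningful entity. Based on the schema in our example, our document is made of:

post.title
post.content
author.name of the post's author
all tag.name values associated with the post

To produce a document from these requirements, the SQL query looks like this: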

SELECT post.title || ' ' ||
       post.content || ' ' ||
       author.name || ' ' ||
       coalesce((string_agg(tag.name, ' ')), '') AS document
FROM post
JOIN author ON author.id = post.author_id
-- join each post to its own tags: post_id must match post.id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

               document
--------------------------------------------------
 Endangered species Pandas are an endangered species Pete Graham science
 Freedom of Speech Freedom of speech is a necessary right missing in many countries Rachid Belaid politics
 Star Wars vs Star Trek Few words from a big fan Robert Berry scifi
(3 rows)

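Because we group by post and author, and several tags can be attached to a single post, we aggregate the tag names with string_agg(). Even though author is a foreign key and a post cannot have more than one author, author must either be wrapped in an aggregate function or added to the GROUP BY clause.

We also use coalesce(). Whenever a value can be NULL, wrapping it in coalesce() is good practice; otherwise the whole concatenated string becomes NULL.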
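
A minimal sketch of why coalesce() matters; the literal strings are made up for illustration and do not come from the tables above:

```sql
-- Concatenating anything with NULL yields NULL, so the whole document would be lost.
SELECT 'Endangered species' || ' ' || NULL AS document;
-- document is NULL

-- coalesce() replaces the NULL with an empty string and keeps the rest of the text.
SELECT 'Endangered species' || ' ' || coalesce(NULL, '') AS document;
-- document is 'Endangered species '
```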

At this point our document is just one long string, which is not very useful. We need to convert it into the right format with to_tsvector().

SELECT to_tsvector(post.title) ||
       to_tsvector(post.content) ||
       to_tsvector(author.name) ||
       to_tsvector(coalesce((string_agg(tag.name, ' ')), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
-- join each post to its own tags: post_id must match post.id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

               document
--------------------------------------------------
 'endang':1,6 'graham':9 'panda':3 'pete':8 'scienc':10 'speci':2,7
 'belaid':16 'countri':14 'freedom':1,4 'mani':13 'miss':11 'necessari':9 'polit':17 'rachid':15 'right':10 'speech':3,6
 'berri':13 'big':10 'fan':11 'robert':12 'scifi':14 'star':1,4 'trek':5 'vs':3 'war':2 'word':7
(3 rows)

This query returns the document as a tsvector, a type suited to full-text search. Let's try converting a single string into a tsvector.

SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

This query returns the following result:

                             to_tsvector
----------------------------------------------------------------------
 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

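Something odd happened. There are fewer words than in the original sentence, some words have changed (try became tri), and each one is followed by numbers. What is going on?
A tsvector is a sorted list of distinct lexemes, that is, words that have been normalized so that different variants of the same word become identical.

Normalization almost always folds upper-case letters to lower case, and often removes suffixes (such as s, es and ing in English). This makes it possible to find every variant of a word without tediously typing all of them.

The numbers give the position of each lexeme in the original string; for example, "man" appears at positions 6 and 15. You can count the words yourself to check.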
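
As a small illustration of both points, the sketch below (the input string is made up for this example) passes three variants of one word through the english configuration. They collapse into a single lexeme, and the numbers record where each variant stood.

```sql
-- Three surface forms of the same word, stemmed with the english configuration.
SELECT to_tsvector('english', 'run runs running');
-- 'run':1,2,3
```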

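By default, the text search configuration used by to_tsvector in Postgres is "english". It ignores English stop words (translator's note: words such as am, is, are, a and an).

This explains why the tsvector contains fewer words than the original sentence. We will see more about languages and text search configurations later.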
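
Which configuration is used by default depends on the server's default_text_search_config setting, so it may not be english on every installation. A short sketch of how to check it, and how to avoid depending on it by naming the configuration explicitly:

```sql
-- Show the configuration to_tsvector() falls back to when none is given.
SHOW default_text_search_config;

-- Naming the configuration explicitly makes the result independent of that setting.
SELECT to_tsvector('english', 'Try not to become a man of success');
```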

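Querying

We now know how to build a document, but our goal is to search it. To search a tsvector we use the @@ operator, which is described in the Postgres documentation. Let's look at a few example queries.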

> select to_tsvector('If you can dream it, you can do it') @@ 'dream';
 ?column?
----------
 t
(1 row)

> select to_tsvector('It''s kind of fun to do the impossible') @@ 'impossible';

 ?column?
----------
 f
(1 row)

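The second query returns false because we need to build a proper tsquery. When a plain string is used with the @@ operator, it is only cast to tsquery, and a cast does not normalize the word. The following shows the difference between this cast and to_tsquery().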

SELECT 'impossible'::tsquery, to_tsquery('impossible');
   tsquery    | to_tsquery
--------------+------------
 'impossible' | 'imposs'
(1 row)

The lexeme of "dream", however, is identical to the word itself, which is why the first query matched.
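
The same comparison can be run for "dream" as a quick check; this is only a sketch, with the english configuration named explicitly as an assumption about the server's default:

```sql
-- For "dream" the plain cast and the normalized query give the same lexeme.
SELECT 'dream'::tsquery AS cast_only,
       to_tsquery('english', 'dream') AS normalized;
-- cast_only: 'dream'   normalized: 'dream'
```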

From now on we use to_tsquery to query documents.

SELECT to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery('impossible');

 ?column?
----------
 t
(1 row)

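A tsquery stores the lexemes to search for, and they can be combined with the logical operators & (AND), | (OR) and ! (NOT). Parentheses can be used to group operators.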
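
A minimal sketch of grouping with parentheses; the search terms are chosen only for illustration:

```sql
-- The parentheses evaluate the OR first, then combine it with AND and NOT.
SELECT to_tsvector('english', 'If the facts don''t fit the theory, change the facts')
       @@ to_tsquery('english', '(fiction | theory) & !dream');
-- t  (theory is present, dream is absent)
```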

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('! fact');

 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact');

 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory');

 ?column?
----------
 t
(1 row)

We can also use :* to express a prefix query, which matches words that start with a given string.

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('theo:*');

 ?column?
----------
 t
(1 row)

Now that we know how to write a full-text search query, let's go back to our original schema and query the document.

SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             to_tsvector(post.title) ||
             to_tsvector(post.content) ||
             to_tsvector(author.name) ||
             to_tsvector(coalesce(string_agg(tag.name, ' '), '')) as document
      FROM post
      JOIN author ON author.id = post.author_id
      -- join each post to its own tags: post_id must match post.id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('Endangered & Species');

 pid |      p_title
-----+--------------------
   1 | Endangered species
(1 row)

This query finds the documents that contain Endangered and Species, or words that normalize to the same lexemes.

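Language support

The built-in text search in Postgres supports many languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.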

SELECT to_tsvector('english', 'We are running');
 to_tsvector
-------------
 'run':3
(1 row)

SELECT to_tsvector('french', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)
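
The set of built-in configurations varies between Postgres versions, so the list above may not match a given server. As a quick check, the system catalog pg_ts_config can be queried for the configurations that are actually installed:

```sql
-- List the text search configurations available on this server.
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;
```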

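Going back to our original model, a column can also be used to choose the configuration when building the tsvector. Suppose the post table holds content in different languages and has a language column.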
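
The schema shown at the start has no language column, so the queries that follow assume one has been added. A minimal sketch, where the column type and the 'english' default are assumptions rather than something stated in the text:

```sql
-- Hypothetical migration: store the text search configuration name per post.
ALTER TABLE post ADD COLUMN language TEXT NOT NULL DEFAULT 'english';
```

Existing rows receive the default value, so the three sample posts would be treated as English.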

To make use of the language column, we now rebuild the document.

SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce((string_agg(tag.name, ' ')), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
-- join each post to its own tags: post_id must match post.id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

Without the explicit ::regconfig cast, the query fails with an error:

ERROR:  function to_tsvector(text, text) does not exist

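regconfig is an object identifier type that represents a Postgres text search configuration: http://www.postgresql.org/docs/9.3/static/datatype-oid.html

The lexemes of the document are now built using the correct language from post.language.

We also use simple, another text search configuration shipped with Postgres. simple does not remove stop words and does not try to find the stem of a word. With simple, every group of characters separated by whitespace becomes a lexeme. It is a practical choice for data such as a person's name, where we probably do not want stemming.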
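
A short sketch of the difference on a name, using one of the sample authors; the english configuration stems the surname, while simple only lower-cases it:

```sql
SELECT to_tsvector('english', 'Robert Berry') AS english_doc,
       to_tsvector('simple',  'Robert Berry') AS simple_doc;
-- english_doc: 'berri':2 'robert':1
-- simple_doc:  'berry':2 'robert':1
```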

SELECT to_tsvector('simple', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)

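Accented characters

When you build a search engine that supports several languages, you also need to think about accents. In many languages accents matter and can change the meaning of a word. Postgres ships with an extension called unaccent, which is useful for stripping accents from content.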

CREATE EXTENSION unaccent;
SELECT unaccent('èéêë');
 unaccent
----------
 eeee
(1 row)

Let's add some accented content to our post table.

INSERT INTO post (id, title, content, author_id, language) 
VALUES (4, 'il était une fois', 'il était une fois un hôtel ...', 2,'french');

If we want to ignore accents when we build the document, we can simply do the following:

SELECT to_tsvector(post.language::regconfig, unaccent(post.title)) ||
       to_tsvector(post.language::regconfig, unaccent(post.content)) ||
       to_tsvector('simple', unaccent(author.name)) ||
       to_tsvector('simple', unaccent(coalesce(string_agg(tag.name, ' '), ''))) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

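This works, but it is a bit cumbersome and leaves more room for mistakes. We can instead build a new text search configuration that supports unaccented characters.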

CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING
FOR hword, hword_part, word WITH unaccent, french_stem;

When we use this new text search configuration, we can see the resulting lexemes:

SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
-------------
 'fois':4
(1 row)

SELECT to_tsvector('fr', 'il était une fois');
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

This gives the same result as first applying unaccent and then building the tsvector from its output.

SELECT to_tsvector('french', unaccent('il était une fois'));
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

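The number of lexemes differs because il, était and une are stop words in French. Is it a problem to keep these stop words in our document? I don't think so: etait is not really a stop word, it is a misspelling.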

SELECT to_tsvector('fr', 'Hôtel') @@ to_tsquery('hotels') as result;
 result
--------
 t
(1 row)

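If we create an unaccented search configuration for every language our posts can be written in, and store its name in post.language, then we can keep our previous document query.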

SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

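If you need to create an unaccented text search configuration for every language supported by Postgres, the original article points to a gist that does this.

Our document will now probably grow in size, because it can contain unaccented stop words, but in exchange we can query without caring about accented characters. This is useful, for example, when someone with an English keyboard searches French content.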

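Ranking

When you build a search engine, you want the results ordered by relevance. Ranking can be based on many factors, which are roughly explained in the Postgres documentation.

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur.
-- PostgreSQL documentation

PostgreSQL provides several functions for getting results ordered by relevance; in our example we will use two of them: ts_rank() and setweight().

The setweight function lets us assign a weight (importance) to a tsvector; the value can be 'A', 'B', 'C' or 'D'.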
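
A minimal sketch of what setweight() does to a tsvector, using the title of the first sample post. The weight letter is attached to every position in the vector:

```sql
-- Label every lexeme of the title with weight A.
SELECT setweight(to_tsvector('english', 'Endangered species'), 'A');
-- 'endang':1A 'speci':2A
```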

SELECT pid, p_title
FROM (SELECT post.id as pid,
             post.title as p_title,
             setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
             setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
             setweight(to_tsvector('simple', author.name), 'C') ||
             setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'B') as document
      FROM post
      JOIN author ON author.id = post.author_id
      -- join each post to its own tags: post_id must match post.id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;

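In the query above we assign different weights to the different fields of the document. post.title (weight A) matters more than post.content and the tags (both weight B), and author.name (weight C) matters least.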
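
ts_rank() also accepts an optional array of four numbers, ordered {D, C, B, A}, that controls how much each weight label counts; the defaults are {0.1, 0.2, 0.4, 1.0}. The sketch below passes that array explicitly on literal text taken from the first sample post, so the values are easy to change and experiment with. It is an illustration under those assumptions, not part of the query above.

```sql
-- Rank a small weighted document; the array gives the values for D, C, B and A.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[],
               setweight(to_tsvector('english', 'Endangered species'), 'A') ||
               setweight(to_tsvector('english', 'Pandas are an endangered species'), 'B'),
               to_tsquery('english', 'Endangered & Species')) AS rank;
```

Raising the last element relative to the others makes matches in the title count for more.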
