Tokenizers in ES

Author: 不知名的蛋挞 | Published: 2020-03-09 10:28

Reposted from: https://blog.csdn.net/xiaomin1991222/article/details/50981874 and https://segmentfault.com/a/1190000012553894

Concepts

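A full-text search engine analyzes each document to be indexed and extracts a number of Tokens from it; the algorithm that does this is called a Tokenizer. The Tokens are then processed further, for example converted to lowercase, by algorithms called Token Filters, and the processed results are called Terms. The number of times a Term occurs in a document is its Frequency (term frequency). The engine builds an Inverted Index from Terms to the original documents, so that the source documents can be found quickly from a Term. Before the Tokenizer runs, the text may need some preprocessing, such as stripping HTML markup; the algorithms that do this are called Character Filters. The whole analysis algorithm is called an Analyzer.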
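
The terms above can be tied together in a small sketch. This is a toy pipeline, not how ES implements analysis: the function names (`char_filter`, `tokenizer`, `token_filter`, `build_inverted_index`) and the sample documents are made up for illustration, and the tokenizer is just a regular expression.

```python
import re
from collections import Counter, defaultdict


def char_filter(text):
    """Character filter: strip HTML tags before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Tokenizer: split the text into tokens on non-word characters."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filter: lowercase every token to produce terms."""
    return [t.lower() for t in tokens]


def analyze(text):
    """Analyzer = character filter -> tokenizer -> token filter."""
    return token_filter(tokenizer(char_filter(text)))


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index


# Hypothetical sample documents
docs = {
    1: "<p>Quick brown fox</p>",
    2: "<b>Quick</b> quick dog",
}
index = build_inverted_index(docs)
print(analyze(docs[2]))  # ['quick', 'quick', 'dog']
print(index["quick"])    # {1: 1, 2: 2}
```

Looking up `index["quick"]` returns both documents together with their term frequencies, which is the lookup the Inverted Index exists to make fast.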

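The whole analysis process: text → Character Filter → Tokenizer → Token Filter → Terms.

Tokenizers in ES

As the first part shows, an Analyzer consists of a Tokenizer and Filters.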

1. Built-in analyzers in ES

2. Built-in tokenizers in ES

3. Built-in filters in ES

3.1 Built-in token filters in ES

3.2 Built-in character filters in ES

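Custom analyzers

ES lets users define a custom Analyzer in the configuration file elasticsearch.yml (this applies to older releases; since ES 5.0, index-level settings such as analysis are set per index through the index settings API rather than in elasticsearch.yml), as follows: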

index:
  analysis:
    analyzer:
      myAnalyzer:
        tokenizer: standard
        filter: [standard, lowercase, stop]

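The configuration above registers an analyzer named myAnalyzer; once registered, it can be used directly at index time or at query time. It behaves much like the standard analyzer: tokenizer: standard selects the standard tokenizer, and filter: [standard, lowercase, stop] applies the standard filter, the lowercase filter, and the stop-word filter.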
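
To make the relationship between such a configuration and the resulting filter chain concrete, here is a minimal sketch that builds an analyzer from a dictionary shaped like the myAnalyzer entry. Everything in it is an assumption for illustration: the `TOKENIZERS` and `FILTERS` registries, the tiny stop-word list, and the regex tokenizer are toys, and the `standard` filter is left out because the sketch has nothing meaningful to model for it.

```python
import re

# Hypothetical stop-word list; the real stop filter has its own default list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

# Hypothetical registries: keys mirror the names used in the config.
TOKENIZERS = {
    "standard": lambda text: re.findall(r"\w+", text),
}
FILTERS = {
    "lowercase": lambda tokens: [t.lower() for t in tokens],
    "stop": lambda tokens: [t for t in tokens if t not in STOP_WORDS],
}


def build_analyzer(config):
    """Resolve the tokenizer and filters by name and chain them in order."""
    tokenize = TOKENIZERS[config["tokenizer"]]
    filters = [FILTERS[name] for name in config.get("filter", [])]

    def analyzer(text):
        tokens = tokenize(text)
        for apply_filter in filters:
            tokens = apply_filter(tokens)
        return tokens

    return analyzer


my_analyzer = build_analyzer(
    {"tokenizer": "standard", "filter": ["lowercase", "stop"]}
)
print(my_analyzer("The Quick Fox is the King of Foxes"))
# ['quick', 'fox', 'king', 'foxes']
```

Filters run in the order listed, so placing lowercase before stop is what lets the capitalized "The" be recognized as a stop word in this sketch.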

The Chinese analyzer es-ik

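ElasticSearch's default standard tokenizer splits Chinese text into individual characters, so results often fall short of expectations: what should be a single word ends up as separate characters after tokenization. For that reason we use es-ik here, a Chinese analyzer that gives better results.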
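
The per-character behaviour described above can be approximated in a few lines. This is only a rough imitation for illustration, assuming a simple regular expression over the basic CJK block; the real standard tokenizer follows Unicode text segmentation rules, and `per_character_tokens` is a made-up name.

```python
import re

# Each CJK ideograph (basic block only) becomes its own token;
# runs of ASCII letters and digits stay together as one token.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def per_character_tokens(text):
    """Approximate how Chinese text falls apart into single characters."""
    return TOKEN_RE.findall(text)


tokens = per_character_tokens("联想是全球最大的笔记本厂商")
print(tokens)
print(len(tokens))            # 13
print("笔记本" in tokens)      # False: the word never appears as one token
```

Because the word 笔记本 never becomes a token of its own, a search for it can only match on its individual characters, which is the weakness es-ik addresses.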

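ik comes with two analyzers:

ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; characters already assigned to one word are not reused by another word

The difference: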

# ik_max_word

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '联想是全球最大的笔记本厂商'
# Response

{
  "tokens" : [
    {
      "token" : "联想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "笔记本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "笔记",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "本厂",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "厂商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}


# ik_smart

curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '联想是全球最大的笔记本厂商'

# Response

{
  "tokens" : [
    {
      "token" : "联想",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "全球",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "最大",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "笔记本",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "厂商",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
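
The two responses above can be mimicked with a toy dictionary segmenter. This is not IK's algorithm: the seven-word `DICTIONARY` is hypothetical, `max_word` simply emits every dictionary hit, and `smart` uses greedy longest match from the left as a stand-in for IK's own disambiguation. For this particular sentence the stand-ins happen to produce the same token lists as shown above.

```python
# Hypothetical mini-dictionary; the real IK dictionary is far larger.
DICTIONARY = {"联想", "全球", "最大", "笔记本", "笔记", "本厂", "厂商"}
MAX_LEN = max(len(word) for word in DICTIONARY)


def max_word(text):
    """Emit every dictionary word found at any position, plus each single
    character that no dictionary word covers (finest granularity)."""
    matches = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 2, min(len(text), start + MAX_LEN) + 1):
            if text[start:end] in DICTIONARY:
                matches.append((start, end))
                for i in range(start, end):
                    covered[i] = True
    for i, is_covered in enumerate(covered):
        if not is_covered:
            matches.append((i, i + 1))
    # Order by start offset; at the same offset the longer match comes first.
    matches.sort(key=lambda m: (m[0], -m[1]))
    return [text[s:e] for s, e in matches]


def smart(text):
    """Greedy longest match from the left; tokens never overlap."""
    tokens, pos = [], 0
    while pos < len(text):
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            # Fall back to a single character when no word matches.
            if length == 1 or word in DICTIONARY:
                tokens.append(word)
                pos += length
                break
    return tokens


text = "联想是全球最大的笔记本厂商"
print(max_word(text))
# ['联想', '是', '全球', '最大', '的', '笔记本', '笔记', '本厂', '厂商']
print(smart(text))
# ['联想', '是', '全球', '最大', '的', '笔记本', '厂商']
```

The overlapping tokens 笔记本, 笔记, and 本厂 appear only in the first list, which is the practical difference between the two modes: more overlapping tokens mean more ways for a query to match, at the cost of a larger index.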

Next, let's create an index that uses ik. We create an index named iktest, define an analyzer named ik whose tokenizer is ik_max_word, and create a type named article with a field subject that is analyzed with ik_max_word.

curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "subject" : {
                    "type" : "string",
                    "analyzer" : "ik_max_word"
                }
            }
        }
    }
}'

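Bulk-add a few documents. The _id metadata is set explicitly here to make the results easier to read, and the subject values are a few news headlines picked at random.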

curl -XPOST 'http://localhost:9200/iktest/article/_bulk?pretty' -d '
{ "index" : { "_id" : "1" } }
{"subject" : "＂闺蜜＂崔顺实被韩检方传唤 韩总统府促彻查真相" }
{ "index" : { "_id" : "2" } }
{"subject" : "韩举行＂护国训练＂ 青瓦台:决不许国家安全出问题" }
{ "index" : { "_id" : "3" } }
{"subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮" }
{ "index" : { "_id" : "4" } }
{"subject" : "村上春树获安徒生奖 演讲中谈及欧洲排外问题" }
{ "index" : { "_id" : "5" } }
{"subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”" }
'

Search for “希拉里和韩国” (“Hillary and South Korea”)

curl -XPOST 'http://localhost:9200/iktest/article/_search?pretty' -d '
{
    "query" : { "match" : { "subject" : "希拉里和韩国" }},
    "highlight" : {
        "pre_tags" : ["<font color=red>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "subject" : {}
        }
    }
}
'
# Response
{
  "took" : 113,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.034062363,
    "hits" : [ {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "2",
      "_score" : 0.034062363,
      "_source" : {
        "subject" : "韩举行＂护国训练＂ 青瓦台:决不许国家安全出问题"
      },
      "highlight" : {
        "subject" : [ "<font color=red>韩</font>举行＂护<font color=red>国</font>训练＂ 青瓦台:决不许国家安全出问题" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "3",
      "_score" : 0.0076681254,
      "_source" : {
        "subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮"
      },
      "highlight" : {
        "subject" : [ "媒体称FBI已经取得搜查令 检视<font color=red>希拉里</font>电邮" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "5",
      "_score" : 0.006709609,
      "_source" : {
        "subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”"
      },
      "highlight" : {
        "subject" : [ "<font color=red>希拉里</font>团队炮轰FBI 参院民主党领袖批其“违法”" ]
      }
    }, {
      "_index" : "iktest",
      "_type" : "article",
      "_id" : "1",
      "_score" : 0.0021509775,
      "_source" : {
        "subject" : "＂闺蜜＂崔顺实被韩检方传唤 韩总统府促彻查真相"
      },
      "highlight" : {
        "subject" : [ "＂闺蜜＂崔顺实被<font color=red>韩</font>检方传唤 <font color=red>韩</font>总统府促彻查真相" ]
      }
    } ]
  }
}

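The highlight option is used here so the result can be rendered directly in HTML, where the matched characters or words appear in red. For a filter-style exact search, change match to term; note that a term query does not analyze its input, so the value must be a single term exactly as it is stored in the index (for example 希拉里), not a whole phrase.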
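
The difference between match and term can be sketched on a toy inverted index. All data here is assumed for illustration: the per-document term lists in `INDEXED_TERMS` and the hard-coded split of the query in `toy_analyze` are guesses consistent with the highlights above (希拉里, 韩, and 国 matched separately), not output taken from ES.

```python
from collections import defaultdict

# Hypothetical analyzed documents: doc id -> terms assumed to be in the index.
INDEXED_TERMS = {
    "2": ["韩", "举行", "护", "国", "训练"],
    "3": ["媒体", "fbi", "搜查令", "希拉里", "电邮"],
    "4": ["村上春树", "安徒生", "奖", "欧洲"],
}

# Inverted index: term -> set of doc ids containing it.
inverted = defaultdict(set)
for doc_id, terms in INDEXED_TERMS.items():
    for term in terms:
        inverted[term].add(doc_id)


def toy_analyze(query):
    """Stand-in for the field analyzer; the split below is hard-coded."""
    return {"希拉里和韩国": ["希拉里", "和", "韩", "国"]}.get(query, [query])


def match_query(query):
    """match: analyze the query text, then OR the resulting terms."""
    hits = set()
    for term in toy_analyze(query):
        hits |= inverted.get(term, set())
    return sorted(hits)


def term_query(value):
    """term: look the value up exactly as given, with no analysis."""
    return sorted(inverted.get(value, set()))


print(match_query("希拉里和韩国"))  # ['2', '3']
print(term_query("希拉里和韩国"))   # []
print(term_query("希拉里"))         # ['3']
```

In this sketch the whole phrase finds nothing as a term lookup because no single indexed term equals it, while the analyzed match lookup finds every document sharing at least one term with the query.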
