IK Analyzer

Author: zjxchase | Published: 2021-07-25 13:03

    Installation and Configuration

    Download

    https://github.com/medcl/elasticsearch-analysis-ik/releases

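    Note: the IK analyzer version must match the Elasticsearch version.

    Installation

    Copy the downloaded package elasticsearch-analysis-ik-7.10.2.zip into the plugins folder under the Elasticsearch root directory and unzip it there. Then delete the zip file, rename the extracted analyzer folder to ik, and restart Elasticsearch.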
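The manual steps above can also be scripted. The following is a minimal sketch, not part of the IK distribution: the function name `install_ik` and its parameters are hypothetical, and it extracts the archive directly into `plugins/ik` rather than copying the zip into `plugins` first, which gives the same end state.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home, es_version):
    """Unpack an IK release archive into <es_home>/plugins/ik (hypothetical helper)."""
    zip_path, es_home = Path(zip_path), Path(es_home)
    # The archive is named elasticsearch-analysis-ik-<version>.zip, so the
    # version is the part of the file stem after the last hyphen.
    plugin_version = zip_path.stem.rsplit("-", 1)[-1]
    if plugin_version != es_version:
        raise ValueError(
            f"IK {plugin_version} does not match Elasticsearch {es_version}"
        )
    target = es_home / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

Calling `install_ik("elasticsearch-analysis-ik-7.10.2.zip", es_home, "7.10.2")` with your own Elasticsearch root directory would leave the plugin in `plugins/ik`; Elasticsearch still has to be restarted afterwards.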

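    Features

    The IK plugin provides two analyzers:

    Analyzer | Description
    ik_smart | Coarsest-grained split. For example, “中华人民共和国国歌” is split into “中华人民共和国” and “国歌”. Suited to phrase queries.
    ik_max_word | Finest-grained split. For example, “中华人民共和国国歌” is split into “中华人民共和国”, “中华人民”, “中华”, “华人”, “人民共和国”, “人民”, “人”, “民”, “共和国”, “共和”, “和”, “国”, “国歌”, exhausting the possible combinations. Suited to term queries.

    Example: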

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n22" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
    {
    "analyzer": "ik_smart",
    "text": "中华人民共和国"
    }

    Result:

    {
    "tokens" : [
    {
    "token" : "中华人民共和国",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
    }
    ]
    }
    </pre>

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
    {
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"
    }

    Result:

    {
    "tokens" : [
    {
    "token" : "中华人民共和国",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
    },
    {
    "token" : "中华人民",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
    },
    {
    "token" : "中华",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 2
    },
    {
    "token" : "华人",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 3
    },
    {
    "token" : "人民共和国",
    "start_offset" : 2,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 4
    },
    {
    "token" : "人民",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 5
    },
    {
    "token" : "共和国",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 6
    },
    {
    "token" : "共和",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 7
    },
    {
    "token" : "国",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 8
    }
    ]
    }
    </pre>
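To make the difference between the two granularities concrete, here is a toy sketch. It is not IK's actual algorithm: the small dictionary is an assumption chosen to mirror the example, `all_words` simply lists every dictionary word at every position, and `longest_match` uses plain greedy forward maximum matching.

```python
# Toy dictionary (an assumption); IK ships a far larger one.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def all_words(text, dictionary):
    """Fine-grained: every dictionary word found at every start position."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):  # longest candidates first
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def longest_match(text, dictionary):
    """Coarse-grained: greedy forward maximum matching without overlaps."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                break
        else:
            end = start + 1  # unknown character: emit it on its own
        tokens.append(text[start:end])
        start = end
    return tokens


print(longest_match("中华人民共和国国歌", DICTIONARY))
print(all_words("中华人民共和国国歌", DICTIONARY))
```

The coarse function returns two non-overlapping tokens, while the fine function returns overlapping words, which is the same trade-off the two IK analyzers make between precision and recall.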

    Customizing the Analyzer

    Before configuring a custom dictionary, consider the following example:

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
    {
    "analyzer": "ik_smart",
    "text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
    }

    Result:

    {
    "tokens" : [
    {
    "token" : "十",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_CHAR",
    "position" : 0
    },
    {
    "token" : "三届",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
    },
    {
    "token" : "全国人大",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 2
    },
    {
    "token" : "三次",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 3
    },
    {
    "token" : "会议",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "CN_WORD",
    "position" : 4
    },
    {
    "token" : "表决",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 5
    },
    {
    "token" : "通过了",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 6
    },
    {
    "token" : "民法典",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "CN_WORD",
    "position" : 7
    },
    {
    "token" : "自",
    "start_offset" : 20,
    "end_offset" : 21,
    "type" : "CN_CHAR",
    "position" : 8
    },
    {
    "token" : "2021年",
    "start_offset" : 21,
    "end_offset" : 26,
    "type" : "TYPE_CQUAN",
    "position" : 9
    },
    {
    "token" : "1月",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "TYPE_CQUAN",
    "position" : 10
    },
    {
    "token" : "1日",
    "start_offset" : 28,
    "end_offset" : 30,
    "type" : "TYPE_CQUAN",
    "position" : 11
    },
    {
    "token" : "起",
    "start_offset" : 30,
    "end_offset" : 31,
    "type" : "CN_CHAR",
    "position" : 12
    },
    {
    "token" : "施行",
    "start_offset" : 31,
    "end_offset" : 33,
    "type" : "CN_WORD",
    "position" : 13
    }
    ]
    }</pre>

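    1. Create the custom dictionaries

    In the config folder of the installed IK plugin, create a folder named custom (D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom). Inside custom, create mydic.dic (the custom word dictionary) and ext_stopwork.dic (the stopword dictionary).

    Add the following to mydic.dic: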

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n32" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">十三届全国人大</pre>

    Add the following to ext_stopwork.dic:

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n34" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">自
    起</pre>
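Both files are plain text with one word per line. As far as I know, IK's README asks for extension dictionaries to be UTF-8 encoded, so it can be safer to write them from a script than from an editor with an unknown default encoding. A minimal sketch, where `write_dictionary` is a hypothetical helper and the config path is whatever your installation uses:

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line, UTF-8 encoded, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(words) + "\n", encoding="utf-8")


def create_custom_dictionaries(ik_config_dir):
    """Create custom/mydic.dic and custom/ext_stopwork.dic as in the text."""
    custom = Path(ik_config_dir) / "custom"
    write_dictionary(custom / "mydic.dic", ["十三届全国人大"])
    write_dictionary(custom / "ext_stopwork.dic", ["自", "起"])
    return custom
```

For the layout used in this article, `ik_config_dir` would be the plugins\ik\config folder of the Elasticsearch installation.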

    2. Configure the custom dictionaries

    In IKAnalyzer.cfg.xml, located in D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config, register the two files just created. The key entries are:

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n40" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><properties>
    <comment>IK Analyzer extension configuration</comment>

    <entry key="ext_dict">custom/mydic.dic</entry>

    <entry key="ext_stopwords">custom/ext_stopwork.dic</entry>




    </properties></pre>
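A wrong path in this file is easy to miss, because the analyzer keeps working with its built-in dictionary only. The sketch below reads the configuration and reports referenced files that do not exist. The helper names are hypothetical; it assumes paths are relative to the config folder, as in the entries above, and that several paths in one entry are separated by semicolons, as IK's README shows.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def dictionary_entries(xml_text):
    """Map each <entry key="..."> to the list of paths it names."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.iter("entry"):
        # Split on ';' so that an entry may list more than one file.
        paths = [p.strip() for p in (entry.text or "").split(";") if p.strip()]
        entries[entry.get("key")] = paths
    return entries


def missing_files(config_dir, entries):
    """Return the configured paths that are not files under config_dir."""
    return [
        path
        for paths in entries.values()
        for path in paths
        if not (Path(config_dir) / path).is_file()
    ]
```

Running `missing_files(config_dir, dictionary_entries(xml_text))` before restarting Elasticsearch should return an empty list when both dictionaries are in place.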

    3. Restart the Elasticsearch service and run the earlier example again:

      <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
      {
      "analyzer": "ik_smart",
      "text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
      }

      Result:

      {
      "tokens" : [
      {
      "token" : "十三届全国人大",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
      },
      {
      "token" : "三次",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
      },
      {
      "token" : "会议",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 2
      },
      {
      "token" : "表决",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 3
      },
      {
      "token" : "通过了",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 4
      },
      {
      "token" : "民法典",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 5
      },
      {
      "token" : "2021年",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "TYPE_CQUAN",
      "position" : 6
      },
      {
      "token" : "1月",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "TYPE_CQUAN",
      "position" : 7
      },
      {
      "token" : "1日",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "TYPE_CQUAN",
      "position" : 8
      },
      {
      "token" : "施行",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 9
      }
      ]
      }
      </pre>

    Building an Index with IK

    1. Create an index that uses the IK analyzers

      <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">PUT news
      {
      "mappings": {
      "properties": {
      "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
      },
      "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
      }
      }
      }
      }

      View the index mapping:

      GET news/_mapping

      Result:

      {
      "news" : {
      "mappings" : {
      "properties" : {
      "content" : {
      "type" : "text",
      "analyzer" : "ik_max_word",
      "search_analyzer" : "ik_smart"
      },
      "title" : {
      "type" : "text",
      "analyzer" : "ik_max_word",
      "search_analyzer" : "ik_smart"
      }
      }
      }
      }
      }</pre>

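      Note that in the field mappings, title and content use ik_max_word as the analyzer (applied at index time): when the inverted index is built, text is split as finely as possible so that more searches can match. The search_analyzer (applied at query time) is ik_smart: the query text is split as coarsely as possible to keep results precise.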
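The same mapping can be built from a script when several text fields should share this analyzer pair. This is a sketch under assumptions: the helper names are hypothetical, and the base URL (for example `http://localhost:9200`) depends on your cluster. The function only constructs the HTTP request and does not send it.

```python
import json
import urllib.request


def ik_text_field():
    """Field mapping: fine-grained at index time, coarse-grained at query time."""
    return {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
    }


def create_index_request(base_url, index, fields):
    """Build (but do not send) the PUT request that creates the index."""
    body = {"mappings": {"properties": {name: ik_text_field() for name in fields}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running cluster; for the index in this article the call would be `create_index_request(base_url, "news", ["title", "content"])`.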

    2. Create test data

      <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n55" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">POST news/_bulk
      {"index": {}}
      {"title": "柳岩为何40岁也无人敢取?", "content": "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"}
      {"index": {}}
      {"title": "刘德华首当音乐老师", "content": "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"}
      {"index": {}}
      {"title": "奥巴马怒怼特朗普抗疫不力", "content": "奥巴马现身费城的竞选集会并发表讲话,他对特朗普四年的执政工作进行了猛烈攻击,谴责特朗普政府抗疫不力,搞砸美国经济。"}
      {"index": {}}
      {"title": "韩星柳真怀孕4个月喜迎二胎", "content": "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"}</pre>
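The bulk request body is newline-delimited JSON: an action line followed by a document line, repeated, with a newline after the last line. When the documents come from a program rather than a console, the body can be generated as in this sketch (`bulk_body` is a hypothetical helper name):

```python
import json


def bulk_body(docs):
    """Serialize documents as _bulk NDJSON: an action line, then the source."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # let Elasticsearch assign the _id
        # ensure_ascii=False keeps Chinese text readable in the payload.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent as the body of `POST news/_bulk`, UTF-8 encoded and with the `Content-Type: application/x-ndjson` header.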

      Test example 1

      <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n57" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
      {
      "query": {
      "match": {
      "title": "刘德华"
      }
      }
      }

      Result:

      {
      "took" : 26,
      "timed_out" : false,
      "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
      },
      "hits" : {
      "total" : {
      "value" : 1,
      "relation" : "eq"
      },
      "max_score" : 1.547678,
      "hits" : [
      {
      "_index" : "news",
      "_type" : "_doc",
      "_id" : "l_JQC3gB8u3smGzBUQjj",
      "_score" : 1.547678,
      "_source" : {
      "title" : "刘德华首当音乐老师",
      "content" : "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"
      }
      }
      ]
      }
      }</pre>

      Test example 2

      <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n59" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
      {
      "query": {
      "match": {
      "title": "柳岩"
      }
      }
      }

      Result:

      {
      "took" : 11,
      "timed_out" : false,
      "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
      },
      "hits" : {
      "total" : {
      "value" : 2,
      "relation" : "eq"
      },
      "max_score" : 1.875202,
      "hits" : [
      {
      "_index" : "news",
      "_type" : "_doc",
      "_id" : "lvJQC3gB8u3smGzBUQjj",
      "_score" : 1.875202,
      "_source" : {
      "title" : "柳岩为何40岁也无人敢取?",
      "content" : "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"
      }
      },
      {
      "_index" : "news",
      "_type" : "_doc",
      "_id" : "mfJQC3gB8u3smGzBUQjj",
      "_score" : 0.6017173,
      "_source" : {
      "title" : "韩星柳真怀孕4个月喜迎二胎",
      "content" : "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"
      }
      }
      ]
      }
      }
      </pre>

      Analysis of test example 2: searching for “柳岩” returned two hits, one for 柳岩 and one for 柳真. Analyzing the query text shows why:

      <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
      {
      "analyzer": "ik_smart",
      "text": ["柳岩"]
      }

      Result:

      {
      "tokens" : [
      {
      "token" : "柳",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
      },
      {
      "token" : "岩",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
      }
      ]
      }</pre>

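      The analyzer split 柳岩 into the two single characters “柳” and “岩” and searched for each of them. Because the character “柳” matches on its own, both the 柳岩 document and the 柳真 document were returned.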
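The behaviour follows from how a match query combines query tokens: by default a document matches if it contains any of them. The toy model below illustrates this. The token sets are hand-written assumptions, not real analyzer output, and `match` is a simplified stand-in for the Elasticsearch query.

```python
# Assumed index-time tokens for the two titles (illustrative only).
TITLE_TOKENS = {
    "柳岩为何40岁也无人敢取?": {"柳", "岩", "为何", "40", "岁", "无人"},
    "韩星柳真怀孕4个月喜迎二胎": {"韩", "星", "柳", "真", "怀孕", "二胎"},
}


def match(query_tokens, index, operator="or"):
    """Return titles containing any ("or") or all ("and") of the query tokens."""
    combine = all if operator == "and" else any
    return [
        title
        for title, tokens in index.items()
        if combine(token in tokens for token in query_tokens)
    ]


print(match(["柳", "岩"], TITLE_TOKENS))          # both titles
print(match(["柳", "岩"], TITLE_TOKENS, "and"))   # only the 柳岩 title
```

Two ways to narrow the real search follow from this: the match query accepts `"operator": "and"`, and adding 柳岩 to the custom dictionary (followed by a restart and reindexing) should make the analyzer emit it as a single token.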

    Dynamically Updating Index Data
