
Using and Configuring ES Analyzers

Author: 小P聊技术 | Published 2021-03-06 11:27

    1 Introduction

    This article mainly covers the basic API operations for analysis (_analyze) requests, issued here with Postman. Every request URL is prefixed with the Elasticsearch deployment address, i.e., IP address plus port (for example, http://192.168.51.4:9200).
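    The same requests can also be sent from the command line with curl; the sketch below assumes the example address above and the analyze_demo index used throughout this article:

    # Check that the node is reachable (address is the example value above)
    curl -X GET "http://192.168.51.4:9200"

    # Any of the _analyze requests below can be sent like this
    curl -X POST "http://192.168.51.4:9200/analyze_demo/_analyze" \
         -H 'Content-Type: application/json' \
         -d '{"analyzer": "standard", "text": "Tic is a 善良的好人"}'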

    2 Built-in Analyzers

    Analyzer | Description
    Standard Analyzer | The default analyzer; splits text on word boundaries and lowercases terms
    Simple Analyzer | Splits on any non-letter character (symbols are dropped) and lowercases terms
    Stop Analyzer | Lowercases terms and removes stop words (the, a, is, ...)
    Whitespace Analyzer | Splits on whitespace; does not lowercase
    Keyword Analyzer | Does not tokenize; the whole input becomes a single term (see the example below)
    Pattern Analyzer | Splits with a regular expression, \W+ (non-word characters) by default
    Language | Ready-made analyzers for more than 30 common languages
    Custom Analyzer | A user-defined analyzer
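    Of these, the standard, simple, and whitespace analyzers are demonstrated individually below. As a quick sketch for one of the others, the keyword analyzer can be tried through the same _analyze endpoint used in the rest of this article; the expected response is a single token containing the entire input unchanged.

    POST /analyze_demo/_analyze
    {
      "analyzer": "keyword",
      "text":     "Tic is a 善良的好人"
    }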

    2.1 Standard Analyzer

    standard is the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

    2.1.1 Example

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "standard",
      "text":     "Tic is a 善良的好人 "
    }
    

    Response

    {
        "tokens": [
            {
                "token": "tic",
                "start_offset": 0,
                "end_offset": 3,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "is",
                "start_offset": 4,
                "end_offset": 6,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "a",
                "start_offset": 7,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 2
            },
            {
                "token": "善",
                "start_offset": 9,
                "end_offset": 10,
                "type": "<IDEOGRAPHIC>",
                "position": 3
            },
            {
                "token": "良",
                "start_offset": 10,
                "end_offset": 11,
                "type": "<IDEOGRAPHIC>",
                "position": 4
            },
            {
                "token": "的",
                "start_offset": 11,
                "end_offset": 12,
                "type": "<IDEOGRAPHIC>",
                "position": 5
            },
            {
                "token": "好",
                "start_offset": 12,
                "end_offset": 13,
                "type": "<IDEOGRAPHIC>",
                "position": 6
            },
            {
                "token": "人",
                "start_offset": 13,
                "end_offset": 14,
                "type": "<IDEOGRAPHIC>",
                "position": 7
            }
        ]
    }
    

    The analyzer treats Chinese and English differently: English is split into words (on the spaces in this example) and lowercased, while Chinese is split into individual characters.

    2.1.2 Configuration

    The standard analyzer accepts the following parameters:

    • max_token_length : the maximum token length; tokens longer than this are split, default 255
    • stopwords : a predefined stop word list such as _english_, or an array of stop words; defaults to _none_
    • stopwords_path : the path of a file containing stop words
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_english_analyzer": {
              "type": "standard",       #设置分词器为standard
              "max_token_length": 5,    #设置分词最大为5
              "stopwords": "_english_"  #设置过滤词
            }
          }
        }
      }
    }
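    As a sketch of how these settings are applied in practice, they are supplied when an index is created, and the resulting analyzer can then be tested through that index's _analyze endpoint. The index name standard_demo below is just an illustration; with stopwords set to _english_, the token "the" should be removed, and tokens longer than five characters should be split.

    # Create an index with the configured analyzer (the index name is illustrative)
    curl -X PUT "http://192.168.51.4:9200/standard_demo" \
         -H 'Content-Type: application/json' \
         -d '{
           "settings": {
             "analysis": {
               "analyzer": {
                 "my_english_analyzer": {
                   "type": "standard",
                   "max_token_length": 5,
                   "stopwords": "_english_"
                 }
               }
             }
           }
         }'

    # Exercise the analyzer through the new index
    curl -X POST "http://192.168.51.4:9200/standard_demo/_analyze" \
         -H 'Content-Type: application/json' \
         -d '{"analyzer": "my_english_analyzer", "text": "The quick brown foxes"}'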
    

    2.2 Simple Analyzer

    The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and lowercases every term.

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "simple",
      "text":     "Tic is a 善良的好人 "
    }
    

    Response

    {
        "tokens": [
            {
                "token": "tic",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0
            },
            {
                "token": "is",
                "start_offset": 4,
                "end_offset": 6,
                "type": "word",
                "position": 1
            },
            {
                "token": "a",
                "start_offset": 7,
                "end_offset": 8,
                "type": "word",
                "position": 2
            },
            {
                "token": "善良的好人",
                "start_offset": 9,
                "end_offset": 14,
                "type": "word",
                "position": 3
            }
        ]
    }
    

    The text is split on non-letter boundaries (the spaces here), uppercase English is converted to lowercase, and non-English text is not segmented any further.

    2.3 Whitespace Analyzer

    Splits text on whitespace only.

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "whitespace",
      "text":     "Tic is a 善良的好人 "
    }
    

    Response

    {
        "tokens": [
            {
                "token": "Tic",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0
            },
            {
                "token": "is",
                "start_offset": 4,
                "end_offset": 6,
                "type": "word",
                "position": 1
            },
            {
                "token": "a",
                "start_offset": 7,
                "end_offset": 8,
                "type": "word",
                "position": 2
            },
            {
                "token": "善良的好人",
                "start_offset": 9,
                "end_offset": 14,
                "type": "word",
                "position": 3
            }
        ]
    }
    

    The text is split on whitespace, English keeps its original case (no lowercasing), and Chinese is not segmented any further.

    3 Chinese Analyzers

    The most commonly used Chinese analyzer is the IK analyzer.

    3.1 Downloading the IK Analyzer

    GitHub download: https://github.com/medcl/elasticsearch-analysis-ik

    CSDN download: https://download.csdn.net/download/qq_15769939/15465684

    3.2 Installing the IK Analyzer

    Upload the downloaded file to the /opt/module/software/ directory on the server.

    [root@localhost ~]# cd /opt/module/software/
    [root@localhost software]# ll
    total 289356
    -rw-r--r--. 1 root   root   288775500 Feb 22 21:45 elasticsearch-7.4.2-linux-x86_64.tar.gz
    -rw-r--r--. 1 root   root     4504487 Feb 24 13:30 elasticsearch-analysis-ik-7.4.2.zip
    
    [root@localhost software]# unzip elasticsearch-analysis-ik-7.4.2.zip -d /usr/local/elasticsearch-7.4.2/plugins/
    [root@localhost software]# cd /usr/local/elasticsearch-7.4.2/
    [root@localhost elasticsearch-7.4.2]# su esuser
    [esuser@localhost elasticsearch-7.4.2]$ jps
    28194 Jps
    26740 Elasticsearch
    [esuser@localhost elasticsearch-7.4.2]$ kill -9 26740
    
    
    [esuser@localhost elasticsearch-7.4.2]$ cd bin
    [esuser@localhost bin]$ ./elasticsearch -d
    

    If jps shows a running Elasticsearch process, kill it first; if there is none, simply start Elasticsearch directly.
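    After restarting, it can be useful to confirm that the plugin was actually loaded. A minimal check, using the example address from section 1, is to query the _cat/plugins API:

    # analysis-ik should be listed for each node on which it is installed
    curl -X GET "http://192.168.51.4:9200/_cat/plugins?v"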

    3.3 Using the IK Analyzer

    IK offers two segmentation granularities:

    ik_smart: performs the coarsest-grained segmentation

    ik_max_word: performs the finest-grained segmentation of the text

    3.3.1 ik_smart segmentation

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "ik_smart",
      "text":     "这个世界上的好人和坏人都是存在的"
    }
    

    Response

    {
        "tokens": [
            {
                "token": "这个",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "世界上",
                "start_offset": 2,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "的",
                "start_offset": 5,
                "end_offset": 6,
                "type": "CN_CHAR",
                "position": 2
            },
            {
                "token": "好",
                "start_offset": 6,
                "end_offset": 7,
                "type": "CN_CHAR",
                "position": 3
            },
            {
                "token": "人和",
                "start_offset": 7,
                "end_offset": 9,
                "type": "CN_WORD",
                "position": 4
            },
            {
                "token": "坏人",
                "start_offset": 9,
                "end_offset": 11,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "都是",
                "start_offset": 11,
                "end_offset": 13,
                "type": "CN_WORD",
                "position": 6
            },
            {
                "token": "存在",
                "start_offset": 13,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 7
            },
            {
                "token": "的",
                "start_offset": 15,
                "end_offset": 16,
                "type": "CN_CHAR",
                "position": 8
            }
        ]
    }
    

    3.3.2 ik_max_word segmentation

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "ik_max_word",
      "text":     "这个世界上的好人和坏人都是存在的"
    }
    

    Response

    {
        "tokens": [
            {
                "token": "这个",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "世界上",
                "start_offset": 2,
                "end_offset": 5,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "世界",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "上",
                "start_offset": 4,
                "end_offset": 5,
                "type": "CN_CHAR",
                "position": 3
            },
            {
                "token": "的",
                "start_offset": 5,
                "end_offset": 6,
                "type": "CN_CHAR",
                "position": 4
            },
            {
                "token": "好人",
                "start_offset": 6,
                "end_offset": 8,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "人和",
                "start_offset": 7,
                "end_offset": 9,
                "type": "CN_WORD",
                "position": 6
            },
            {
                "token": "坏人",
                "start_offset": 9,
                "end_offset": 11,
                "type": "CN_WORD",
                "position": 7
            },
            {
                "token": "都是",
                "start_offset": 11,
                "end_offset": 13,
                "type": "CN_WORD",
                "position": 8
            },
            {
                "token": "存在",
                "start_offset": 13,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 9
            },
            {
                "token": "的",
                "start_offset": 15,
                "end_offset": 16,
                "type": "CN_CHAR",
                "position": 10
            }
        ]
    }
    

    3.4 Custom Chinese Dictionary

    3.4.1 Configuring the custom dictionary

    [root@localhost config]# vi /usr/local/elasticsearch-7.4.2/plugins/ik/config/IKAnalyzer.cfg.xml 
    

    Set the location of the custom dictionary:

    <entry key="ext_dict">custom.dic<entry>
    
    [root@localhost config]# vi /usr/local/elasticsearch-7.4.2/plugins/ik/config/custom.dic
    
    吞噬星空
    大主宰
    老干妈
    
    [esuser@localhost config]$ /usr/local/elasticsearch-7.4.2/bin/elasticsearch -d
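    For reference, the ext_dict entry above sits inside the plugin's IKAnalyzer.cfg.xml properties file; the sketch below shows roughly what the full file looks like after the change (the other entries vary slightly between plugin versions and are included only for orientation).

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- custom dictionary file, relative to the plugin's config directory -->
        <entry key="ext_dict">custom.dic</entry>
        <!-- custom stop word file -->
        <entry key="ext_stopwords"></entry>
    </properties>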
    

    3.4.2 Example with the custom dictionary

    Method | Endpoint | Notes
    POST | /analyze_demo/_analyze | analyze_demo is the index name

    JSON request body

    {
      "analyzer": "ik_max_word",
      "text":     "我喜欢吃老干妈,喜欢看吞噬星空和大主宰"
    }
    

    Response

    {
        "tokens": [
            {
                "token": "我",
                "start_offset": 0,
                "end_offset": 1,
                "type": "CN_CHAR",
                "position": 0
            },
            {
                "token": "喜欢吃",
                "start_offset": 1,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "喜欢",
                "start_offset": 1,
                "end_offset": 3,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "吃",
                "start_offset": 3,
                "end_offset": 4,
                "type": "CN_CHAR",
                "position": 3
            },
            {
                "token": "老干妈",
                "start_offset": 4,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 4
            },
            {
                "token": "干妈",
                "start_offset": 5,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 5
            },
            {
                "token": "喜欢",
                "start_offset": 8,
                "end_offset": 10,
                "type": "CN_WORD",
                "position": 6
            },
            {
                "token": "看",
                "start_offset": 10,
                "end_offset": 11,
                "type": "CN_CHAR",
                "position": 7
            },
            {
                "token": "吞噬星空",
                "start_offset": 11,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 8
            },
            {
                "token": "吞噬",
                "start_offset": 11,
                "end_offset": 13,
                "type": "CN_WORD",
                "position": 9
            },
            {
                "token": "星空",
                "start_offset": 13,
                "end_offset": 15,
                "type": "CN_WORD",
                "position": 10
            },
            {
                "token": "和",
                "start_offset": 15,
                "end_offset": 16,
                "type": "CN_CHAR",
                "position": 11
            },
            {
                "token": "大主宰",
                "start_offset": 16,
                "end_offset": 19,
                "type": "CN_WORD",
                "position": 12
            },
            {
                "token": "主宰",
                "start_offset": 17,
                "end_offset": 19,
                "type": "CN_WORD",
                "position": 13
            }
        ]
    }
    

    4 Additional Information

    • Writing these posts takes real effort; a follow and a like from fellow developers would be much appreciated. Thanks!
