018. How Elasticsearch Analyzers Work and How to Use Them

Author: CoderJed | Published 2020-07-03 16:01

    1. Introduction to Analyzers

    • What is an analyzer?

      An analyzer is a tool that splits a piece of text into multiple terms according to certain rules, and normalizes (normalization) those terms along the way. For example:

      "hello tom and jerry" can be split into the 4 words "hello", "tom", "and", "jerry"

      Normalization means, for example, that in "hello tom & jerry" the character "&" can be converted to "and", and that when tokenizing an HTML fragment the tags are stripped first: "<span>hello</span>" -> "hello" (a quick sketch of this with the _analyze API follows the list below)

    • Commonly used built-in analyzers

      • standard analyzer
      • simple analyzer
      • whitespace analyzer
      • stop analyzer
      • language analyzer
      • pattern analyzer
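
    A quick way to see this normalization in action is to combine a tokenizer with the built-in html_strip character filter directly in the _analyze API; a minimal sketch:

    GET _analyze
    {
      "tokenizer": "standard",
      "char_filter": ["html_strip"],
      "text": "<span>hello</span>"
    }

    This strips the tags and yields the single token hello. Replacing characters such as "&" with "and" is done with a mapping character filter, shown in section 2.3.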

    1.1 standard analyzer

    The default analyzer: splits on any character that is not a letter or a digit, and lowercases the terms
    Test text: a*B!c d4e 5f 7-h
    Result: a, b, c, d4e, 5f, 7, h
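
    The corresponding _analyze request, in the same form used for the analyzers below:

    GET _analyze
    {
      "analyzer": "standard",
      "text": "a*B!c d4e 5f 7-h"
    }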

    {
      "tokens" : [
        {
          "token" : "a", # 分词后的单词
          "start_offset" : 0, # 在原文本中的起始位置
          "end_offset" : 1, # 原文本中的结束位置
          "type" : "<ALPHANUM>", # 单词类型:ALPHANUM(字母)、NUM(数字)
          "position" : 0 # 单词位置,是分出来的所有单词的第几个单词
        },
        {
          "token" : "b",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "c",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "d4e",
          "start_offset" : 6,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "5f",
          "start_offset" : 10,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "7",
          "start_offset" : 13,
          "end_offset" : 14,
          "type" : "<NUM>",
          "position" : 5
        },
        {
          "token" : "h",
          "start_offset" : 15,
          "end_offset" : 16,
          "type" : "<ALPHANUM>",
          "position" : 6
        }
      ]
    }
    

    1.2 simple analyzer

    Behavior: splits on any non-letter character and lowercases the terms
    Test text: a*B!c d4e 5f 7-h
    Result: a, b, c, d, e, f, h

    GET _analyze
    {
      "analyzer": "simple",
      "text": "a*B!c d4e 5f 7-h"
    }
    
    {
      "tokens" : [
        {
          "token" : "a",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "b",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "c",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "d",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "e",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "f",
          "start_offset" : 11,
          "end_offset" : 12,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "h",
          "start_offset" : 15,
          "end_offset" : 16,
        "type" : "word",
          "position" : 6
        }
      ]
    }
    

    1.3 whitespace analyzer

    Behavior: splits on whitespace characters only
    Test text: a*B!c D d4e 5f 7-h
    Result: a*B!c, D, d4e, 5f, 7-h

    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "a*B!c D d4e 5f 7-h"
    }
    
    {
      "tokens" : [
        {
          "token" : "a*B!c",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "D",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "d4e",
          "start_offset" : 8,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "5f",
          "start_offset" : 12,
          "end_offset" : 14,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "7-h",
          "start_offset" : 15,
          "end_offset" : 18,
          "type" : "word",
          "position" : 4
        }
      ]
    }
    

    1.4 stop analyzer

    Behavior: splits on non-letter characters, lowercases the terms, and removes stop words (by default the English stop words, such as the, a, an, this, of, at, etc.)
    Test text: The apple is red
    Result: apple, red

    GET _analyze
    {
      "analyzer": "stop",
      "text": "The apple is red"
    }
    
    {
      "tokens" : [
        {
          "token" : "apple",
          "start_offset" : 4,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "red",
          "start_offset" : 13,
          "end_offset" : 16,
          "type" : "word",
          "position" : 3
        }
      ]
    }
    

    1.5 language analyzer

    Behavior: tokenizes using the grammar of the specified language (english in this example); there is no built-in Chinese analyzer

    GET _analyze
    {
      "analyzer": "english",
      "text": "\"I'm Tony,\", he said, \"nice to meet you!\""
    }
    
    {
      "tokens" : [
        {
          "token" : "i'm",
          "start_offset" : 1,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "toni",
          "start_offset" : 5,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "he",
          "start_offset" : 13,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "said",
          "start_offset" : 16,
          "end_offset" : 20,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "nice",
          "start_offset" : 23,
          "end_offset" : 27,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "meet",
          "start_offset" : 31,
          "end_offset" : 35,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "you",
          "start_offset" : 36,
          "end_offset" : 39,
          "type" : "<ALPHANUM>",
          "position" : 7
        }
      ]
    }
    

    1.6 pattern analyzer

    Behavior: splits using the specified regular expression; the default is \W+, i.e. one or more characters that are neither letters nor digits

    GET _analyze
    {
      "analyzer": "pattern",
      "text": "The best 3-points shooter is Curry!"
    }
    
    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "best",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "3",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "points",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "shooter",
          "start_offset" : 18,
          "end_offset" : 25,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "is",
          "start_offset" : 26,
          "end_offset" : 28,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "curry",
          "start_offset" : 29,
          "end_offset" : 34,
          "type" : "word",
          "position" : 6
        }
      ]
    }
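
    The pattern is configurable when defining an analyzer in index settings; a minimal sketch (the index name pattern_test and analyzer name comma_analyzer are just examples):

    PUT /pattern_test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "comma_analyzer": {
              "type": "pattern",
              "pattern": ","
            }
          }
        }
      }
    }

    GET /pattern_test/_analyze
    {
      "analyzer": "comma_analyzer",
      "text": "Apple,Banana,Cherry"
    }

    This should produce the tokens apple, banana, and cherry, since the pattern analyzer also lowercases by default.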
    

    2. Using Analyzers

    2.1 Specifying an Analyzer for an Index

    Create a test index:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer_1": {
              "type": "whitespace"
            }
          }
        }
      },
      "mappings": {
        "_doc": {
          "properties": {
            "id": {
              "type": "keyword"
            },
            "name": {
              "type": "text"
            },
            "desc": {
              "type": "text",
              "analyzer": "my_analyzer_1"
            }
          }
        }
      }
    }
    

    Index a test document:

    PUT my_index/_doc/1
    {
      "id": "001",
      "name": "Curry",
      "desc": "The best 3-points shooter is Curry!"
    }
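
    Before querying, you can check how the desc field tokenizes its value with the field-level _analyze API:

    GET my_index/_analyze
    {
      "field": "desc",
      "text": "The best 3-points shooter is Curry!"
    }

    Because the field uses the whitespace analyzer, this yields the tokens The, best, 3-points, shooter, is, and Curry!, with case and punctuation preserved.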
    

    Query: because the desc field is tokenized with the whitespace analyzer, searching for curry matches nothing; you have to search for Curry! instead

    GET my_index/_search
    {
      "query": {
        "match": {
          "desc": "curry"
        }
      }
    }
    
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 0,
        "max_score" : null,
        "hits" : [ ]
      }
    }
    
    GET my_index/_search
    {
      "query": {
        "match": {
          "desc": "Curry!"
        }
      }
    }
    
    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "id" : "001",
              "name" : "Curry",
              "desc" : "The best 3-points shooter is Curry!"
            }
          }
        ]
      }
    }
    

    2.2 Changing Analyzer Settings

    # Create an index and configure an analyzer that enables stop words; the default standard analyzer does not use any stop words
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_standard": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }
    
    # Test
    GET /my_index/_analyze
    {
      "analyzer": "my_standard",
      "text": "a dog is in the house"
    }
    
    {
      "tokens": [
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }
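
    The stopwords parameter also accepts an explicit word list instead of a predefined set such as _english_; a minimal sketch (the index name my_index_2 is hypothetical):

    PUT /my_index_2
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_standard": {
              "type": "standard",
              "stopwords": ["a", "is", "in", "the"]
            }
          }
        }
      }
    }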
    
    

    2.3 Custom Analyzers

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": ["& => and"] # "$"转换为"and"
            }
          },
          "filter": {
            "my_filter": {
              "type": "stop",
              "stopwords": ["the", "a"] # 指定两个停用词
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "my_char_filter"], # 使用内置的html标签过滤和自定义的my_char_filter
              "tokenizer": "standard",
              "filter": ["lowercase", "my_filter"] # 使用内置的lowercase filter和自定义的my_filter
            }
          }
        }
      }
    }
    
    GET /my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
    }
    
    {
      "tokens": [
        {
          "token": "tomandjerry",
          "start_offset": 0,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "are",
          "start_offset": 10,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "friend",
          "start_offset": 16,
          "end_offset": 22,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "in",
          "start_offset": 23,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 30,
          "end_offset": 35,
          "type": "<ALPHANUM>",
          "position": 6
        },
        {
          "token": "haha",
          "start_offset": 42,
          "end_offset": 46,
          "type": "<ALPHANUM>",
          "position": 7
        }
      ]
    }
    

    2.4 Setting a Custom Analyzer for a Specific Type and Field

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
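
    With the mapping in place, the field-level _analyze API confirms that content now goes through my_analyzer:

    GET /my_index/_analyze
    {
      "field": "content",
      "text": "tom&jerry in the house"
    }

    Given the analyzer defined in 2.3, this should produce the tokens tomandjerry, in, and house.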
    

    3. Chinese Analyzers

    3.1 Introduction to Chinese Analyzers

    Elasticsearch's built-in analyzers cannot tokenize Chinese properly; every character becomes its own token. For example:

    GET _analyze
    {
      "analyzer": "standard",
      "text": "火箭明年总冠军"
    }
    
    {
      "tokens" : [
        {
          "token" : "火",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "箭",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "明",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "年",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        },
        {
          "token" : "总",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "<IDEOGRAPHIC>",
          "position" : 4
        },
        {
          "token" : "冠",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "<IDEOGRAPHIC>",
          "position" : 5
        },
        {
          "token" : "军",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "<IDEOGRAPHIC>",
          "position" : 6
        }
      ]
    }
    

    The result we would expect is 火箭, 明年, 总冠军, and that requires a Chinese analyzer.

    • Common Chinese analyzers
      • smartCN: a simple analyzer for Chinese or mixed Chinese and English text
      • IK: a smarter, more capable Chinese analyzer

    3.2 Installing smartCN

    bin/elasticsearch-plugin install analysis-smartcn
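
    The plugin version must match your Elasticsearch version. Should you need to uninstall it later, the same CLI offers a remove command:

    bin/elasticsearch-plugin remove analysis-smartcn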
    

    Restart the ES cluster when the installation finishes, then test:

    GET _analyze
    {
      "analyzer": "smartcn",
      "text": "火箭明年总冠军"
    }
    
    {
      "tokens" : [
        {
          "token" : "火箭",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "明年",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "总",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "冠军",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "word",
          "position" : 3
        }
      ]
    }
    

    3.3 Installing the IK Analyzer

    Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

    • Download the IK release that matches your ES version, elasticsearch-analysis-ik-x.x.x.zip

    • Create an ik directory under the ES plugins directory

      [giant@jd2 plugins]$ mkdir ik
      
    • Upload elasticsearch-analysis-ik-x.x.x.zip into the plugins/ik directory and unzip it

      [giant@jd2 ik]$ unzip elasticsearch-analysis-ik-6.6.0.zip
      
    • Delete the elasticsearch-analysis-ik-x.x.x.zip package

      [giant@jd2 ik]$ rm -rf elasticsearch-analysis-ik-6.6.0.zip
      [giant@jd2 ik]$ ll
      total 1428
      -rw-r--r-- 1 giant giant 263965 Jan 15 17:07 commons-codec-1.9.jar
      -rw-r--r-- 1 giant giant  61829 Jan 15 17:07 commons-logging-1.2.jar
      drwxr-xr-x 2 giant giant    299 Jan 15 17:07 config
      -rw-r--r-- 1 giant giant  54693 Jan 15 17:07 elasticsearch-analysis-ik-6.6.0.jar
      -rw-r--r-- 1 giant giant 736658 Jan 15 17:07 httpclient-4.5.2.jar
      -rw-r--r-- 1 giant giant 326724 Jan 15 17:07 httpcore-4.4.4.jar
      -rw-r--r-- 1 giant giant   1805 Jan 15 17:07 plugin-descriptor.properties
      -rw-r--r-- 1 giant giant    125 Jan 15 17:07 plugin-security.policy
      
    • Repeat the steps above on every ES node, then restart the ES cluster (alternatively, see the one-line install sketched below)
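
    As an alternative to unpacking the zip by hand, the IK project's README documents a one-line install through the plugin CLI; a sketch (adjust the version in the URL to match your ES):

    bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.0/elasticsearch-analysis-ik-6.6.0.zip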

    Test the IK analyzer:

    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "火箭明年总冠军"
    }
    
    {
      "tokens" : [
        {
          "token" : "火箭",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "明年",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "总冠军",
          "start_offset" : 4,
          "end_offset" : 7,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "冠军",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "CN_WORD",
          "position" : 3
        }
      ]
    }
    

    The IK plugin provides two analyzers, ik_max_word and ik_smart:

    • ik_max_word: splits the text at the finest possible granularity
    • ik_smart: splits the text at the coarsest granularity (see the comparison sketch below)
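
    For comparison, you can run ik_smart on the same text; with the stock dictionary this typically yields the coarser split 火箭 / 明年 / 总冠军, without the overlapping extra token 冠军 that ik_max_word emitted above (exact output depends on the dictionary shipped with your IK version):

    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "火箭明年总冠军"
    }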

    3.4 IK Analyzer Configuration Files

    • IKAnalyzer.cfg.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- configure your own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- configure your own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>
    
    • main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word defined here is kept together as a single token
    • quantifier.dic: words for units and measures
    • suffix.dic: common suffix words
    • surname.dic: Chinese surnames
    • stopword.dic: English stop words

    3.5 Custom Dictionaries

    • Custom dictionary: new buzzwords appear every year (网红, 蓝瘦香菇, 喊麦, 鬼畜, ...) that are generally not in IK's built-in dictionary; add the latest words to IK's dictionary yourself, then update the IKAnalyzer.cfg.xml configuration file

    • Custom stop-word dictionary: words such as 了, 的, 啥, 么 that we may not want to index or expose to search

      <entry key="ext_dict">custom/mydict.dic</entry>
      <entry key="ext_stopwords">custom/mystopdict.dic</entry>
      
    • Then restart ES for the changes to take effect

    • Test
    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "网红"
    }
    
    {
        "tokens": [
            {
                "token": "网",
                "start_offset": 0,
                "end_offset": 1,
                "type": "CN_CHAR",
                "position": 0
            },
            {
                "token": "红",
                "start_offset": 1,
                "end_offset": 2,
                "type": "CN_CHAR",
                "position": 1
            }
        ]
    }
    
    • Create the custom dictionary
    mkdir -p ${ELASTICSEARCH_HOME}/plugins/ik/config/custom
    touch ${ELASTICSEARCH_HOME}/plugins/ik/config/custom/mydict.dic
    # write the word 网红 into mydict.dic, one word per line
    # then edit IKAnalyzer.cfg.xml and add:
    <entry key="ext_dict">custom/mydict.dic</entry>
    
    • Restart ES and test again
    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "网红"
    }
    
    {
        "tokens": [
            {
                "token": "网红",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 0
            }
        ]
    }
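
    The custom stop-word dictionary is set up the same way; a minimal sketch (the file name mystopdict.dic matches the ext_stopwords entry shown earlier):

    touch ${ELASTICSEARCH_HOME}/plugins/ik/config/custom/mystopdict.dic
    # write one stop word per line, e.g. 了 and 的
    # then reference it in IKAnalyzer.cfg.xml and restart ES:
    <entry key="ext_stopwords">custom/mystopdict.dic</entry>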
    
