Elasticsearch custom analyzers

Author: jacksu on Jianshu | Published 2016-05-21 11:22, read 2810 times

    The standard analyzer splits English words and numbers on whitespace and splits Chinese text into single characters (unigrams), so an IP address ends up indexed as a single term. For example, "the 192.168.0.1" is tokenized as [the, 192.168.0.1], and a search for "192" finds nothing. If we want the IP split on "." so that fuzzy IP matching works and a search for "192" matches 192.168.0.1, we need to define our own analyzer. Let's see how.
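Why the search misses can be seen with a toy inverted-index lookup in plain Python (a simplified illustration, not Elasticsearch code): a query term only matches when it exactly equals an indexed term.

```python
# Toy illustration: an inverted index matches query terms to indexed terms exactly.
standard_terms = {"the", "192.168.0.1"}        # standard analyzer keeps the IP whole
print("192" in standard_terms)                 # False: a search for "192" finds nothing

split_terms = {"the", "192", "168", "0", "1"}  # after splitting the IP on "."
print("192" in split_terms)                    # True: now "192" matches
```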

    Creating the index and its settings via the REST API:

    PUT /my_index
    {
       "settings": {
          "analysis": {
             "analyzer": {
                "my_analyzer": {
                   "type": "custom",
                   "tokenizer": "standard",
                   "filter": ["word_delimiter"]
                }
             }
          }
       },
       "mappings": {
          "my_type": {
             "properties": {
                "title": {
                   "type": "string",
                   "analyzer": "my_analyzer",
                   "search_analyzer": "my_analyzer"
                }
             }
          }
       }
    }
    

    Creating the index and its settings via the Java API:

    // `mapping` holds the JSON body shown above (settings + mappings) as a String
    CreateIndexRequest createIndexRequest = new CreateIndexRequest(fullIndexName);
    createIndexRequest.source(mapping);
    CreateIndexResponse res = admin.create(createIndexRequest).actionGet();
    

    Testing the analyzer:

    curl -XGET 'http://localhost:9200/my_index/_analyze?pretty=1&analyzer=my_analyzer' -d '192.168.10.10'
    

    For example, tokenizing "我是huawei is 192.168.10.10",
    the standard analyzer returns:

    {
      "tokens" : [ {
        "token" : "我",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<IDEOGRAPHIC>",
        "position" : 0
      }, {
        "token" : "是",
        "start_offset" : 1,
        "end_offset" : 2,
        "type" : "<IDEOGRAPHIC>",
        "position" : 1
      }, {
        "token" : "huawei",
        "start_offset" : 2,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 2
      }, {
        "token" : "is",
        "start_offset" : 9,
        "end_offset" : 11,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "192.168.10.10",
        "start_offset" : 12,
        "end_offset" : 25,
        "type" : "<NUM>",
        "position" : 4
      } ]
    }
    

    my_analyzer returns:

    {
      "tokens" : [ {
        "token" : "我",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<IDEOGRAPHIC>",
        "position" : 0
      }, {
        "token" : "是",
        "start_offset" : 1,
        "end_offset" : 2,
        "type" : "<IDEOGRAPHIC>",
        "position" : 1
      }, {
        "token" : "huawei",
        "start_offset" : 2,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 2
      }, {
        "token" : "is",
        "start_offset" : 9,
        "end_offset" : 11,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "192",
        "start_offset" : 12,
        "end_offset" : 15,
        "type" : "<NUM>",
        "position" : 4
      }, {
        "token" : "168",
        "start_offset" : 16,
        "end_offset" : 19,
        "type" : "<NUM>",
        "position" : 5
      }, {
        "token" : "10",
        "start_offset" : 20,
        "end_offset" : 22,
        "type" : "<NUM>",
        "position" : 6
      }, {
        "token" : "10",
        "start_offset" : 23,
        "end_offset" : 25,
        "type" : "<NUM>",
        "position" : 7
      } ]
    }
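The splitting that word_delimiter performs on the IP token can be approximated in plain Python (a rough sketch of its default behavior on this input, not the filter's full rule set, which also handles case transitions, possessives, etc.):

```python
import re

def split_like_word_delimiter(token):
    # Simplified: split on runs of non-alphanumeric characters,
    # which is how word_delimiter breaks "192.168.10.10" apart.
    return [t for t in re.split(r"[^A-Za-z0-9]+", token) if t]

print(split_like_word_delimiter("192.168.10.10"))  # ['192', '168', '10', '10']
```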
    

    As the output shows, my_analyzer splits the IP address on "." and achieves our goal: a search for "192" now matches 192.168.10.10.
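To verify end to end, a match query against the title field (the field mapped to my_analyzer above) should now return documents containing the IP; a sketch of such a request:

    GET /my_index/my_type/_search
    {
       "query": {
          "match": { "title": "192" }
       }
    }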


        Link: https://www.haomeiwen.com/subject/yljwrttx.html