Elasticsearch custom analyzers

Author: jacksu on Jianshu | Published 2016-05-21 11:22, read 2810 times

    The standard analyzer splits English words and numbers on whitespace and splits Chinese text into single characters (unigrams), so an IP address ends up indexed as a single term. For example, "the 192.168.0.1" is tokenized as [the, 192.168.0.1], and a search for "192" finds nothing. If we want the IP split on "." so that fuzzy IP matching works and a search for "192" matches 192.168.0.1, we need to define our own analyzer. Let's see how.
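Why the search misses can be seen with a toy inverted-index lookup in plain Python (a simplified illustration, not Elasticsearch code): a query term only matches when it exactly equals an indexed term.

```python
# Toy illustration: an inverted index matches query terms to indexed terms exactly.
standard_terms = {"the", "192.168.0.1"}        # standard analyzer keeps the IP whole
print("192" in standard_terms)                 # False: a search for "192" finds nothing

split_terms = {"the", "192", "168", "0", "1"}  # after splitting the IP on "."
print("192" in split_terms)                    # True: now "192" matches
```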

    Creating the index and its settings via the REST API:

    PUT /my_index
    {
       "settings": {
          "analysis": {
             "analyzer": {
                "my_analyzer": {
                   "type": "custom",
                   "tokenizer": "standard",
                   "filter": ["word_delimiter"]
                }
             }
          }
       },
       "mappings": {
          "my_type": {
             "properties": {
                "title": {
                   "type": "string",
                   "analyzer": "my_analyzer",
                   "search_analyzer": "my_analyzer"
                }
             }
          }
       }
    }
    

    Creating the index and its settings via the Java API:

    // `mapping` holds the JSON body shown above (settings + mappings) as a String
    CreateIndexRequest createIndexRequest = new CreateIndexRequest(fullIndexName);
    createIndexRequest.source(mapping);
    CreateIndexResponse res = admin.create(createIndexRequest).actionGet();
    

    Testing the analyzer:

    curl -XGET 'http://localhost:9200/my_index/_analyze?pretty=1&analyzer=my_analyzer' -d '192.168.10.10'
    

    For example, tokenizing "我是huawei is 192.168.10.10",
    the standard analyzer returns:

    {
      "tokens" : [ {
        "token" : "我",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<IDEOGRAPHIC>",
        "position" : 0
      }, {
        "token" : "是",
        "start_offset" : 1,
        "end_offset" : 2,
        "type" : "<IDEOGRAPHIC>",
        "position" : 1
      }, {
        "token" : "huawei",
        "start_offset" : 2,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 2
      }, {
        "token" : "is",
        "start_offset" : 9,
        "end_offset" : 11,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "192.168.10.10",
        "start_offset" : 12,
        "end_offset" : 25,
        "type" : "<NUM>",
        "position" : 4
      } ]
    }
    

    my_analyzer returns:

    {
      "tokens" : [ {
        "token" : "我",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<IDEOGRAPHIC>",
        "position" : 0
      }, {
        "token" : "是",
        "start_offset" : 1,
        "end_offset" : 2,
        "type" : "<IDEOGRAPHIC>",
        "position" : 1
      }, {
        "token" : "huawei",
        "start_offset" : 2,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 2
      }, {
        "token" : "is",
        "start_offset" : 9,
        "end_offset" : 11,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "192",
        "start_offset" : 12,
        "end_offset" : 15,
        "type" : "<NUM>",
        "position" : 4
      }, {
        "token" : "168",
        "start_offset" : 16,
        "end_offset" : 19,
        "type" : "<NUM>",
        "position" : 5
      }, {
        "token" : "10",
        "start_offset" : 20,
        "end_offset" : 22,
        "type" : "<NUM>",
        "position" : 6
      }, {
        "token" : "10",
        "start_offset" : 23,
        "end_offset" : 25,
        "type" : "<NUM>",
        "position" : 7
      } ]
    }
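The splitting that word_delimiter performs on the IP token can be approximated in plain Python (a rough sketch of its default behavior on this input, not the filter's full rule set, which also handles case transitions, possessives, etc.):

```python
import re

def split_like_word_delimiter(token):
    # Simplified: split on runs of non-alphanumeric characters,
    # which is how word_delimiter breaks "192.168.10.10" apart.
    return [t for t in re.split(r"[^A-Za-z0-9]+", token) if t]

print(split_like_word_delimiter("192.168.10.10"))  # ['192', '168', '10', '10']
```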
    

    As the output shows, my_analyzer splits the IP address on "." and achieves our goal: a search for "192" now matches 192.168.10.10.
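To verify end to end, a match query against the title field (the field mapped to my_analyzer above) should now return documents containing the IP; a sketch of such a request:

    GET /my_index/my_type/_search
    {
       "query": {
          "match": { "title": "192" }
       }
    }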


        Link: https://www.haomeiwen.com/subject/yljwrttx.html