Elasticsearch 7.x In Depth [5]: analyze API

Author: 孙瑞锴 | Published 2020-05-03 18:49

    1. Sources

    Geek Time (极客时间): "Elasticsearch 核心技术与实战" by 阮一鸣 (Ruan Yiming)
    Elasticsearch analyzers
    A comparison of Elasticsearch's default analyzer and Chinese analyzers, and how to use them
    Elasticsearch series: using a Chinese analyzer
    Official docs: character filters
    Official docs: tokenizers
    Official docs: token filters

    2. Getting Started

    I. The _analyze API

    Method 1: specify an analyzer

    GET /_analyze
    {
      "analyzer": "ik_max_word",
      "text": "Hello Lady, I'm Elasticsearch ^_^"
    }
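
    Note that ik_max_word is not built into Elasticsearch; it comes from the IK analysis plugin (elasticsearch-analysis-ik) and must be installed separately. Without the plugin, the same request works out of the box with a built-in analyzer — a minimal sketch:

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Hello Lady, I'm Elasticsearch ^_^"
    }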
    

    Method 2: specify an index and a field

    GET /tmdb_movies/_analyze
    {
      "field": "title",
      "text": "Basketball with cartoon alias"
    }
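
    This form resolves the analyzer from the field's mapping, so the index and the field must already exist. A minimal sketch of such an index (the name tmdb_movies is taken from the example above; giving title the built-in english analyzer is just an assumption for illustration):

    PUT /tmdb_movies
    {
      "mappings": {
        "properties": {
          "title": { "type": "text", "analyzer": "english" }
        }
      }
    }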
    

    Method 3: assemble the analysis chain ad hoc (tokenizer + token filters)

    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "Machine Building Industry Epoch"
    }
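
    For debugging it can help to see what each stage of the chain did. The _analyze API accepts an explain flag that returns per-stage details (character filters, tokenizer, token filters and their token attributes) — a sketch based on the request above:

    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "Machine Building Industry Epoch",
      "explain": true
    }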
    

    II. The Components of an Analyzer

    • An analyzer is made up of three parts: character filters, a tokenizer, and token filters

    Character filter

    Character filters preprocess the raw text before it reaches the tokenizer. Several can be configured in a chain, and they affect the position and offset information the tokenizer later produces.
    Elasticsearch ships with a few built-in character filters:

    • HTML Strip
      removes HTML tags
    • Mapping
      replaces matched characters or strings
    • Pattern Replace
      regex-based replacement

    Examples

    html_strip
    GET _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<br>you know, for search</br>"
    }
    
    • Result
    {
      "tokens" : [
        {
          "token" : """
    
    you know, for search
    
    """,
          "start_offset" : 0,
          "end_offset" : 29,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    mapping
    GET _analyze
    {
      "tokenizer": "whitespace",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": ["- => "]
        },
        "html_strip"
      ],
      "text": "<br>中国-北京 中国-台湾 中国-人民</br>"
    }
    
    • Result
    {
      "tokens" : [
        {
          "token" : "中国北京",
          "start_offset" : 4,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "中国台湾",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "中国人民",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 2
        }
      ]
    }
    
    pattern_replace
    GET /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": [
        {
          "type": "pattern_replace",
          "pattern": "https?://(.*)",
          "replacement": "$1"
        }
      ],
      "text": "https://www.elastic.co"
    }
    
    • Result
    {
      "tokens" : [
        {
          "token" : "www.elastic.co",
          "start_offset" : 0,
          "end_offset" : 22,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    Tokenizer

    The tokenizer splits the raw text into terms (tokens) according to a set of rules.
    Elasticsearch ships with a number of built-in tokenizers:

    • standard
    • letter
    • lowercase
    • whitespace
    • uax_url_email
    • classic
    • thai
    • ngram
    • edge_ngram
    • keyword
    • pattern
    • simple_pattern
    • char_group
    • simple_pattern_split
    • path_hierarchy

    Example

    path_hierarchy
    GET /_analyze
    {
      "tokenizer": "path_hierarchy",
      "text": ["/usr/local/bin/java"]
    }
    
    • Result
    {
      "tokens" : [
        {
          "token" : "/usr",
          "start_offset" : 0,
          "end_offset" : 4,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/local",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/local/bin",
          "start_offset" : 0,
          "end_offset" : 14,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/local/bin/java",
          "start_offset" : 0,
          "end_offset" : 19,
          "type" : "word",
          "position" : 0
        }
      ]
    }
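
    Another tokenizer from the list above worth trying is uax_url_email, which behaves like standard but keeps URLs and email addresses as single tokens — a sketch (the sample text is made up):

    GET /_analyze
    {
      "tokenizer": "uax_url_email",
      "text": ["write to john.doe@example.com or visit https://www.elastic.co"]
    }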
    

    Token filter

    Token filters post-process the terms emitted by the tokenizer, e.g. adding, modifying, or removing tokens.
    Elasticsearch ships with many built-in token filters:

    • lowercase
    • stop
    • uppercase
    • reverse
    • length
    • ngram
    • edge_ngram
    • pattern_replace
    • trim
    • ... (see the official docs for the full list; only the ones used here are listed)

    Example

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop"],
      "text": ["how are you i am fine thank you"]
    }
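
    Token filters can also be defined inline in the _analyze request instead of being referenced by name, which is handy for experimenting with parameters — a sketch, where the custom stopword list is an arbitrary choice for illustration:

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "lowercase",
        {
          "type": "stop",
          "stopwords": ["you", "i", "am"]
        }
      ],
      "text": ["how are you i am fine thank you"]
    }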
    

    III. Defining a Custom Analyzer

    Building a custom analyzer simply amounts to defining your own char_filter, tokenizer, and filter (token filter) pieces and wiring them together:

    DELETE /my_analysis
    PUT /my_analysis
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": [
                "my_char_filter"
                ],
              "tokenizer": "my_tokenizer",
              "filter": [
                "my_tokenizer_filter"
                ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": ["_ => "]
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "pattern",
              "pattern": "[,.!? ]"
            }
          },
          "filter": {
            "my_tokenizer_filter": {
              "type": "stop",
              "stopwords": "_english_"
            }
          }
        }
      }
    }
    
    POST /my_analysis/_analyze
    {
      "analyzer": "my_analyzer",
      "text": ["Hello Kitty!, A_n_d you?"]
    }
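
    In practice a custom analyzer is defined so that it can be attached to a field in the index mapping; a sketch of adding such a field to the index above (the field name content is hypothetical):

    PUT /my_analysis/_mapping
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }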
    

    3. Done
