Elasticsearch Analyzers

Author: DimonHo | Published 2019-10-30 18:23

    Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis.html

    An analyzer (analyzer) consists of the following parts (a combined sketch follows this list):

    Analyzer type (type): custom
    Character filters (char_filter): zero or more
    Tokenizer (tokenizer): exactly one
    Token filters (filter): zero or more, applied in order
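    For illustration, here is a minimal sketch of a custom analyzer wiring these three kinds of components together (the index name my_test_index and the analyzer name my_custom_analyzer are placeholders for this example; the referenced char_filter, tokenizer and filters are all built-in):

    PUT my_test_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip"],            // zero or more character filters
              "tokenizer": "standard",                  // exactly one tokenizer
              "filter": ["lowercase", "asciifolding"]   // zero or more token filters, applied in order
            }
          }
        }
      }
    }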

    Character filters

    Character filters (also called pre-processing filters) pre-process the character stream before it is handed to the tokenizer.
    There are three built-in character filters:

    1. html_strip: HTML tag character filter

    Behavior:

    a. Strips HTML tags from the original text.

    Optional settings:

    escaped_tags: an array of HTML tags that should not be stripped from the original text.

    example:

    GET _analyze
    {
      "tokenizer":      "keyword", 
      "char_filter":  [ "html_strip" ],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    
    {
      "tokens": [
        {
          "token": """
    
    I'm so happy!
    
    """,
          "start_offset": 0,
          "end_offset": 32,
          "type": "word",
          "position": 0
        }
      ]
    }
    
    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": ["my_char_filter"]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "html_strip",
              "escaped_tags": ["b"]  // do not strip <b></b> tags from the original text
            }
          }
        }
      }
    }
    
    GET my_index/_analyze
    {
      "analyzer":      "my_analyzer", 
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    
    {
      "tokens": [
        {
          "token": """
    
    I'm so <b>happy</b>!
    
    """,
          "start_offset": 0,
          "end_offset": 32,
          "type": "word",
          "position": 0
        }
      ]
    }
    

    2. mapping: mapping character filter

    Behavior:

    a. The mapping character filter accepts an array of key-value pairs. Whenever it encounters a string equal to one of the keys, it replaces it with the value associated with that key.
    b. Matching is greedy; the longest matching key wins.
    c. Replacing with an empty string is allowed.

    Optional settings:

    mappings: an array of key => value mappings
    mappings_path: a path to a file containing an array of key => value mappings

    example:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "& => and",
                "$ => ¥"
              ]
            }
          }
        }
      }
    }
    
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My license plate is $203 & $110"
    }
    
    {
      "tokens": [
        {
          "token": "My license plate is ¥203 and ¥110",
          "start_offset": 0,
          "end_offset": 31,
          "type": "word",
          "position": 0
        }
      ]
    }
    

    3. pattern_replace: regex-replace character filter

    Behavior:

    a. Matches characters with a regular expression and replaces them with the specified string.
    b. The replacement string can refer to capture groups in the regular expression.

    Optional settings:

    pattern: a Java regular expression. Required.
    replacement: the replacement string, which may reference capture groups using the $1..$9 syntax.
    flags: Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

    example:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }
    
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }
    
    {
      "tokens": [
        {
          "token": "My credit card is 123_456_789",
          "start_offset": 0,
          "end_offset": 29,
          "type": "word",
          "position": 0
        }
      ]
    }
    

    Tokenizers

    The 12 built-in tokenizers are covered below.

    1. standard: standard tokenizer

    Behavior:

    The standard tokenizer works well for European languages and supports Unicode.

    Optional settings:

    max_token_length: the maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

    ex:

    POST _analyze
    {
      "tokenizer": "standard",
      "text": "The 2 QUICK Brown-Foxes of dog's bone."
    }
    

    Result:

    [The, 2, QUICK, Brown, Foxes, of, dog's, bone]

    ex:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "standard",
              "max_token_length": 5
            }
          }
        }
      }
    }
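    To see the effect of max_token_length on this index, we can run a quick test (the expected tokens below assume the documented behavior of splitting over-long tokens at max_token_length intervals; the sample word is made up):

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "The quickbrownfoxes"
    }

    Expected result:

    [The, quick, brown, foxes]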
    

    2. letter: letter tokenizer

    Behavior:

    Splits the text whenever it encounters a character that is not a letter.

    Optional settings:

    Not configurable.

    ex:

    POST _analyze
    {
      "tokenizer": "letter",
      "text": "The 2 QUICK Brown-Foxes of dog's bone."
    }
    

    Result (note that the digit 2 is dropped, since it is not a letter):

    [The, QUICK, Brown, Foxes, of, dog, s, bone]

    3. lowercase: lowercase tokenizer

    Behavior:

    Can be thought of as the letter tokenizer combined with the lowercase token filter: it splits on non-letters and lowercases every token.

    Optional settings:

    Not configurable.
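    A quick check with the _analyze API (the sample text is reused from above; the output should be the letter-tokenizer result, lowercased):

    POST _analyze
    {
      "tokenizer": "lowercase",
      "text": "The 2 QUICK Brown-Foxes of dog's bone."
    }

    Expected result:

    [the, quick, brown, foxes, of, dog, s, bone]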

    4. whitespace: whitespace tokenizer

    Behavior:

    Splits the text whenever it encounters whitespace.

    Optional settings:

    Not configurable (in this version).
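    A quick check with the _analyze API (same sample text; note that punctuation stays attached to the tokens, because only whitespace splits them):

    POST _analyze
    {
      "tokenizer": "whitespace",
      "text": "The 2 QUICK Brown-Foxes of dog's bone."
    }

    Expected result:

    [The, 2, QUICK, Brown-Foxes, of, dog's, bone.]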

    5. uax_url_email: URL/email tokenizer

    Behavior:

    Similar to the standard tokenizer, but additionally recognizes URLs and email addresses as single tokens.

    Optional settings:

    max_token_length: defaults to 255

    ex:

    POST _analyze
    {
      "tokenizer": "uax_url_email",
      "text": "Email me at john.smith@global-international.com http://www.baidu.com"
    }
    

    Result:

    [Email, me, at, john.smith@global-international.com, http://www.baidu.com]

    6. classic: classic tokenizer

    Behavior:

    A grammar-based tokenizer designed for English. It handles English acronyms, company names, email addresses and most internet domain names well, but it does not work well for languages other than English.

    Optional settings:

    max_token_length: defaults to 255
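    For example, assuming the classic tokenizer keeps the email address together as described above, the output should be roughly [Email, me, at, john.smith@global-international.com]:

    POST _analyze
    {
      "tokenizer": "classic",
      "text": "Email me at john.smith@global-international.com"
    }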

    7. thai: Thai tokenizer

    Behavior:

    A tokenizer dedicated to the Thai language; it segments Thai text into words.
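    A quick check with the _analyze API (the Thai sentence below is the one used in the official documentation; the tokenizer should break it into the individual Thai words):

    POST _analyze
    {
      "tokenizer": "thai",
      "text": "การที่ได้ต้องแสดงว่างานดี"
    }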

    8. ngram: N-gram tokenizer

    Behavior:

    An N-gram works like a sliding window that moves across the word: a continuous character sequence of the specified length. N-grams are useful for querying languages that do not use spaces between words (for example German compounds or Chinese).

    Optional settings:

    min_gram: minimum length of the generated grams
    max_gram: maximum length of the generated grams
    token_chars: the character classes to keep in tokens; Elasticsearch splits the text on characters that do not belong to the configured classes. Defaults to [] (keep all characters).

    Allowed values for token_chars:

    letter — for example a, b, ï or 京
    digit — for example 3 or 7
    whitespace — for example " " or "\n"
    punctuation — for example ! or "
    symbol — for example $ or √

    ex:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer",
              "filter":["lowercase"]
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        }
      },
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
    
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "2 2311 Quick Foxes."
    }
    

    Tokenization result:

    [231, 2311, 311, qui, quic, quick, uic, uick, ick, fox, foxe, foxes, oxe, oxes, xes]

    9. edge_ngram: edge N-gram tokenizer

    The edge N-gram tokenizer differs from the ngram tokenizer in that its grams are always anchored to the start: ngram is typically used for match-anywhere search suggestions, while edge_ngram is typically used for prefix autocomplete.

    For example:

    POST _analyze
    {
      "tokenizer": "ngram",
      "text": "a Quick Foxes."
    }
    

    ngram tokenization result:

    ["a", "a ", " ", " Q", "Q", "Qu", "u", "ui", "i", "ic", "c", "ck", "k", "k ", " ", " F", "F", "Fo", "o", "ox", "x", "xe", "e", "es", "s", "s.", "."]

    POST _analyze
    {
      "tokenizer": "edge_ngram",
      "text": "a Quick Foxes."
    }
    

    edge_ngram tokenization result:

    ["a", "a "]

    From the test results above we can see that:

    By default, both ngram and edge_ngram treat "a Quick Foxes." as one continuous stream of characters.
    The default minimum and maximum gram lengths of both ngram and edge_ngram are 1 and 2.
    ngram is a fixed-size window sliding across the text (commonly used for search suggestions).
    edge_ngram keeps the start position fixed and grows the window from min_gram to max_gram (commonly used for prefix autocomplete); a sample autocomplete configuration follows this list.
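    As a sketch of such an autocomplete setup (the index name my_autocomplete_index and the analyzer/tokenizer names are placeholders; token_chars is set so that each word gets its own edge grams instead of the whole sentence):

    PUT my_autocomplete_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete_analyzer": {
              "tokenizer": "autocomplete_tokenizer",
              "filter": ["lowercase"]
            }
          },
          "tokenizer": {
            "autocomplete_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": ["letter", "digit"]
            }
          }
        }
      }
    }

    POST my_autocomplete_index/_analyze
    {
      "analyzer": "autocomplete_analyzer",
      "text": "Quick Foxes"
    }

    Expected result (prefixes of each word, lowercased):

    [qu, qui, quic, quick, fo, fox, foxe, foxes]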

    10. keyword: keyword tokenizer

    Behavior:

    The keyword tokenizer emits the entire input as a single token.

    Optional settings:

    buffer_size: defaults to 256
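    For example (the whole input comes back as one token):

    POST _analyze
    {
      "tokenizer": "keyword",
      "text": "The 2 QUICK Brown-Foxes of dog's bone."
    }

    Expected result:

    [The 2 QUICK Brown-Foxes of dog's bone.]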

    11. pattern: pattern tokenizer

    Behavior:

    Splits the text using a regular expression.

    Optional settings:

    pattern: a Java regular expression, defaults to \W+
    flags: Java regular expression flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS"
    group: which capture group to extract as tokens. Defaults to -1 (split).

    group defaults to -1, which means the regex matches are used as separators to split the text into tokens.
    group=0 keeps the strings matched by the whole regular expression as the tokens.
    group=1,2,3... keeps the text matched by the corresponding capture group () of the regular expression as the tokens.

    ex:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "pattern",
              "pattern": "\"(.*)\"",   
              "flags": "",
              "group": -1
            }
          }
        }
      }
    }
    

    Note: the pattern matches strings enclosed in double quotes. Note the difference between "\"(.*)\"" and "\".*\"": both match a double-quoted string, but the first one contains a capture group while the second one does not.

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "comma,\"separated\",values"
    }
    

    Tokenization results:

    With the default group of -1, the regex matches are used as separators, giving: ["comma", "values"]
    With group=0, the strings matched by the whole regex become the tokens, giving the matched string (quotes included): ["\"separated\""]
    With group=1, the text matched by the first capture group () becomes the token, giving: ["separated"]
    With group=2, the text matched by the second capture group () would become the token; since this regular expression has only one capture group, an exception is thrown.

    12. path_hierarchy: path hierarchy tokenizer

    The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits it on the path separator, and emits one term for each level of the tree.

    Optional settings:

    delimiter: the character used to split the path, defaults to /
    replacement: the character that replaces the delimiter in the emitted tokens, defaults to the value of delimiter
    buffer_size: maximum length of the path to split, defaults to 1024
    reverse: whether to emit the tokens in reverse order, defaults to false
    skip: number of initial tokens to skip, defaults to 0

    ex:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": "-",
              "replacement":"/",
              "reverse": false,
              "skip": 0
            }
          }
        }
      }
    }
    
    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "one-two-three-four"
    }
    
    {
      "tokens": [
        {
          "token": "one",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0
        },
        {
          "token": "one/two",
          "start_offset": 0,
          "end_offset": 7,
          "type": "word",
          "position": 0
        },
        {
          "token": "one/two/three",
          "start_offset": 0,
          "end_offset": 13,
          "type": "word",
          "position": 0
        },
        {
          "token": "one/two/three/four",
          "start_offset": 0,
          "end_offset": 18,
          "type": "word",
          "position": 0
        }
      ]
    }
    

    I. The 8 built-in analyzers:

    • standard analyzer: the default analyzer. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode® Standard Annex #29) and works well for most languages, but it handles Chinese poorly.
    POST _analyze
    {
      "analyzer":"standard",
      "text":"Geneva K. Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "k",
          "start_offset": 7,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "risk",
          "start_offset": 10,
          "end_offset": 14,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "issues",
          "start_offset": 15,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 3
        }
      ]
    }
    
    • simple analyzer: provides letter-based tokenization; it splits whenever it encounters a character that is not a letter and lowercases all tokens.
    POST _analyze
    {
      "analyzer":"simple",
      "text":"Geneva K. Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "word",
          "position": 0
        },
        {
          "token": "k",
          "start_offset": 7,
          "end_offset": 8,
          "type": "word",
          "position": 1
        },
        {
          "token": "risk",
          "start_offset": 10,
          "end_offset": 14,
          "type": "word",
          "position": 2
        },
        {
          "token": "issues",
          "start_offset": 15,
          "end_offset": 21,
          "type": "word",
          "position": 3
        }
      ]
    }
    
    • whitespace analyzer: provides whitespace-based tokenization; it splits whenever it encounters whitespace.
    POST _analyze
    {
      "analyzer":"whitespace",
      "text":"Geneva K. Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "Geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "word",
          "position": 0
        },
        {
          "token": "K.",
          "start_offset": 7,
          "end_offset": 9,
          "type": "word",
          "position": 1
        },
        {
          "token": "Risk-Issues",
          "start_offset": 10,
          "end_offset": 21,
          "type": "word",
          "position": 2
        }
      ]
    }
    
    • stop analyzer: the same as the simple analyzer, but it also removes stop words. By default it uses the _english_ stop word list.
    POST _analyze
    {
      "analyzer":"stop",
      "text":"Geneva K.of the Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "word",
          "position": 0
        },
        {
          "token": "k",
          "start_offset": 7,
          "end_offset": 8,
          "type": "word",
          "position": 1
        },
        {
          "token": "risk",
          "start_offset": 16,
          "end_offset": 20,
          "type": "word",
          "position": 4
        },
        {
          "token": "issues",
          "start_offset": 21,
          "end_offset": 27,
          "type": "word",
          "position": 5
        }
      ]
    }
    
    • keyword analyzer: a "noop" analyzer that returns the entire input string as a single token, i.e. it does not tokenize at all.
    POST _analyze
    {
      "analyzer":"keyword",
      "text":"Geneva K.of Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "Geneva K.of Risk-Issues ",
          "start_offset": 0,
          "end_offset": 24,
          "type": "word",
          "position": 0
        }
      ]
    }
    
    • pattern analyzer: tokenizes text with a regular expression. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (all non-word characters).
    POST _analyze
    {
      "analyzer":"pattern",
      "text":"Geneva K.of Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "word",
          "position": 0
        },
        {
          "token": "k",
          "start_offset": 7,
          "end_offset": 8,
          "type": "word",
          "position": 1
        },
        {
          "token": "of",
          "start_offset": 9,
          "end_offset": 11,
          "type": "word",
          "position": 2
        },
        {
          "token": "risk",
          "start_offset": 12,
          "end_offset": 16,
          "type": "word",
          "position": 3
        },
        {
          "token": "issues",
          "start_offset": 17,
          "end_offset": 23,
          "type": "word",
          "position": 4
        }
      ]
    }
    
    • language analyzers: a set of analyzers aimed at specific languages. The following languages are supported:

      arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

    POST _analyze
    {
      "analyzer":"english",  ## or another language analyzer, e.g. "french"
      "text":"Geneva K.of Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "k.of",
          "start_offset": 7,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "risk",
          "start_offset": 12,
          "end_offset": 16,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "issu",
          "start_offset": 17,
          "end_offset": 23,
          "type": "<ALPHANUM>",
          "position": 3
        }
      ]
    }
    
    • fingerprint analyzer: sorts the tokens, removes duplicates, and concatenates them back into a single token.
    POST _analyze
    {
      "analyzer":"fingerprint",
      "text":"Geneva K.of Risk-Issues "
    }
    
    {
      "tokens": [
        {
          "token": "geneva issues k.of risk",
          "start_offset": 0,
          "end_offset": 24,
          "type": "fingerprint",
          "position": 0
        }
      ]
    }
    

    II. Testing custom analyzers

    Using the _analyze API
    The _analyze API verifies how an analyzer analyzes text and can explain each step of the analysis process.

    text: the text to analyze
    explain: explain the analysis process
    char_filter: character filters
    tokenizer: tokenizer
    filter: token filters

    GET _analyze
    {
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
      "explain": true
    }
    
    {
      "detail": {
        "custom_analyzer": true,
        "charfilters": [
          {
            "name": "html_strip",
            "filtered_text": [
              """
    
    No dreams, why bother Beijing !
    
    """
            ]
          }
        ],
        "tokenizer": {
          "name": "standard",
          "tokens": [
            {
              "token": "No",
              "start_offset": 7,
              "end_offset": 9,
              "type": "<ALPHANUM>",
              "position": 0,
              "bytes": "[4e 6f]",
              "positionLength": 1
            },
            {
              "token": "dreams",
              "start_offset": 13,
              "end_offset": 23,
              "type": "<ALPHANUM>",
              "position": 1,
              "bytes": "[64 72 65 61 6d 73]",
              "positionLength": 1
            },
            {
              "token": "why",
              "start_offset": 25,
              "end_offset": 28,
              "type": "<ALPHANUM>",
              "position": 2,
              "bytes": "[77 68 79]",
              "positionLength": 1
            },
            {
              "token": "bother",
              "start_offset": 29,
              "end_offset": 35,
              "type": "<ALPHANUM>",
              "position": 3,
              "bytes": "[62 6f 74 68 65 72]",
              "positionLength": 1
            },
            {
              "token": "Beijing",
              "start_offset": 39,
              "end_offset": 50,
              "type": "<ALPHANUM>",
              "position": 4,
              "bytes": "[42 65 69 6a 69 6e 67]",
              "positionLength": 1
            }
          ]
        },
        "tokenfilters": [
          {
            "name": "lowercase",
            "tokens": [
              {
                "token": "no",
                "start_offset": 7,
                "end_offset": 9,
                "type": "<ALPHANUM>",
                "position": 0,
                "bytes": "[6e 6f]",
                "positionLength": 1
              },
              {
                "token": "dreams",
                "start_offset": 13,
                "end_offset": 23,
                "type": "<ALPHANUM>",
                "position": 1,
                "bytes": "[64 72 65 61 6d 73]",
                "positionLength": 1
              },
              {
                "token": "why",
                "start_offset": 25,
                "end_offset": 28,
                "type": "<ALPHANUM>",
                "position": 2,
                "bytes": "[77 68 79]",
                "positionLength": 1
              },
              {
                "token": "bother",
                "start_offset": 29,
                "end_offset": 35,
                "type": "<ALPHANUM>",
                "position": 3,
                "bytes": "[62 6f 74 68 65 72]",
                "positionLength": 1
              },
              {
                "token": "beijing",
                "start_offset": 39,
                "end_offset": 50,
                "type": "<ALPHANUM>",
                "position": 4,
                "bytes": "[62 65 69 6a 69 6e 67]",
                "positionLength": 1
              }
            ]
          }
        ]
      }
    }
    

    Normalizer

    Fields of type keyword only support exact matches, and those matches are case-sensitive. Sometimes we want exact matching on a keyword field to be case-insensitive; the normalizer solves exactly this problem.
    A normalizer is structured like an analyzer minus the tokenizer:

    Analyzer type (type): custom
    Character filters (char_filter): zero or more, applied in order
    Token filters (filter): zero or more, applied in order

    Here is an example borrowed from the official documentation:

    PUT index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "quote": {
              "type": "mapping",
              "mappings": [
                "« => \"",
                "» => \""
              ]
            }
          },
          "normalizer": {
            "my_normalizer": {
              "type": "custom",
              "char_filter": ["quote"],
              "filter": ["lowercase", "asciifolding"]
            }
          }
        }
      },
      "mappings": {
        "type": {
          "properties": {
            "foo": {
              "type": "keyword",  // normalizer can only be used on fields of type keyword
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
    
    PUT index/type/1
    {
      "foo": "Quick Frox"
    }

    GET index/type/_search
    {
      "query": {
        "match": {
          "foo": {
            "query": "quick Frox"   // case-insensitive: the document is found regardless of case
          }
        }
      }
    }
    

