ES7 Tokenizer

Author: 逸章 | Published 2020-05-02 23:14

    I. Examples

    1. The standard tokenizer ("tokenizer": "standard")

    It divides the text using the Unicode Text Segmentation algorithm (UAX #29).

    POST _analyze 
    {
        "tokenizer": "standard",
        "text": "Those who dare to fail miserably can achieve greatly."
    }
    
    Output: [Those, who, dare, to, fail, miserably, can, achieve, greatly]
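For plain ASCII text, the effect can be approximated in Python with a `\w+` scan. This is a rough sketch only; the real standard tokenizer implements the full UAX #29 segmentation rules:

```python
import re

def standard_tokenize(text):
    # Rough approximation for plain ASCII text; the real standard
    # tokenizer implements Unicode Text Segmentation (UAX #29).
    return re.findall(r"\w+", text)

print(standard_tokenize("Those who dare to fail miserably can achieve greatly."))
```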

    2. The letter tokenizer ("tokenizer": "letter")

    It breaks the text into terms whenever it encounters a character that is not a letter.

    POST _analyze
     {
       "tokenizer": "letter",
       "text": "You're a wizard, Harry."
     }
    

    Tokenized as:

    [You, re, a, wizard, Harry]
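An ASCII-only sketch of the same behavior (the real tokenizer recognizes all Unicode letters, not just A–Z):

```python
import re

def letter_tokenize(text):
    # Emit runs of letters; everything else is a separator.
    # ASCII-only approximation of the letter tokenizer.
    return re.findall(r"[A-Za-z]+", text)

print(letter_tokenize("You're a wizard, Harry."))
```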
    

    3. The lowercase tokenizer ("tokenizer": "lowercase")

    It works like the letter tokenizer, but also lowercases every term.

    POST _analyze
     {
       "tokenizer": "lowercase",
       "text": "You're a wizard, Harry."
     }
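The output is the letter tokenizer's output, lowercased: [you, re, a, wizard, harry]. As an ASCII-only sketch:

```python
import re

def lowercase_tokenize(text):
    # Letter-tokenizer behavior plus lowercasing, in one pass.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(lowercase_tokenize("You're a wizard, Harry."))
```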
    

    4. The whitespace tokenizer ("tokenizer": "whitespace")

    The whitespace tokenizer breaks the text into individual words whenever whitespace is encountered

    POST _analyze
     {
       "tokenizer": "whitespace",
       "text": "You're a wizard, Harry."
     }
    

    Output: [You're, a, wizard,, Harry.] (note the comma and period are kept, since only whitespace splits the text)
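Python's `str.split()` with no arguments does essentially the same thing, splitting on any run of whitespace:

```python
def whitespace_tokenize(text):
    # Split on any run of whitespace; punctuation stays attached.
    return text.split()

print(whitespace_tokenize("You're a wizard, Harry."))
```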

    5. The keyword tokenizer ("tokenizer": "keyword")

    It outputs the entire text as a single term.

    POST _analyze
     {
       "tokenizer": "keyword",
       "text": "Los Angeles"
     }
    
    Output: [Los Angeles]

    6. The pattern tokenizer ("tokenizer": "pattern")

    The pattern tokenizer uses a regular expression to divide the text or capture the matching text as terms.
    The default pattern is \W+

    POST _analyze
     {
       "tokenizer": "pattern",
       "text": "The foo_bar_size's default is 5."
     }
    
    Output: [The, foo_bar_size, s, default, is, 5] (the underscore is a word character, so foo_bar_size stays whole; the apostrophe splits off the s)
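The default behavior can be sketched with `re.split`, dropping the empty strings a trailing delimiter leaves behind (Elasticsearch drops them too):

```python
import re

def pattern_tokenize(text, pattern=r"\W+"):
    # group = -1 (the default): split on the pattern rather than
    # capturing it, and drop empty terms.
    return [t for t in re.split(pattern, text) if t]

print(pattern_tokenize("The foo_bar_size's default is 5."))
```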

    The following parameters are configurable:

    • The pattern parameter defaults to \W+.
    • The flags parameter accepts Java regular-expression flags (e.g. CASE_INSENSITIVE).
    • The group parameter selects which capture group to emit as tokens; it defaults to -1, which means the pattern is used to split rather than to capture.

    PUT my_index_name
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer_name": {
                        "tokenizer": "my_tokenizer_name"
                    }
                },
                "tokenizer": {
                    "my_tokenizer_name": {
                        "type": "pattern",
                        "pattern": [",", "|"]
                    }
                }
            }
        }
    }
    

    Test:

    POST my_index_name/_analyze 
    {
        "analyzer": "my_analyzer_name",
        "text": "comma, separated, values|one|two|three-four"
    }
    
    Result: [comma, separated, values, one, two, three-four]
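Assuming the intent is to split on commas (plus any following whitespace) and on pipes, the same result can be sketched in Python:

```python
import re

def pattern_tokenize(text, pattern):
    # Split on the pattern (group = -1) and drop empty terms.
    return [t for t in re.split(pattern, text) if t]

# Split on a comma plus optional whitespace, or on a pipe.
print(pattern_tokenize("comma, separated, values|one|two|three-four", r",\s*|\|"))
```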

    7. The simple pattern tokenizer ("type": "simple_pattern")

    Similar to the pattern tokenizer, but it supports only a single pattern (a restricted Lucene regular expression) and does not accept a set of split patterns, so it is usually faster than the pattern tokenizer. It extracts the matching text as terms rather than splitting on it.

    PUT my_index_name 
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer_name": {
                        "tokenizer": "my_tokenizer_name"
                    }
                },
                "tokenizer": {
                    "my_tokenizer_name": {
                        "type": "simple_pattern",
                        "pattern": "[0123456789]{3}"
                    }
                }
            }
        }
    }
    

    Test:

    POST my_index_name/_analyze
    {
        "analyzer": "my_analyzer_name",
        "text": "asta-313-267-847-mm-309"
    }
    
    Output: [313, 267, 847, 309]
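Since simple_pattern matches terms instead of splitting on them, its behavior resembles `re.findall`:

```python
import re

def simple_pattern_tokenize(text, pattern=r"[0-9]{3}"):
    # Extract every match of the pattern as a term;
    # text between matches is discarded.
    return re.findall(pattern, text)

print(simple_pattern_tokenize("asta-313-267-847-mm-309"))
```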

Original link: https://www.haomeiwen.com/subject/pjfrghtx.html