Multi-fields and Custom Analyzers

Author: 滴流乱转的小胖子 | Published 2020-07-08 06:00

    1. Multi-fields

    Exact matching on a field

    • Add a keyword sub-field

    Using different analyzers per sub-field (see the mapping sketch below)

    • Different languages
    • A pinyin sub-field for pinyin search
    • A different analyzer can also be specified for search than for indexing
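    A minimal mapping sketch (the index name products and the field title are made up for illustration; a real pinyin sub-field would require the analysis-pinyin plugin, so only keyword and english sub-fields are shown):

    PUT products
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword" },
              "english": { "type": "text", "analyzer": "english" }
            }
          }
        }
      }
    }

    Queries can then target title for general full-text search, title.keyword for exact matches, sorting, and aggregations, and title.english for English-stemmed matching. To use different analyzers at index and search time, set "search_analyzer" alongside "analyzer" on a text field.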

    2. Exact Values vs Full Text

    Exact values: numbers, dates, and literal strings (e.g. "Apple Store")

    • keyword in ES

    Full text: unstructured text data

    • text in ES



    Exact values do not need to be analyzed (demonstrated below)

    • ES creates an inverted index for every field
    • An exact value needs no special analysis at index time; it is indexed as-is
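    The difference is easy to see with the _analyze API, using the built-in keyword and standard analyzers:

    POST _analyze
    { "analyzer": "keyword", "text": "Apple Store" }
    // → a single token: "Apple Store"

    POST _analyze
    { "analyzer": "standard", "text": "Apple Store" }
    // → two tokens: "apple", "store"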


    An analyzer is made up of three kinds of building blocks:

    • Character Filters
    • Tokenizer
    • Token Filters

    3. Custom Analyzers

    When the analyzers that ship with ES are not enough, you can define a custom analyzer by combining these components yourself.

    3.1 Character Filters

    Process the text before it reaches the Tokenizer, e.g. to add, remove, or replace characters. Multiple Character Filters can be configured; they affect the position and offset information the Tokenizer produces.

    Some built-in Character Filters

    • HTML strip -- removes HTML tags
    • Mapping -- string replacement
    • Pattern replace -- regex-based replacement

    3.2 Tokenizer

    • Splits the raw text into words (terms or tokens) according to certain rules
    • Tokenizers built into ES:
      whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
    • You can also implement your own Tokenizer as a Java plugin

    3.3 Token Filters

    • Add, modify, or delete the terms output by the Tokenizer
    • Built-in Token Filters:
      lowercase / stop / synonym (adds synonyms; see the sketch after this list)
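    The synonym filter is not exercised in the examples further down, so here is a minimal sketch that defines an inline, request-only synonym rule treating quick and fast as equivalent:

    GET _analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        { "type": "synonym", "synonyms": [ "quick, fast" ] }
      ],
      "text": "The quick fox"
    }
    // → "quick" and "fast" are emitted at the same position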



    How analysis runs
    1. Overall execution order inside an analyzer:
       char filter -> tokenizer -> token filter
    2. When analysis happens:
       At index time, i.e., when data is written, in order to build the inverted index and make search fast. The original document is stored untouched in _source.


    Examples:
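    # Dynamic mapping maps the string field as "text" with a "keyword" sub-field,
    # which the GET /logs/_mapping below confirms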
    PUT logs/_doc/1
    {"level":"DEBUG"}
    
    GET /logs/_mapping
    
    POST _analyze
    {
      "tokenizer":"keyword",
      "char_filter":["html_strip"],
      "text": "<b>hello world</b>"
    }
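    // → html_strip removes the <b> tags; the keyword tokenizer then emits a single token: "hello world"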
    
    
    POST _analyze
    {
      "tokenizer":"path_hierarchy",
      "text":"/user/ymruan/a/b/c/d/e"
    }
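    // → emits every path prefix: /user, /user/ymruan, /user/ymruan/a, ..., /user/ymruan/a/b/c/d/e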
    
    # Use a char filter for character replacement
    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
          {
            "type" : "mapping",
            "mappings" : [ "- => _"]
          }
        ],
      "text": "123-456, I-test! test-990 650-555-1234"
    }
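    // → each "-" becomes "_" before tokenization: 123_456, I_test, test_990, 650_555_1234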
    
    // char filter to replace emoticons
    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
          {
            "type" : "mapping",
            "mappings" : [ ":) => happy", ":( => sad"]
          }
        ],
        "text": ["I am felling :)", "Feeling :( today"]
    }
    
    // whitespace tokenizer with stop and snowball filters
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop","snowball"],
      "text": ["The gilrs in China are playing this game!"]
    }
    
    
    // whitespace and stop
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop","snowball"],
      "text": ["The rain in Spain falls mainly on the plain."]
    }
    
    
    // with lowercase added, "The" is lowercased first and then removed as a stopword
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase","stop","snowball"],
      "text": ["The gilrs in China are playing this game!"]
    }
    
    // regular expression replacement (pattern_replace)
    GET _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
          {
            "type" : "pattern_replace",
            "pattern" : "http://(.*)",
            "replacement" : "$1"
          }
        ],
        "text" : "http://www.elastic.co"
    }
    

    Custom analyzer example

    The standard skeleton for defining a custom analyzer is:
    PUT /my_index
    {
        "settings": {
            "analysis": {
                "char_filter": { ... custom character filters ... },//字符过滤器
                "tokenizer": { ... custom tokenizers ... },//分词器
                "filter": { ... custom token filters ... }, //词单元过滤器
                "analyzer": { ... custom analyzers ... }
            }
        }
    }
    ============================ Example ===========================
    PUT /my_index
    {
        "settings": {
            "analysis": {
                "char_filter": {
                    "&_to_and": {
                        "type": "mapping",
                        "mappings": [ "&=> and "]
                }},
                "filter": {
                    "my_stopwords": {
                        "type": "stop",
                        "stopwords": [ "the", "a" ]
                }},
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "char_filter": [ "html_strip", "&_to_and" ],
                        "tokenizer": "standard",
                        "filter": [ "lowercase", "my_stopwords" ]
                }}
    }}}
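    You can exercise the analyzer directly before mapping any field to it; given the chain above (strip HTML, map & to and, standard tokenizer, lowercase, drop "the"/"a"), the expected tokens follow:

    GET /my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "The quick & brown fox"
    }
    // → tokens: quick, and, brown, fox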
    
    
    ============================ Example ===========================
    Suppose the custom analyzer is named my_analyzer; apply it to a newly added field in this index:
    PUT /my_index/_mapping
    {
      "properties": {
        "username": {
          "type": "text",
          "analyzer": "my_analyzer"
        },
        "password": {
          "type": "text"
        }
      }
    }
    ================= Insert data ====================
    PUT /my_index/_doc/1
    {
      "username": "The quick & brown fox ",
      "password": "The quick & brown fox "
    }
    ==== username uses the custom analyzer my_analyzer; password uses the default standard analyzer ====
    === Verify
    GET /my_index/_analyze
    {
      "field": "username",
      "text": "The quick & brown fox"
    }

    GET /my_index/_analyze
    {
      "field": "password",
      "text": "The quick & brown fox"
    }
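    // Expected: username (my_analyzer) → quick, and, brown, fox
    // Expected: password (standard)   → the, quick, brown, fox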
    // The official Definitive Guide explains this really well, though it covers the old Elasticsearch 2.x and some of its APIs no longer apply. Custom analyzers: https://www.elastic.co/guide/cn/elasticsearch/guide/cn/custom-analyzers.html
    
