ES7 Analysis

Author: 逸章 | Published 2020-05-01 15:24

Examples

1. The whitespace analyzer ("analyzer": "whitespace")

POST _analyze 
{
    "analyzer": "whitespace",
    "text": "This is my first program and these are my first 5 lines."
}
Tokens: [This, is, my, first, program, and, these, are, my, first, 5, lines.] (the whitespace analyzer splits on whitespace only, so case and punctuation are preserved)

2. The standard tokenizer ("tokenizer": "standard")

The lowercase filter below lowercases the resulting tokens, and asciifolding folds accented characters such as é into their ASCII equivalents:

POST _analyze 
{
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"],
    "text": "Is this déja vu?"
}
Tokens: [is, this, deja, vu]

3. Custom analyzer ("type": "custom")

The "type": "custom" below is the fixed way to declare a custom analyzer assembled from a tokenizer and filters; the alternative is to set type to the name of a built-in analyzer such as "standard" in order to configure it (see example 4).

PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
          "properties": {
                "my_sentence": {
                    "type": "text",
                    "analyzer": "custom_analyzer"
                }
            }
    }
}

It can be used like this:

GET my_index_name/_analyze
{
    "analyzer": "custom_analyzer",
    "text": "Is this déjà vu?"
}

It can also be used like this:

GET my_index_name/_analyze
{
    "field": "my_sentence",
    "text": "Is this déjà vu?"
}
Tokens: [is, this, deja, vu]

4. Configured standard analyzer ("type": "standard") with stopwords

PUT my_index2 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_personalized_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    }
}

Use it like this:

POST my_index2/_analyze
{
    "analyzer": "my_personalized_analyzer",
    "text": "This is my first program and these are my first 5 lines"
}

The tokens are: [my, first, progr, am, my, first, 5, lines]
Note that max_token_length: 5 splits program into progr and am. The standard analyzer does not stem, so lines keeps its trailing s; only a language analyzer such as english would stem it to line.
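As a quick check of that last point, here is a minimal sketch using the built-in english language analyzer, which does stem:

POST _analyze
{
    "analyzer": "english",
    "text": "This is my first program and these are my first 5 lines"
}

This should produce [my, first, program, my, first, 5, line]: the English stop words are removed and lines is stemmed to line.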

5. Configured standard analyzer ("type": "standard") without stopwords

Since my_index2 already exists from example 4, delete it first (DELETE my_index2); a PUT against an existing index fails.

PUT my_index2 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_personalized_analyzer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            }
        }
    }
}
POST my_index2/_analyze
{
    "analyzer": "my_personalized_analyzer",
    "text": "This is my first program and these are my first 5 lines"
}
Tokens: [this, is, my, first, progr, am, and, these, are, my, first, 5, lines]

Note that this and is are both kept, since no stop words are configured.

6. No type=xxx needed (it defaults to custom)

Define an analyzer that removes duplicate tokens, as shown below:

PUT standard_example 
{
    "settings": {
      "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": [
                        "remove_duplicates"
                    ]
                }
            }
        }
    }
}
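A quick usage check; note that remove_duplicates only drops tokens that are identical and in the same position (such as those emitted by synonym or keyword_repeat filters), so ordinary repeated words, which occupy different positions, are kept:

POST standard_example/_analyze
{
    "analyzer": "rebuilt_standard",
    "text": "one one two"
}

Here both one tokens survive, because they sit at positions 0 and 1.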

7. The simple analyzer ("analyzer": "simple")

The simple analyzer breaks text into tokens whenever it encounters a non-letter character. It cannot be configured and consists of only a lowercase tokenizer.
Here's an example:

POST _analyze 
{
    "analyzer": "simple",
    "text": "This is my first program and these are my first 5 lines."
}

The 5 and the trailing period are both discarded, and every token is lowercased.

Now rebuild it as a custom analyzer:

PUT simple_example 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_simple": {
                    "tokenizer": "lowercase",
                    "filter": []
                }
            }
        }
    }
}

Usage:

POST simple_example/_analyze
{
    "analyzer": "rebuilt_simple",
    "text": "one 2 three Five 5"
}
Tokens: [one, three, five]

8. The stop analyzer ("analyzer": "stop")

By default it uses the _english_ stop word list.

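A minimal sketch of the default behavior:

POST _analyze
{
    "analyzer": "stop",
    "text": "This is my first program and these are my first lines."
}

This should yield [my, first, program, my, first, lines], with the _english_ stop words removed.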
Of course, we can also customize it (if my_index_name still exists from example 3, delete it first with DELETE my_index_name):
PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "the_stop_analyzer": {
                    "type": "stop",
                    "stopwords": ["first", "and", "_english_"]
                }
            }
        }
    }
}

Test:

POST my_index_name/_analyze
{
    "analyzer": "the_stop_analyzer",
    "text": "This is my first program and these are my first lines."
}
Tokens: [my, program, my, lines]

The stop analyzer consists of a lowercase tokenizer and a stop token filter; both can be configured. A custom rebuild:

PUT stop_example 
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_" // can be overridden with the stopwords or stopwords_path parameters
                }
            },
            "analyzer": {
                "rebuilt_stop": {
                    "tokenizer": "lowercase",
                    "filter": [
                        "english_stop"
                    ]
                }
            }
        }
    }
}
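Usage, as a sketch:

POST stop_example/_analyze
{
    "analyzer": "rebuilt_stop",
    "text": "This is my first program"
}

Expected tokens: [my, first, program].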

9. The keyword analyzer ("analyzer": "keyword")

POST _analyze 
{
    "analyzer": "keyword",
    "text": "This is my first program and these are my first lines."
}
Tokens: [This is my first program and these are my first lines.] (the entire input is returned as a single token)

A custom rebuild:

PUT keyword_example 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_keyword": {
                    "tokenizer": "keyword",
                    "filter": []
                }
            }
        }
    }
}
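A quick check that the rebuilt version behaves the same:

POST keyword_example/_analyze
{
    "analyzer": "rebuilt_keyword",
    "text": "This is my first program"
}

The entire text should come back as a single token.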

10. The pattern analyzer ("analyzer": "pattern")

The pattern analyzer splits text into terms using a regular expression. The regular expression must match the token separators, not the tokens themselves. It defaults to \W+, meaning non-word characters act as the separators:

POST _analyze 
{
    "analyzer": "pattern",
    "text": "This is my first program and these are my first line."
}
Tokens: [this, is, my, first, program, and, these, are, my, first, line] (the pattern analyzer lowercases by default)

Configuring other parameters (delete my_index_name first if it already exists):

"analysis": {
            "analyzer": {
                "email_analyzer": {
                    "type": "pattern",
                    "pattern": "\\W|_",
                    "Lowercase": true
                }
            }
        }
    }
}
POST my_index_name/_analyze
{
 "analyzer": "email_analyzer",
 "text": "Jane_Smith@foo-bar.com"
}
Tokens: [jane, smith, foo, bar, com]

The custom rebuild:

PUT pattern_example 
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "split_on_non_word": {
                    "type": "pattern",
                    "pattern": "\\W+"
                }
            },
            "analyzer": {
              "rebuilt_pattern": {
                    "tokenizer": "split_on_non_word",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    }
}
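Usage, as a sketch:

POST pattern_example/_analyze
{
    "analyzer": "rebuilt_pattern",
    "text": "This is my first line."
}

Expected tokens: [this, is, my, first, line].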

11. Language analyzers

Language analyzers can be set to analyze the text of a specific language. They support custom stopwords, and the stem_exclusion parameter lets the user specify an array of lowercase words that should not be stemmed. The examples below use ik_smart and ik_max_word, which come from the IK Chinese analysis plugin (analysis-ik) rather than from Elasticsearch itself.
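stem_exclusion is not demonstrated elsewhere in this article, so here is a minimal sketch; the index name stem_exclusion_example is made up for illustration:

PUT stem_exclusion_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "english",
                    "stem_exclusion": ["lines"]
                }
            }
        }
    }
}

With this analyzer, lines is kept as-is instead of being stemmed to line. Back to the IK examples: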

POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["我是中国人"]
}

ik_smart performs the coarsest-grained segmentation; here it should produce [我, 是, 中国人].

The following gives a different result:

POST _analyze
{
    "analyzer": "ik_max_word",
    "text": ["我是中国人"]
}

ik_max_word performs the finest-grained segmentation; here it should produce [我, 是, 中国人, 中国, 国人].

Now a fuller configuration (again, delete my_index2 first if it still exists):

PUT my_index2
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik": {
                    "tokenizer": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text",
                "analyzer": "ik"
            }
        }
    }
}

Test:

POST my_index2/_analyze
{
  "analyzer": "ik",
  "text": ["我是中国人"]
}
Tokens: [我, 是, 中国人, 中国, 国人] (the ik analyzer defined above uses the ik_max_word tokenizer)

It can also be used like this:

POST my_index2/_analyze
{
    "field": "content",
    "text": ["我是中国人"]
}

An example of the mapping character filter (char_filter):

PUT char_filter_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      },
      "char_filter":{
          "my_char_filter":{
            "type":"mapping",
            "mappings":["孙悟空 => 齐天大圣","猪八戒 => 天蓬元帅"]
          }
        }
    }
  }
}

Test:

POST char_filter_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "孙悟空打妖怪,猪八戒吃西瓜"
}
Result: a single token, 齐天大圣打妖怪,天蓬元帅吃西瓜 (the mapping replaced 孙悟空 and 猪八戒, and the keyword tokenizer keeps the whole text as one token)

12. Normalizers

Normalizers are similar to analyzers:

  • but instead of producing multiple tokens, they only produce one
  • they do not contain tokenizers and accept only some character filters and token filters
  • there is no built-in normalizer

Defining a normalizer is similar to defining an analyzer, except that it uses the normalizer keyword instead of analyzer:
PUT index 
{
    "settings": {
        "analysis": {
            "char_filter": {
                "quote": {
                    "type": "mapping",
                    "mappings": [
                        "« => \"",
                        "» => \""
                    ]
                }
            },
            "normalizer": {
                "my_normalizer_name": {
                    "type": "custom",
                    "char_filter": ["quote"],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "foo": {
              "type": "keyword",
                "normalizer": "my_normalizer_name"
            }
        }
    }
}
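The _analyze API also accepts a normalizer parameter, which is a quick way to inspect the single token a normalizer produces. A sketch:

GET index/_analyze
{
    "normalizer": "my_normalizer_name",
    "text": "« BÀR »"
}

This should return one token, with the guillemets mapped to double quotes and the text lowercased and ascii-folded.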

Insert some data:

PUT index/_doc/1
{
  "foo": "BÀR"
}

PUT index/_doc/2
{
  "foo": "bar"
}

PUT index/_doc/3
{
  "foo": "baz"
}

PUT index/_doc/4
{
  "foo": "« will be changed"
}

POST index/_refresh

Query:

GET index/_search
{
  "query": {
    "term": {
      "foo": "BAR"
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

Explanation: the above queries match documents 1 and 2, since BÀR is converted to bar at both index and query time.

