I. Examples
1. The whitespace analyzer ("analyzer": "whitespace")
POST _analyze
{
"analyzer": "whitespace",
"text": "This is my first program and these are my first 5 lines."
}
![](https://img.haomeiwen.com/i7007629/66686ebaafa43f66.png)
2. The standard tokenizer ("tokenizer": "standard")
The lowercase filter below converts the resulting tokens to lowercase.
POST _analyze
{
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding"],
"text": "Is this déja vu?"
}
![](https://img.haomeiwen.com/i7007629/980afa341dc1887e.png)
![](https://img.haomeiwen.com/i7007629/0c3a6a896adb4c01.png)
3. Custom analyzer ("type": "custom")
The "type": "custom" below is a fixed value for custom analyzers (another option is "type": "standard", shown later).
PUT my_index_name
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"my_sentence": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
It can be used like this:
GET my_index_name/_analyze
{
"analyzer": "custom_analyzer",
"text": "Is this déjà vu?"
}
It can also be used like this:
GET my_index_name/_analyze
{
"field": "my_sentence",
"text": "Is this déjà vu?"
}
![](https://img.haomeiwen.com/i7007629/e2b8fc318c352e15.png)
4. Custom analyzer ("type": "standard") with stopwords
PUT my_index2
{
"settings": {
"analysis": {
"analyzer": {
"my_personalized_analyzer": {
"type": "standard",
"max_token_length": 5,
"stopwords": "_english_"
}
}
}
}
}
Use it like this:
POST my_index2/_analyze
{
"analyzer": "my_personalized_analyzer",
"text": "This is my first program and these are my first 5 lines"
}
The tokens are: [my, first, progr, am, my, first, 5, lines]
Note: if an analyzer that stems (such as the english analyzer) were used instead, the last token would be line, with the s stripped.
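For comparison, the same sentence can be run through the built-in standard analyzer:
POST _analyze
{
  "analyzer": "standard",
  "text": "This is my first program and these are my first 5 lines"
}
This returns [this, is, my, first, program, and, these, are, my, first, 5, lines]: with the default max_token_length of 255 nothing is split, and no stop words are removed.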
5. Custom analyzer ("type": "standard") without stopwords
PUT my_index2
{
"settings": {
"analysis": {
"analyzer": {
"my_personalized_analyzer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
POST my_index2/_analyze
{
"analyzer": "my_personalized_analyzer",
"text": "This is my first program and these are my first 5 lines"
}
![](https://img.haomeiwen.com/i7007629/16b28ced88d17e42.png)
Note that this and is are both kept this time.
6. No "type" needed
Define an analyzer that removes duplicate tokens:
PUT standard_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_standard": {
"tokenizer": "standard",
"filter": [
"remove_duplicates"
]
}
}
}
}
}
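A quick way to try it (note that the remove_duplicates token filter only removes duplicate tokens in the same position, e.g. those produced by a synonym or stemmer filter; it does not drop repeated words at different positions):
POST standard_example/_analyze
{
  "analyzer": "rebuilt_standard",
  "text": "This is my first program and these are my first lines."
}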
7. The simple analyzer ("analyzer": "simple")
The simple analyzer breaks text into tokens whenever it encounters a non-letter character. It cannot be configured and consists of just a lowercase tokenizer.
Here is an example:
POST _analyze
{
"analyzer": "simple",
"text": "This is my first program and these are my first 5 lines."
}
The 5 and the final period are both ignored; the tokens are [this, is, my, first, program, and, these, are, my, first, lines].
Now rebuild it as a custom analyzer:
PUT simple_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_simple": {
"tokenizer": "lowercase",
"filter": []
}
}
}
}
}
Usage:
POST simple_example/_analyze
{
"analyzer": "rebuilt_simple",
"text": "one 2 three Five 5"
}
![](https://img.haomeiwen.com/i7007629/67651adbb8cae9df.png)
8. The stop analyzer ("analyzer": "stop")
By default it uses the _english_ stop words.
![](https://img.haomeiwen.com/i7007629/c08f0a4a5ecc9af5.png)
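A quick example with the built-in stop analyzer:
POST _analyze
{
  "analyzer": "stop",
  "text": "This is my first program and these are my first lines."
}
The _english_ stop words this, is, and, these and are are dropped, leaving [my, first, program, my, first, lines].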
Of course, we can also customize it:
PUT my_index_name
{
"settings": {
"analysis": {
"analyzer": {
"the_stop_analyzer": {
"type": "stop",
"stopwords": ["first", "and", "_english_"]
}
}
}
}
}
Test it:
POST my_index_name/_analyze
{
"analyzer": "the_stop_analyzer",
"text": "This is my first program and these are my first lines."
}
![](https://img.haomeiwen.com/i7007629/ca9919d7382d7160.png)
The stop analyzer consists of a lowercase tokenizer plus a stop token filter; both can be configured, as in the custom (rebuilt) definition below:
PUT stop_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_" //this can be overwritten with stopwords or stopwords_path parameters
}
},
"analyzer": {
"rebuilt_stop": {
"tokenizer": "lowercase",
"filter": [
"english_stop"
]
}
}
}
}
}
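It can then be used like this:
POST stop_example/_analyze
{
  "analyzer": "rebuilt_stop",
  "text": "This is my first program and these are my first lines."
}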
9. The keyword analyzer ("analyzer": "keyword")
POST _analyze
{
"analyzer": "keyword",
"text": "This is my first program and these are my first lines."
}
![](https://img.haomeiwen.com/i7007629/3e8c2362f1a28a80.png)
We can rebuild it as a custom analyzer:
PUT keyword_example
{
"settings": {
"analysis": {
"analyzer": {
"rebuilt_keyword": {
"tokenizer": "keyword",
"filter": []
}
}
}
}
}
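A quick test; the keyword tokenizer emits the entire input as a single token:
POST keyword_example/_analyze
{
  "analyzer": "rebuilt_keyword",
  "text": "This is my first program and these are my first lines."
}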
10. The pattern analyzer ("analyzer": "pattern")
The pattern analyzer splits text into terms using a regular expression. The regular expression must match the token separators, not the tokens themselves. It defaults to \W+, i.e. non-word characters act as the separators.
POST _analyze
{
"analyzer": "pattern",
"text": "This is my first program and these are my first line."
}
![](https://img.haomeiwen.com/i7007629/3f98f8dbe7973049.png)
Configuring other parameters:
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
POST my_index_name/_analyze
{
"analyzer": "email_analyzer",
"text": "Jane_Smith@foo-bar.com"
}
![](https://img.haomeiwen.com/i7007629/7d9718942c2e662f.png)
Custom (rebuilt) version:
PUT pattern_example
{
"settings": {
"analysis": {
"tokenizer": {
"split_on_non_word": {
"type": "pattern",
"pattern": "\\W+"
}
},
"analyzer": {
"rebuilt_pattern": {
"tokenizer": "split_on_non_word",
"filter": [
"lowercase"
]
}
}
}
}
}
![](https://img.haomeiwen.com/i7007629/b9dfb0dfb922299b.png)
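It can be tested with the same email address as above; since \W+ does not match the underscore (a word character), the result is [jane_smith, foo, bar, com] rather than splitting jane and smith:
POST pattern_example/_analyze
{
  "analyzer": "rebuilt_pattern",
  "text": "Jane_Smith@foo-bar.com"
}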
11. The language analyzer
Language analyzers analyze text in a specific language. They support custom stopwords, and the stem_exclusion parameter lets you specify an array of lowercase words that should not be stemmed.
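For example, a built-in language analyzer such as english can be configured with custom stopwords and stem_exclusion (the index name and word list here are made up for illustration):
PUT lang_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": "_english_",
          "stem_exclusion": ["lines", "programs"]
        }
      }
    }
  }
}
The examples below use the ik Chinese analysis plugin (which provides the ik_smart and ik_max_word analyzers) rather than a built-in language analyzer.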
POST _analyze
{
"analyzer": "ik_smart",
"text": ["我是中国人"]
}
ik_smart performs the coarsest-grained segmentation.
![](https://img.haomeiwen.com/i7007629/8db40570136c8ce4.png)
The following gives a different result:
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["我是中国人"]
}
ik_max_word splits the text at the finest granularity.
![](https://img.haomeiwen.com/i7007629/e01b6891f48323f0.png)
Now let's configure a bit more:
PUT my_index2
{
"settings": {
"analysis": {
"analyzer": {
"ik": {
"tokenizer": "ik_max_word"
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text"
},
"content": {
"type": "text",
"analyzer": "ik"
}
}
}
}
Test it:
POST my_index2/_analyze
{
"analyzer": "ik",
"text": ["我是中国人"]
}
![](https://img.haomeiwen.com/i7007629/e9c42e21412dfa2e.png)
It can also be used like this:
POST my_index2/_analyze
{
"field": "content",
"text": ["我是中国人"]
}
An example of the mapping character filter (char_filter):
PUT char_filter_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
},
"char_filter":{
"my_char_filter":{
"type":"mapping",
"mappings":["孙悟空 => 齐天大圣","猪八戒 => 天蓬元帅"]
}
}
}
}
}
Test it:
POST char_filter_index/_analyze
{
"analyzer": "my_analyzer",
"text": "孙悟空打妖怪,猪八戒吃西瓜"
}
![](https://img.haomeiwen.com/i7007629/3a8428d8ed2aafae.png)
12. Normalizers
- Normalizers are similar to analyzers, but instead of producing multiple tokens, they produce only one
- They do not contain tokenizers and accept only some character filters and token filters
- There is no built-in normalizer
Defining a normalizer is similar to defining an analyzer, except that it is registered under the normalizer keyword instead of analyzer:
PUT index
{
"settings": {
"analysis": {
"char_filter": {
"quote": {
"type": "mapping",
"mappings": [
"« => \"",
"» => \""
]
}
},
"normalizer": {
"my_normalizer_name": {
"type": "custom",
"char_filter": ["quote"],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"properties": {
"foo": {
"type": "keyword",
"normalizer": "my_normalizer_name"
}
}
}
}
Insert some documents:
PUT index/_doc/1
{
"foo": "BÀR"
}
PUT index/_doc/2
{
"foo": "bar"
}
PUT index/_doc/3
{
"foo": "baz"
}
PUT index/_doc/4
{
"foo": "« will be changed"
}
POST index/_refresh
Query:
GET index/_search
{
"query": {
"term": {
"foo": "BAR"
}
}
}
GET index/_search
{
"query": {
"match": {
"foo": "BAR"
}
}
}
![](https://img.haomeiwen.com/i7007629/da4b3b42f9e38fd5.png)
Explanation: the above queries match documents 1 and 2, since BÀR is converted to bar at both index time and query time.
![](https://img.haomeiwen.com/i7007629/c3c1ff977b51cd83.png)
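The char_filter in the normalizer is applied at query time as well: a term query containing « is rewritten to " (and lowercased) before it is compared, so a query like the following should match document 4 (a sketch based on the mapping above):
GET index/_search
{
  "query": {
    "term": {
      "foo": "« WILL be changed"
    }
  }
}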