I. Examples
1. The standard tokenizer ("tokenizer": "standard")
uses the Unicode Text Segmentation algorithm to divide the text on word boundaries
POST _analyze
{
  "tokenizer": "standard",
  "text": "Those who dare to fail miserably can achieve greatly."
}
Output:
[Those, who, dare, to, fail, miserably, can, achieve, greatly]
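The standard tokenizer also accepts a max_token_length parameter (255 by default); longer tokens are split at max_token_length intervals. A minimal sketch of a configured instance, reusing the my_index_name / my_analyzer_name / my_tokenizer_name naming from the pattern examples below:
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "my_tokenizer_name"
        }
      },
      "tokenizer": {
        "my_tokenizer_name": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}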
2. The letter tokenizer ("tokenizer": "letter")
breaks the text into individual terms whenever it encounters a character that is not a letter
POST _analyze
{
  "tokenizer": "letter",
  "text": "You're a wizard, Harry."
}
Output:
[You, re, a, wizard, Harry]
3. The lowercase tokenizer ("tokenizer": "lowercase")
behaves like the letter tokenizer, but additionally lowercases every term
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "You're a wizard, Harry."
}
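Output (the letter tokenizer's terms, lowercased):
[you, re, a, wizard, harry]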
4. The whitespace tokenizer ("tokenizer": "whitespace")
breaks the text into individual terms whenever a whitespace character is encountered
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "You're a wizard, Harry."
}
Output: [You're, a, wizard, Harry]
5. The keyword tokenizer ("tokenizer": "keyword")
outputs the entire text as a single term
POST _analyze
{
  "tokenizer": "keyword",
  "text": "Los Angeles"
}
Output:
[Los Angeles]
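The keyword tokenizer is typically combined with token filters to normalize a value as a whole; for example, adding the lowercase filter to the same request emits the entire input as one lowercased term:
POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "Los Angeles"
}
Output:
[los angeles]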
6. The pattern tokenizer ("tokenizer": "pattern")
uses a regular expression either to split the text into terms or to capture the matching text as terms.
The default pattern is \W+, which splits the text whenever it encounters non-word characters
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
Output:
[The, foo_bar_size, s, default, is, 5]
The following parameters can be configured:
- pattern: the regular expression, defaulting to \W+.
- flags: Java regular-expression flags, pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".
- group: which capture group to extract as the tokens. It defaults to -1, meaning split on the pattern rather than capture (see the group-capture sketch at the end of this section).
For example, to split on commas or pipes (note that pattern must be a single regex string, not a list of patterns):
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "my_tokenizer_name"
        }
      },
      "tokenizer": {
        "my_tokenizer_name": {
          "type": "pattern",
          "pattern": "[,|]\\s*"
        }
      }
    }
  }
}
Test:
POST my_index_name/_analyze
{
  "analyzer": "my_analyzer_name",
  "text": "comma, separated, values|one|two|three-four"
}
Output:
[comma, separated, values, one, two, three-four]
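When group is set to a non-negative number, the pattern captures terms instead of splitting. A minimal sketch, assuming a hypothetical index my_group_index whose tokenizer extracts single-quoted values via capture group 1:
PUT my_group_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "my_tokenizer_name"
        }
      },
      "tokenizer": {
        "my_tokenizer_name": {
          "type": "pattern",
          "pattern": "'([^']*)'",
          "group": 1
        }
      }
    }
  }
}

POST my_group_index/_analyze
{
  "analyzer": "my_analyzer_name",
  "text": "'value1' 'value2' ignored 'value3'"
}
This should yield [value1, value2, value3]; the unquoted text is dropped because only group 1 of each match becomes a token.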
7. The simple pattern tokenizer ("type": "simple_pattern")
Similar to the pattern tokenizer, but it uses a single pattern to capture matching text as terms and does not support splitting on the pattern, so it is usually faster than the pattern tokenizer
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "my_tokenizer_name"
        }
      },
      "tokenizer": {
        "my_tokenizer_name": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}
Test:
POST my_index_name/_analyze
{
  "analyzer": "my_analyzer_name",
  "text": "asta-313-267-847-mm-309"
}
Output:
[313, 267, 847, 309]