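Character filters preprocess the stream of characters before it is passed to the tokenizer, and can add, remove, or change characters. (Token filters, by contrast, receive a stream of tokens from a tokenizer and can add, modify, or delete tokens.)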
1. The HTML strip character filter ("type": "html_strip")
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter_name"]
        }
      },
      "char_filter": {
        "my_char_filter_name": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
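Test it as follows (note that this request uses the built-in html_strip filter with its default settings, so the <b> tags are stripped as well):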
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
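
To get a feel for what "escaped_tags" changes, here is a rough Python emulation. It is only a sketch and not how Elasticsearch implements html_strip: the real filter also replaces block-level tags such as <p> with a newline, which this version simply drops. The function name strip_html is made up for illustration.

```python
import html
import re


def strip_html(text, escaped_tags=()):
    """Remove HTML tags from text, keeping the tags named in escaped_tags.

    Rough stand-in for the html_strip character filter (hypothetical helper,
    not Elasticsearch code). Block-level tags are dropped, not turned into
    newlines as the real filter does.
    """
    if escaped_tags:
        keep = "|".join(re.escape(tag) for tag in escaped_tags)
        # The negative lookahead skips any tag whose name is in escaped_tags.
        pattern = rf"</?(?!(?:{keep})\b)[A-Za-z][^>]*>"
    else:
        pattern = r"</?[A-Za-z][^>]*>"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Decode HTML entities such as &amp; after the tags are gone.
    return html.unescape(stripped)


text = "<p>I'm so <b>happy</b>!</p>"
print(strip_html(text))          # I'm so happy!
print(strip_html(text, ["b"]))   # I'm so <b>happy</b>!
```

The second call corresponds to the custom my_char_filter_name defined above, where <b> is listed in "escaped_tags" and therefore survives.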

2. The mapping character filter ("type": "mapping")
This filter takes a map of keys and values. Whenever it finds a key in the input string, it replaces that key with the associated value.
PUT my_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_name": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter_name"
          ]
        }
      },
      "char_filter": {
        "my_char_filter_name": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
Test:
POST my_index_name/_analyze
{
  "analyzer": "my_analyzer_name",
  "text": "My license plate is ٢٥٠١٥"
}
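
The replacement that the mapping filter performs can be approximated locally in Python. This is a minimal sketch, assuming rules in the same "key => value" form as above. The helper names parse_mappings and apply_mappings are hypothetical, and the whitespace handling is simplified compared with the real filter. Longer keys are tried first, so the longest match at a given position wins.

```python
import re


def parse_mappings(rules):
    """Turn ["key => value", ...] rules into a dict (hypothetical helper)."""
    table = {}
    for rule in rules:
        key, value = rule.split("=>", 1)
        table[key.strip()] = value.strip()
    return table


def apply_mappings(text, table):
    """Replace every key found in text with its mapped value."""
    # Sort longer keys first so the longest key matching at a position wins.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


# Arabic-Indic digits U+0660..U+0669 mapped to 0..9, as in the settings above.
rules = [f"{chr(0x0660 + digit)} => {digit}" for digit in range(10)]
table = parse_mappings(rules)

plate = "My license plate is \u0662\u0665\u0660\u0661\u0665"
print(apply_mappings(plate, table))  # My license plate is 25015
```

Because the analyzer above uses the keyword tokenizer, the whole converted string would come back as a single token.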

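3. The pattern replace character filter ("type": "pattern_replace")
This filter matches characters with a regular expression and replaces them with the given replacement string, which can refer to capture groups as $1, $2, and so on.
Examples:
Example 1. Input: "aa bb aa bb", pattern="(aa)\s+(bb)", replacement="$1#$2"
Output: "aa#bb aa#bb"
Example 2. Input: "aa123bb", pattern="(aa)\d+(bb)", replacement="$1 $2"
Output: "aa bb"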
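
The two examples can be checked locally with Python's re module. This is only an approximation: Elasticsearch uses Java regular expressions, and the helper pattern_replace below is a made-up name that handles nothing more than simple $N group references (it assumes replacements such as "$1#$2" and "$1 $2").

```python
import re


def pattern_replace(text, pattern, replacement):
    """Apply a regex replacement written with Java-style $N group references.

    Hypothetical helper for experimenting; it only rewrites $N into Python's
    \\g<N> form and does not cover other Java replacement syntax.
    """
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)


print(pattern_replace("aa bb aa bb", r"(aa)\s+(bb)", "$1#$2"))  # aa#bb aa#bb
print(pattern_replace("aa123bb", r"(aa)\d+(bb)", "$1 $2"))      # aa bb
```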
PUT pattern_test5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
Test data:
POST pattern_test5/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
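
The lookahead (?=\d) in the pattern above is what restricts the replacement to hyphens that sit between digits. A small Python sketch shows the difference. It uses Python's re module rather than the Java regex engine Elasticsearch runs, so treat it as an illustration only. The second sample string is invented for the comparison.

```python
import re

WITH_LOOKAHEAD = r"(\d+)-(?=\d)"   # hyphen must be followed by a digit
WITHOUT_LOOKAHEAD = r"(\d+)-"      # any hyphen directly after digits

card = "My credit card is 123-456-789"
# The lookahead does not consume the next digit, so consecutive groups
# such as 123-456-789 are all rewritten.
print(re.sub(WITH_LOOKAHEAD, r"\1_", card))  # My credit card is 123_456_789

trailing = "ext 555- now"  # made-up input: hyphen not followed by a digit
print(re.sub(WITH_LOOKAHEAD, r"\1_", trailing))     # ext 555- now
print(re.sub(WITHOUT_LOOKAHEAD, r"\1_", trailing))  # ext 555_ now
```

Replacing the hyphens with underscores before tokenizing matters here because the standard tokenizer would otherwise split the number at each hyphen.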
