Multi-Field Types
- The multi-field feature
- Exact matching on a vendor name
  - Add a keyword sub-field
- Use different analyzers per sub-field
  - Different languages
  - Searching via a pinyin sub-field
- Different analyzers can also be specified for indexing and for search
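As a minimal sketch of the multi-field idea (the index and field names here are illustrative, not from the course), a text field can carry a keyword sub-field so the same value supports both full-text and exact matching:

```
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

# Exact match goes against the keyword sub-field
GET products/_search
{
  "query": {
    "term": { "company.keyword": "Apple Store" }
  }
}
```

A pinyin sub-field works the same way, but assumes the community pinyin analysis plugin is installed.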
Exact Values vs. Full Text
- Exact values: numbers, dates, or a specific string (e.g., "Apple Store")
  - The keyword type in Elasticsearch
- Full text: unstructured text data
  - The text type in Elasticsearch
- Exact values do not need to be analyzed
- Elasticsearch creates an inverted index for each field
  - Exact values require no special processing at index time
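The contrast can be seen directly with _analyze (this sketch is illustrative, not part of the course demo): a text field goes through an analyzer, while a keyword field is indexed as-is.

```
# Full text: the standard analyzer splits and lowercases,
# producing the terms [apple, store]
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}

# Exact value: the keyword "analyzer" emits the whole string
# as a single, unmodified term [Apple Store]
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
```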
Custom Analyzers
- When Elasticsearch's built-in analyzers don't meet your needs, you can define a custom analyzer by combining different components:
  - Character Filters
  - Tokenizer
  - Token Filters
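The three components above are combined in the index settings. A minimal sketch (index, analyzer, and char filter names are illustrative):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => happy", ":( => sad" ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

# Test the custom analyzer
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am feeling :)"
}
```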
-
Character Filters
- Process the text before it reaches the Tokenizer, e.g., to add, delete, or replace characters. Multiple character filters can be configured. They affect the position and offset information produced by the Tokenizer.
- Some built-in character filters:
  - HTML strip: removes HTML tags
  - Mapping: string replacement
  - Pattern replace: regex-based replacement
-
Tokenizer
- Splits the original text into words (terms or tokens) according to certain rules
- Built-in tokenizers:
  - whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
- You can implement your own Tokenizer as a plugin written in Java
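Most of these tokenizers appear in the course demo below; uax_url_email, which keeps URLs and email addresses intact as single tokens, can be tried the same way (the sample text is illustrative):

```
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://www.elastic.co"
}
```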
Token Filters
- Add, modify, or delete the terms produced by the Tokenizer
- Some built-in token filters:
  - lowercase / stop / synonym (adds synonyms)
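The demo below exercises stop, snowball, and lowercase; the synonym filter can be tested with an inline definition in _analyze (the synonym list here is illustrative):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": [ "quick, fast" ]
    }
  ],
  "text": "The Quick fox"
}
```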
Course Demo
PUT logs/_doc/1
{"level":"DEBUG"}

GET /logs/_mapping

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a/b/c/d/e"
}
# use a char filter to do string replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "- => _" ]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// use a char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ ":) => happy", ":( => sad" ]
    }
  ],
  "text": ["I am feeling :)", "Feeling :( today"]
}
// whitespace and snowball
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}
// whitespace with stop and snowball
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
// with lowercase added, "The" is lowercased and then removed as a stopword
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}
// regular-expression replacement with pattern_replace
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}