Multi-fields and Custom Analyzers

Author: 滴流乱转的小胖子 | Published: 2020-07-08 06:00

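1. Multi-fields

Exact matching with multi-fields

  • Add a keyword sub-field

Using different analyzers

  • Different languages
  • Searching on a pinyin field
  • Different analyzers can also be specified for search time and for index time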
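
As a minimal sketch of the points above, the following mapping gives one text field a keyword sub-field for exact matching, an English-analyzed sub-field, and a separate search-time analyzer. The index name products, the field name title, and the sub-field name english are made up for illustration, and the syntax assumes Elasticsearch 7.x.

```console
# Hypothetical index and field names; "analyzer" applies at index time,
# "search_analyzer" applies to the query string at search time
PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```

Queries can then target title for full-text search, title.keyword for exact matches, sorting, or aggregations, and title.english for stemmed English matching. A pinyin sub-field would be declared the same way, but it needs a pinyin analysis plugin because no such analyzer is built in.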

2. Exact Values vs Full Text

Exact Values: numbers, dates, or a specific string treated as a whole (for example "Apple Store")

  • keyword in ES

Full text: unstructured text data

  • text in ES



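Exact Values do not need to be tokenized

  • ES builds an inverted index for every field
  • An exact value needs no special analysis at index time; the whole value is stored as a single term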
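
To make the difference concrete, the same string can be run through the built-in keyword analyzer, which emits the whole input as one term, and through the standard analyzer, which splits and lowercases it. This is only an illustration using the _analyze API.

```console
# Exact value: a single term "Apple Store", unchanged
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

# Full text: two terms, "apple" and "store"
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
```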


An analyzer is built from three kinds of components:

  • Character Filters
  • Tokenizer
  • Token Filters

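3. Custom Analysis

When the analyzers that ship with ES do not meet your needs, you can define a custom analyzer by combining different components.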

3.1 Character Filters

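Character filters process the text before it reaches the tokenizer, for example by adding, removing, or replacing characters. Several character filters can be configured. Because they change the text, they affect the position and offset information produced by the tokenizer.

Some built-in character filters

  • HTML strip (html_strip) -- removes HTML tags
  • Mapping (mapping) -- string replacement
  • Pattern replace (pattern_replace) -- regular-expression replacement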

3.2 Tokenizer

  • Splits the original text into terms (tokens) according to a set of rules
  • Tokenizers built into ES:
    whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
  • You can implement your own tokenizer by developing a plugin in Java
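
A small illustration of how the choice of tokenizer changes the output; the sample text is made up. The uax_url_email tokenizer works like standard but keeps URLs and email addresses as single tokens, while standard breaks the address into pieces.

```console
# The email address is kept as one token
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com for help"
}

# The email address is split at the @ sign
POST _analyze
{
  "tokenizer": "standard",
  "text": "Contact admin@example.com for help"
}
```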

3.3 Token Filters

  • Add, modify, or remove the terms output by the tokenizer
  • Built-in token filters:
    lowercase / stop / synonym (adds synonyms)



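How analysis is executed
1: Overall flow of an analyzer
char filter -> tokenizer -> token filter
2: When analysis happens
Analysis runs at index time, that is, when data is written. Its purpose is to build the inverted index that makes search fast. The original data as written is kept in _source. The query string of a full-text query is also analyzed at search time so that its terms can match the indexed terms.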


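# Index a document; with the default dynamic mapping the string field "level" becomes text with a keyword sub-field
PUT logs/_doc/1
{"level":"DEBUG"}

# Inspect the mapping that was generated
GET /logs/_mapping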
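
Assuming the default dynamic mapping (a text field level with a keyword sub-field level.keyword), the following sketch shows how the two fields behave at search time. The text field was analyzed into the lowercase term debug, while the keyword sub-field kept DEBUG as is. In every hit, _source still shows the original "DEBUG".

```console
# Matches: the query string is analyzed to "debug", the same term that is in the inverted index
GET /logs/_search
{
  "query": { "match": { "level": "debug" } }
}

# No hit: term queries are not analyzed, and level.keyword holds the exact value "DEBUG"
GET /logs/_search
{
  "query": { "term": { "level.keyword": "debug" } }
}

# Matches: the exact value
GET /logs/_search
{
  "query": { "term": { "level.keyword": "DEBUG" } }
}
```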

# Strip HTML tags with html_strip; the keyword tokenizer emits the remaining text as one token
POST _analyze
{
  "tokenizer":"keyword",
  "char_filter":["html_strip"],
  "text": "<b>hello world</b>"
}


# path_hierarchy emits one token per level of the path
POST _analyze
{
  "tokenizer":"path_hierarchy",
  "text":"/user/ymruan/a/b/c/d/e"
}

# Use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

# Use a mapping char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ ":) => happy", ":( => sad"]
      }
    ],
  "text": ["I am feeling :)", "Feeling :( today"]
}

# whitespace tokenizer with the stop and snowball filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}


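# Same chain on another sentence: the leading "The" survives because the stop filter is case-sensitive and nothing lowercases it first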
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}


# With lowercase added before stop, "The" becomes "the" and is removed as a stopword
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The girls in China are playing this game!"]
}

# Regular-expression replacement with pattern_replace
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "pattern_replace",
        "pattern" : "http://(.*)",
        "replacement" : "$1"
      }
    ],
    "text" : "http://www.elastic.co"
}

Custom analyzer example

The general form of a custom analyzer definition is:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },// character filters
            "tokenizer": { ... custom tokenizers ... },// tokenizers
            "filter": { ... custom token filters ... }, // token filters
            "analyzer": { ... custom analyzers ... }
        }
    }
}
============================Example===========================
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type": "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "my_stopwords" ]
            }}
}}}


============================Example===========================
Suppose the custom analyzer is named my_analyzer. Apply it to a newly added field in this index:
PUT /my_index/_mapping
{
   "properties":{
        "username":{
             "type":"text",
              "analyzer" : "my_analyzer"
         },
        "password" : {
          "type" : "text"
        }
    
  }
}
=================Insert data====================
PUT /my_index/_doc/1
{
  "username": "The quick & brown fox ",
  "password": "The quick & brown fox "
}
====username uses the custom analyzer my_analyzer; password uses the default standard analyzer==
===Verify
GET /my_index/_analyze
{
  "field":"username",
  "text":"The quick & brown fox"
}

GET /my_index/_analyze
{
  "field":"password",
  "text":"The quick & brown fox"
}
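
As a further check, here is a search sketch that assumes the index, mapping, and document above were created as written. The char filter turned & into the term and in username, whereas the standard analyzer dropped & from password, so only the first query should return document 1.

```console
# Expected to match: "and" was indexed for username by my_analyzer
GET /my_index/_search
{
  "query": { "match": { "username": "and" } }
}

# Expected to return no hits: the standard analyzer discarded "&"
GET /my_index/_search
{
  "query": { "match": { "password": "and" } }
}
```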
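The official Definitive Guide explains this really well. It covers Elasticsearch 2.x, though, so it is dated and some of its APIs no longer apply. Its section on custom analyzers: https://www.elastic.co/guide/cn/elasticsearch/guide/cn/custom-analyzers.html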
