Elasticsearch: What Is a Mapping

Author: 缓慢移动的蜗牛 | Published: 2017-04-04 22:05


Sample data

PUT /website/article/1
{
  "post_date": "2017-01-01",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2017-01-02",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2017-01-03",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}

Search tests:

GET /website/article/_search?q=2017    // returns all three documents
GET /website/article/_search?q=2017-01-01    // returns all three documents
GET /website/article/_search?q=post_date:2017-01-01  // returns only the document with post_date=2017-01-01
GET /website/article/_search?q=post_date:2017  // also returns only the document with post_date=2017-01-01

Why do the results differ?
The answer lies in the mapping that ES created automatically.

GET /website/_mapping/article

The response shows the data type of each field:

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

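When ES builds the mapping automatically, it assigns a data type to each field. Different data types are analyzed and searched differently, which is why a search against the _all field and a search against the post_date field return completely different results.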

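Search modes

exact value

For example, an exact value search for 2017-01-01 matches only when the query is exactly 2017-01-01. Searching for 01 alone finds nothing.

full text

Also known as full-text search.
When you search, the terms you search for go through a series of transformations:

Abbreviation vs. full name: cn vs. china

Word forms: like, liked, likes

Case: Tom vs. tom

Synonyms: like vs. love

For example, 2017-01-01 may first be split into 2017, 01 and 01, so a search for 2017 or for 01 finds it.
Matching is therefore not limited to the complete value: the value is split into terms (tokenized) and matched term by term, and abbreviations, tenses, case and synonyms can be matched as well.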
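The difference between the two modes can be sketched with a toy inverted index. This is a simplified illustration, not how ES is implemented: the helper names `index_exact` and `index_full_text` are made up for this example, and the tokenizer is a plain regular expression.

```python
import re

def index_exact(values):
    """Exact value: the whole value becomes a single term in the index."""
    index = {}
    for doc_id, value in values.items():
        index.setdefault(value, set()).add(doc_id)
    return index

def index_full_text(values):
    """Full text: split the value into lowercase tokens and index each one."""
    index = {}
    for doc_id, value in values.items():
        for token in re.findall(r"[a-z0-9]+", value.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index

dates = {1: "2017-01-01", 2: "2017-01-02", 3: "2017-01-03"}

exact = index_exact(dates)
full = index_full_text(dates)

print(sorted(exact.get("2017-01-01", set())))  # [1]
print(sorted(exact.get("2017", set())))        # []
print(sorted(full.get("2017", set())))         # [1, 2, 3]
print(sorted(full.get("02", set())))           # [2]
```

With the exact index only the complete value is a key, so a lookup for 2017 finds nothing. With the full-text index every token is a key, so 2017 finds all three documents.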

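Analyzers

Purpose: split text into terms and normalize them (which improves recall).

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
recall: how many of the relevant documents a search is able to find. Normalization increases the number of results a search can match.

What an analyzer does

character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you).
tokenizer: splits the text into tokens. For example: hello you and me --> hello, you, and, me
token filter: case conversion, stop word removal, synonym conversion and so on. For example: dogs --> dog, liked --> like, Tom --> tom. Words such as a/the/an carry little meaning and are dropped. mother --> mom, small --> little.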
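The three stages can be chained as in the following sketch. It only mimics the idea: the lookup tables `STEMS` and `SYNONYMS` are hypothetical stand-ins for real stemming and synonym filters, and none of the function names come from ES.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Hypothetical lookup tables standing in for real stemming and synonym filters.
STEMS = {"dogs": "dog", "liked": "like"}
SYNONYMS = {"mother": "mom", "small": "little"}

def char_filter(text):
    """Preprocess raw text: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)

def token_filter(tokens):
    """Lowercase, drop stop words, then apply stem and synonym lookups."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = STEMS.get(tok, tok)
        tok = SYNONYMS.get(tok, tok)
        out.append(tok)
    return out

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs & a mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'mom']
```

The order matters: character filters see the raw string, the tokenizer sees the cleaned string, and token filters see one token at a time.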

Built-in ES analyzers

standard analyzer

simple analyzer

whitespace analyzer

language analyzer (language-specific analyzers)

For example:

Sample sentence: Set the shape to semi-transparent by calling set_trans(5)
Results from the different analyzers:

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (language-specific, for example english): set, shape, semi, transpar, call, set_tran, 5
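Two of these results are easy to reproduce by hand. The sketch below approximates the simple analyzer (split on anything that is not a letter, then lowercase) and the whitespace analyzer (split on whitespace only). It handles ASCII letters only and is an approximation for this sentence, not the ES implementation.

```python
import re

SENTENCE = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_analyze(text):
    """Approximate the simple analyzer: letter runs only, lowercased."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, keep case."""
    return text.split()

print(simple_analyze(SENTENCE))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_analyze(SENTENCE))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```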

Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text":"I love you"
}

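Notes on query string searches

The query string must be analyzed with the same analyzer that was used when the index was built.
The query string is handled differently for exact value fields and full text fields.

Back to the queries at the start of this article:

GET /_search?q=2017 searches the _all field. All fields of a document are concatenated into one long string and analyzed, so all three documents match.
GET /_search?q=post_date:2017-01-01: post_date is stored as a date and indexed as an exact value, so only one document matches.
GET /_search?q=post_date:2017 also returns one document. This depends on how the ES version in use handles the query. Most likely the string 2017 is parsed as a date, with the missing month and day defaulting to 01-01.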
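The first two cases can be replayed in a few lines. This is a rough model under stated assumptions: `search_all` and `search_post_date` are invented names, the tokenizer is a plain regular expression, and query terms are combined with OR, which is the query string default.

```python
import re

DOCS = {
    1: {"post_date": "2017-01-01", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2017-01-02", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2017-01-03", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}

def tokens(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", str(text).lower())

def search_all(query):
    """Match documents whose concatenated field values share a term with the query."""
    terms = set(tokens(query))
    hits = []
    for doc_id, doc in DOCS.items():
        all_field = " ".join(str(v) for v in doc.values())  # stand-in for _all
        if terms & set(tokens(all_field)):
            hits.append(doc_id)
    return hits

def search_post_date(value):
    """Match documents whose post_date equals the query as a whole value."""
    return [doc_id for doc_id, doc in DOCS.items() if doc["post_date"] == value]

print(search_all("2017"))              # [1, 2, 3]
print(search_all("2017-01-01"))        # [1, 2, 3]
print(search_post_date("2017-01-01"))  # [1]
```

The query 2017-01-01 against the concatenated field is itself split into 2017, 01 and 01, and every document contains 2017, so all three match.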

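Summary:

When you insert data directly into ES, it automatically creates the index, the type and the corresponding mapping.

The mapping automatically defines the data type of each field.

Depending on the data type (text versus date, for example), a field is either exact value or full text.

For an exact value field, the whole value is added to the inverted index as a single term. A full text field goes through analysis, that is tokenization and normalization (tense conversion, synonym conversion, case conversion), before its terms are added to the inverted index.

Whether a field is exact value or full text also determines how a search against it behaves, and that behavior mirrors how the inverted index was built. An exact value search matches against the whole value, while a full text query string is itself tokenized and normalized before it is looked up in the inverted index.

You can rely on ES dynamic mapping to create the mapping automatically, including the data types. You can also create the mapping for an index and type by hand in advance and configure each field yourself: data type, indexing behavior, analyzer and so on.

A mapping is the metadata of a type in an index. Each type has its own mapping, which determines the data types, how the inverted index is built and how searches behave.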

Core mapping data types

string/text

byte,short,integer,long

float,double

boolean

date

When ES creates the mapping dynamically (dynamic mapping), the rules are as follows:

true or false ----------> boolean

123 ----------> long

123.45 ----------> double

2017-01-01 ----------> date

"hello world" ----------> string/text

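The dynamic mapping rules listed above can be expressed as a small type-inference function. This is a simplified sketch: `infer_type` is a hypothetical helper, and date detection is reduced to the single yyyy-MM-dd pattern used in this article, whereas ES recognizes more formats.

```python
import re
from datetime import datetime

def infer_type(value):
    """Guess the mapping type for one JSON value, following the rules listed above."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            try:
                datetime.strptime(value, "%Y-%m-%d")
                return "date"
            except ValueError:
                pass  # looks like a date but is not a valid one
        return "text"
    raise TypeError(f"unsupported value: {value!r}")

doc = {"post_date": "2017-01-01", "title": "my first article", "author_id": 11400}
print({field: infer_type(value) for field, value in doc.items()})
# {'post_date': 'date', 'title': 'text', 'author_id': 'long'}
```

Note that a number sent as a JSON string, such as "18", is inferred as text rather than long, because the rule is applied to the JSON value as it arrives.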
Viewing a mapping

GET /index/_mapping/type

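Index options

analyzed: the value is analyzed (tokenized)

not_analyzed: the value is not analyzed but indexed as a whole, the same as an exact value
Note: from ES 5.0 onward, not_analyzed (together with the string type) is deprecated. Use "type": "keyword" instead.

no: the field is neither indexed nor searchable

Built-in analyzers include
whitespace, simple, english and others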

Creating or modifying a mapping

Create

PUT /website
{
  "mappings": {
    "article":{
      "properties": {
        "author_id":{
          "type": "long"
        },
        "title":{
          "type": "text",
          "analyzer": "english"
        },
        "content":{
          "type": "text"
        },
        "post_date":{
          "type": "date"
        },
        "publisher_id":{
          // the next two lines can be replaced with "type": "keyword"
          "type": "string",
          "index":"not_analyzed"
        }
      }
    }
  }
}

Modify or add a new field

PUT /website/_mapping/article
{
  "properties": {
    "new_field":{
      // the next two lines can be replaced with "type": "keyword"
      "type": "string",
      "index":"not_analyzed"
    }
  }
}

Testing the new mapping and index
Testing the content field

GET /website/_analyze
{
  "field": "content", // uses the standard analyzer by default
  "text": "my-dogs"
}

Result:

{
  "tokens": [
    {
      "token": "my",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dogs",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Testing new_field

GET /website/_analyze
{
  "field": "new_field",
  "text":"my-dogs"
}

The request fails because this field was defined as not analyzed. It is an exact value.

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[XHoQN0O][127.0.0.1:9300][indices:admin/analyze[s]]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
  },
  "status": 400
}

Complex data types

PUT /company/employee/1
{
  "address":{
    "country":"china",
    "provice":"beijing",
    "city":"beijing"
  },
  "name":"lili",
  "age":"18"
}

The mapping of the employee type:
GET /company/_mapping/employee

{
  "company": {
    "mappings": {
      "employee": {
        "properties": {
          "address": {
            "properties": {
              "city": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "country": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "provice": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "age": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Under the hood, this object is stored in a form similar to

{
  "name":[lili],
  "age":[18],
  "address.country":[china],
  "address.provice":[beijing],
  "address.city":[beijing]
}

The underlying storage for more complex data

{
    "author":[
        {"age":26,"name":"Jack White"},
        {"age":55,"name":"Tom Jones"},
        {"age":39,"name":"Kitty Smith"}
    ]
}
// underlying storage
{
    "author.age":[26,55,39],
    "author.name":[jack,white,tom,jones,kitty,smith]
}
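The flattening shown above can be sketched as a recursive walk that collects leaf values under dotted paths. `flatten` is a hypothetical helper written for this illustration. It stops at the raw values and does not tokenize them, so author.name holds whole names rather than the individual terms a text field would produce.

```python
def flatten(obj, prefix=""):
    """Collect leaf values under dotted paths; arrays of objects merge per path."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its paths into ours.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

doc = {
    "author": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}
print(flatten(doc))
# {'author.age': [26, 55, 39], 'author.name': ['Jack White', 'Tom Jones', 'Kitty Smith']}
```

In this flattened form the pairing between one author's age and name is no longer visible. Each path simply holds a list of values.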
