17. Elasticsearch Index Commands, Analyzers, and How They Work

Author: 书写只为分享 | Published: 2019-11-22 01:00

1. Creating, modifying, and deleting indices

Example of creating an index:

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text"
        }
      }
    }
  }
}

Modifying an index (here, changing the number of replicas):

PUT /my_index/_settings
{
  "number_of_replicas": 1
}

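Deleting indices:

DELETE /my_index
DELETE /index_one,index_two
DELETE /index_*
DELETE /_all

To guard against an accidental DELETE /_all or wildcard delete, set action.destructive_requires_name: true in elasticsearch.yml. Delete requests must then name each index explicitly.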
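To make the reach of the delete expressions above concrete, here is a small Python sketch of how a comma-separated expression with wildcards could resolve to index names, and how a "names only" switch would reject the broad forms. It is a toy model, not Elasticsearch code: the function resolve_delete_targets and the sample index list are made up for illustration.

```python
from fnmatch import fnmatchcase


def resolve_delete_targets(expression, existing, destructive_requires_name=False):
    """Toy model: list the index names a DELETE expression would remove."""
    targets = []
    for part in expression.split(","):
        broad = part == "_all" or "*" in part
        if broad and destructive_requires_name:
            # Mirrors the intent of action.destructive_requires_name: true
            raise ValueError(f"explicit index names required, got {part!r}")
        if part == "_all":
            matched = list(existing)
        else:
            matched = [name for name in existing if fnmatchcase(name, part)]
        for name in matched:
            if name not in targets:
                targets.append(name)
    return targets


# Hypothetical set of indices on a cluster
indices = ["my_index", "index_one", "index_two", "logs"]
print(resolve_delete_targets("index_*", indices))
print(resolve_delete_targets("_all", indices))
```

With destructive_requires_name=True, the same "_all" or "index_*" expressions raise an error in this sketch, while an exact name such as "my_index" still resolves.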

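2. Analyzer settings

(1) The default analyzer: standard

standard tokenizer: splits text on word boundaries

standard token filter: does nothing

lowercase token filter: converts all letters to lowercase

stop token filter (disabled by default): removes stop words such as a, the, and it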

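(2) Changing analyzer settings

Create an index with an analyzer named es_std: the standard analyzer with the English stop-word token filter enabled. Fields that use es_std are then analyzed with stop words removed.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}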

(3) Defining your own analyzer

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

Test the analyzer:

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}

Apply the analyzer to a field:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
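The order in which a custom analyzer works (character filters first, then the tokenizer, then token filters) can be sketched in plain Python. This is only an approximation under stated assumptions: the regular expressions stand in for html_strip and the standard tokenizer, the helper names are invented, and the sketch replaces & with " and " padded by spaces purely so the output is easy to read. The real mapping char filter may treat whitespace in the rule differently, so do not read the result as the exact _analyze response.

```python
import re


def html_strip(text):
    """Rough stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]*>", " ", text)


def apply_mapping(text, rules):
    """Rough stand-in for a mapping char filter: plain string replacement."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def my_analyzer(text, stopwords=("the", "a")):
    # 1. char filters run on the raw string
    text = html_strip(text)
    text = apply_mapping(text, {"&": " and "})
    # 2. the tokenizer splits the string into tokens (approximation of "standard")
    tokens = re.findall(r"\w+", text)
    # 3. token filters run on each token, in the configured order
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in stopwords]


print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# prints ['tom', 'and', 'jerry', 'are', 'friend', 'in', 'house', 'haha']
```

The point of the sketch is the ordering: stop words are removed after lowercasing, so "The" and "the" are treated alike, and the <a> tag is gone before tokenizing ever starts.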

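3. Types

A type groups similar data within an index. Documents of different types are alike, but they may have different fields, and different properties controlling how those fields are indexed and analyzed.

When Lucene builds the underlying index, every field value is stored as opaque bytes with no type distinction. Lucene has no concept of a type. Elasticsearch stores the type as an ordinary field of the document, _type, and filters on _type to select documents of a given type.

Because all the types in an index are stored together, fields with the same name in different types of one index cannot have different data types or other settings; there would be no way to handle that.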

For example:

PUT /my_index/electronic_goods/1
{
  "name": "geli kongtiao",
  "price": 1999.0,
  "service_period": "one year"
}

PUT /my_index/fresh_goods/2
{
  "name": "aozhou dalongxia",
  "price": 199.0,
  "eat_period": "one week"
}

Underneath, the mapping is stored roughly like this:

{
  "my_index": {
    "mappings": {
      "_type": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "price": {
        "type": "double"
      },
      "service_period": {
        "type": "text"
      },
      "eat_period": {
        "type": "text"
      }
    }
  }
}

And the two documents are stored roughly like this:

{
  "_type": "electronic_goods",
  "name": "geli kongtiao",
  "price": 1999.0,
  "service_period": "one year",
  "eat_period": ""
}

{
  "_type": "fresh_goods",
  "name": "aozhou dalongxia",
  "price": 199.0,
  "service_period": "",
  "eat_period": "one week"
}

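Best practice: put types with similar structure in the same index. These types should share most of their fields.

If you put two types with completely different fields into one index, at least half of the fields of every document are empty in the underlying Lucene index, which causes serious performance problems.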
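The cost of mixing unrelated types can be illustrated with a short Python sketch that follows the article's simplified picture: all types share one flat field list, and a document gets an empty value for every field it does not use. The helper names flatten_for_lucene and empty_ratio are hypothetical, and the model says nothing about how Lucene physically encodes missing values.

```python
def flatten_for_lucene(docs_by_type):
    """Merge documents of several types into rows over one shared field list."""
    all_fields = []
    for docs in docs_by_type.values():
        for doc in docs:
            for field in doc:
                if field not in all_fields:
                    all_fields.append(field)
    rows = []
    for type_name, docs in docs_by_type.items():
        for doc in docs:
            row = {"_type": type_name}  # the type becomes an ordinary field
            for field in all_fields:
                row[field] = doc.get(field, "")
            rows.append(row)
    return rows


def empty_ratio(rows):
    """Fraction of field slots (excluding _type) that hold no value."""
    values = [value for row in rows for key, value in row.items() if key != "_type"]
    return sum(1 for value in values if value == "") / len(values)


docs_by_type = {
    "electronic_goods": [{"name": "geli kongtiao", "price": 1999.0, "service_period": "one year"}],
    "fresh_goods": [{"name": "aozhou dalongxia", "price": 199.0, "eat_period": "one week"}],
}
rows = flatten_for_lucene(docs_by_type)
print(empty_ratio(rows))  # prints 0.25
```

The two similar types above waste a quarter of the slots. Two types with no fields in common would waste half of them in this model, which is the situation the best practice warns against.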

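4. Mapping

The settings that make up a mapping are collectively called the root object.

The root object is the mapping JSON for one type. It includes properties, metadata fields (_id, _source, _type), settings (such as analyzer), and other settings (such as include_in_all).

Format:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {}
    }
  }
}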

(1) properties

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "text"
    }
  }
}

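(2) _source

Benefits:

(1) A query returns the complete document directly, with no need to fetch the document ID first and then send a second request for the document

(2) Partial update is built on _source

(3) Reindexing works directly from _source, with no need to query the data from a database (or other external store) and modify it

(4) The returned fields can be customized based on _source

(5) Debugging a query is easier, because _source is directly visible

If you need none of these benefits, you can disable _source:

PUT /my_index/_mapping/my_type2
{
  "_source": {"enabled": false}
}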
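Why partial update depends on _source can be shown with a tiny in-memory sketch: the stored source is read, the changed fields are merged in, and the whole document is written back as a new version. The store layout and the function partial_update are assumptions made for illustration, and the merge here is shallow, which is simpler than what Elasticsearch does for nested objects.

```python
def partial_update(store, doc_id, partial):
    """Read the stored _source, merge the partial fields, write a new version."""
    current = store[doc_id]
    source = dict(current["_source"])  # without _source there is nothing to start from
    source.update(partial)             # shallow merge of the changed fields
    store[doc_id] = {"_source": source, "_version": current["_version"] + 1}
    return store[doc_id]


# Hypothetical stored document
store = {"1": {"_source": {"title": "my article", "views": 1}, "_version": 1}}
print(partial_update(store, "1", {"views": 2}))
```

If _source were disabled, the first step would have no original document to read, which is why disabling it also gives up partial update and reindexing from the index itself.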

(3) _all

The values of all fields are packed together into a single _all field, which is then indexed. A search that names no field runs against _all.

To disable it:

PUT /my_index/_mapping/my_type3
{
  "_all": {"enabled": false}
}

You can also set include_in_all at the field level to control whether that field's value is included in _all:

PUT /my_index/_mapping/my_type4
{
  "properties": {
    "my_field": {
      "type": "text",
      "include_in_all": false
    }
  }
}
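As a rough picture of what _all holds, the sketch below joins the values of a document's fields into one string and skips any field whose mapping sets include_in_all to false. The function build_all_field is hypothetical and the joining with spaces is a simplification chosen for this example.

```python
def build_all_field(doc, properties):
    """Concatenate field values, honoring a per-field include_in_all flag."""
    parts = []
    for field, value in doc.items():
        if properties.get(field, {}).get("include_in_all", True):
            parts.append(str(value))
    return " ".join(parts)


properties = {"my_field": {"type": "text", "include_in_all": False}}
doc = {"title": "my article", "my_field": "internal note"}
print(build_all_field(doc, properties))  # prints my article
```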

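5. Customizing automatic mapping

(1) The dynamic policy

The dynamic setting, for example "dynamic": "strict", accepts three values:

true: an unknown field triggers dynamic mapping

false: an unknown field is ignored

strict: an unknown field causes an error

PUT /my_index
{
  "mappings": {
    "my_type": {
      "dynamic": "strict",
      "properties": {
        "title": {
          "type": "text"
        },
        "address": {
          "type": "object",
          "dynamic": "true"
        }
      }
    }
  }
}

With this mapping, the following request fails: content is not declared, and the top level of the type is strict. The new fields inside address would be accepted, because address sets dynamic to true.

PUT /my_index/my_type/1
{
  "title": "my article",
  "content": "this is my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}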
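The three dynamic values, and the way an inner object can override the policy of its parent, can be modeled in a few lines of Python. This is a sketch under simplifying assumptions: apply_dynamic and StrictMappingError are invented names, every newly seen leaf field is mapped as text, and the real type detection rules are ignored.

```python
class StrictMappingError(Exception):
    """Raised when a strict mapping sees an undeclared field."""


def apply_dynamic(doc, mapping, inherited="true"):
    """Return the fields that get indexed, adding new ones to mapping when allowed."""
    policy = str(mapping.get("dynamic", inherited)).lower()
    properties = mapping.setdefault("properties", {})
    indexed = {}
    for field, value in doc.items():
        if field not in properties:
            if policy == "strict":
                raise StrictMappingError(f"unknown field {field!r}")
            if policy == "false":
                continue  # unknown field is skipped, not indexed
            # Simplification: every new leaf field becomes "text"
            if isinstance(value, dict):
                properties[field] = {"type": "object"}
            else:
                properties[field] = {"type": "text"}
        if isinstance(value, dict):
            # Inner objects inherit the policy unless they set their own
            indexed[field] = apply_dynamic(value, properties[field], policy)
        else:
            indexed[field] = value
    return indexed


mapping = {
    "dynamic": "strict",
    "properties": {
        "title": {"type": "text"},
        "address": {"type": "object", "dynamic": "true"},
    },
}
```

Calling apply_dynamic with a document that has only title and address succeeds and adds province and city under address. Adding a content field raises StrictMappingError, which is the failure the request above runs into.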

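6. Customizing the dynamic mapping policy

(1) date_detection

By default, Elasticsearch recognizes dates in certain formats, such as yyyy-MM-dd. If the first value to arrive for a field is something like 2017-01-01, the field is dynamically mapped as date, and a later value such as "hello world" then causes an error. You can turn off date_detection for a type and, where needed, explicitly map individual fields as date. What if a field has already been mapped as date by mistake while a Java application is using the index in production? Section 7 explains how to handle that.

PUT /my_index/_mapping/my_type
{
  "date_detection": false
}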
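The date_detection trap is easy to reproduce in a toy model: the first value decides the field's type, and later values must fit it. The sketch below assumes a single date format (yyyy-MM-dd, written as %Y-%m-%d in Python) and two possible types, date and text. The function names are hypothetical.

```python
from datetime import datetime


def looks_like_date(value):
    """True if the string parses as yyyy-MM-dd."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def index_value(mapping, field, value, date_detection=True):
    """Map the field on first sight, then check each value against the mapping."""
    if field not in mapping:
        detected = date_detection and looks_like_date(value)
        mapping[field] = "date" if detected else "text"
    if mapping[field] == "date" and not looks_like_date(value):
        raise ValueError(f"cannot parse {value!r} as a date for field {field!r}")
    return mapping[field]


mapping = {}
print(index_value(mapping, "title", "2017-01-01"))  # prints date
```

After that first call, index_value(mapping, "title", "hello world") raises an error in this model. Starting from an empty mapping with date_detection=False, the same two values are both accepted as text.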

(2) Defining your own dynamic mapping template (type level)

PUT /my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "en": {
            "match": "*_en",
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      ]
    }
  }
}

PUT /my_index/my_type/1
{
  "title": "this is my first article"
}

PUT /my_index/my_type/2
{
  "title_en": "this is my first article"
}

Search for "is" in title, then run the same query with title_en in place of title:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "is"
    }
  }
}

title matches no dynamic template, so it gets the default standard analyzer, which does not remove stop words. "is" goes into the inverted index, and searching for "is" finds the document.

title_en matches the dynamic template, so it gets the english analyzer, which removes stop words. "is" is filtered out, and searching for "is" finds nothing.
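The template lookup and its effect on the search for "is" can be sketched as follows. The sketch matches field names against the *_en pattern with fnmatch, picks an analyzer name, and applies a deliberately tiny stop-word list. TEMPLATES, pick_analyzer, and analyze are invented for the example, the stop-word set is only a subset of the English list, and the stemming done by the real english analyzer is left out.

```python
import re
from fnmatch import fnmatchcase

# Illustrative subset of English stop words, not the full list
ENGLISH_STOPWORDS = {"a", "an", "is", "the", "this"}

# Mirrors the "en" template: fields ending in _en get the english analyzer
TEMPLATES = [{"match": "*_en", "analyzer": "english"}]


def pick_analyzer(field):
    """Return the analyzer of the first template whose pattern matches the field."""
    for template in TEMPLATES:
        if fnmatchcase(field, template["match"]):
            return template["analyzer"]
    return "standard"


def analyze(field, text):
    """Tokenize and lowercase; drop stop words only for the english analyzer."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    if pick_analyzer(field) == "english":
        tokens = [token for token in tokens if token not in ENGLISH_STOPWORDS]
    return tokens


print(analyze("title", "this is my first article"))
print(analyze("title_en", "this is my first article"))
```

Only the tokens that survive analysis end up in the inverted index, so a term that was dropped at index time for title_en cannot be matched at search time.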

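(3) Defining your own default mapping template (index level)

Settings under _default_ apply to every type in the index unless a type overrides them. Here _all is disabled by default and enabled again for blog:

PUT /my_index
{
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    },
    "blog": {
      "_all": { "enabled": true }
    }
  }
}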
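How a type's own settings override the _default_ template can be pictured as a dictionary merge. The function effective_mapping is a hypothetical helper, and the shallow merge used here is a simplification of whatever merging Elasticsearch performs internally.

```python
def effective_mapping(mappings, type_name):
    """Start from _default_, then let the type's own settings win."""
    merged = dict(mappings.get("_default_", {}))
    merged.update(mappings.get(type_name, {}))
    return merged


mappings = {
    "_default_": {"_all": {"enabled": False}},
    "blog": {"_all": {"enabled": True}},
}
print(effective_mapping(mappings, "blog"))
print(effective_mapping(mappings, "news"))  # "news" is a made-up type with no overrides
```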

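7. Rebuilding an index

A field's mapping cannot be changed once it is set. To change a field, create a new index with the new mapping, read the data out in batches, and write it into the new index with the bulk API.

Problem:

At first the index relied on dynamic mapping, and some of the inserted data happened to look like dates, such as 2017-01-01. The title field was therefore mapped as date automatically, although it should have been a string type. When string values for title are added later, they cause errors.

Solution:

(1) Reindex: create a new index, read the data out of the old index, and import it into the new one. Suppose the old index is my_index and the new one is my_index_new, and a Java application is already working against my_index. Do we have to stop the Java application, change the index it uses to my_index_new, and restart it? That would mean downtime for the application and reduced availability.

(2) Instead, give the Java application an alias that points to the old index. The application works through the goods_index alias, which for now points to my_index:

PUT /my_index/_alias/goods_index

(3) Create a new index with title mapped as text:

PUT /my_index_new
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}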

(4) Use the scroll API to read the data out in batches:

GET /my_index/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "sort": ["_doc"],
  "size": 1
}

The response looks like this:

{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAADpAFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAA6QRY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAAOkIWNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAADpDFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAA6RBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3",
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "title": "2017-01-02"
        },
        "sort": [
          0
        ]
      }
    ]
  }
}

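(5) Use the bulk API to write each batch returned by scroll into the new index:

POST /_bulk
{ "index": { "_index": "my_index_new", "_type": "my_type", "_id": "2" }}
{ "title": "2017-01-02" }

Does this mean processing documents one at a time? No. The size of 1 in step (4) is only for demonstration. With a larger size, each scroll response carries a whole batch, and that batch goes into a single bulk request. Repeat steps (4) and (5), passing the returned _scroll_id to fetch the next batch, until a scroll response comes back with no hits.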
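The scroll-then-bulk loop can be sketched in Python without a cluster: each scroll response contributes a list of hits, and each list is turned into one newline-delimited bulk body. The functions hits_to_bulk_body and reindex are hypothetical helpers, and scroll_pages stands in for the sequence of scroll responses. Actually sending the bodies to /_bulk is left out on purpose.

```python
import json


def hits_to_bulk_body(hits, target_index):
    """Turn one batch of scroll hits into a bulk request body (NDJSON)."""
    lines = []
    for hit in hits:
        action = {"index": {"_index": target_index, "_type": hit["_type"], "_id": hit["_id"]}}
        lines.append(json.dumps(action))          # action line
        lines.append(json.dumps(hit["_source"]))  # document line
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


def reindex(scroll_pages, target_index):
    """Build one bulk body per scroll page; stop at the first empty page."""
    bodies = []
    for hits in scroll_pages:
        if not hits:
            break  # an empty page means the scroll is exhausted
        bodies.append(hits_to_bulk_body(hits, target_index))
    return bodies


# Hypothetical scroll output: two pages of hits, then an empty page
pages = [
    [
        {"_type": "my_type", "_id": "1", "_source": {"title": "2017-01-01"}},
        {"_type": "my_type", "_id": "2", "_source": {"title": "2017-01-02"}},
    ],
    [{"_type": "my_type", "_id": "3", "_source": {"title": "2017-01-03"}}],
    [],
]
for body in reindex(pages, "my_index_new"):
    print(body)
```

Each printed body is what one POST /_bulk request would carry, so three documents are moved with two requests rather than three.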

(6) Switch the goods_index alias over to my_index_new. The Java application now reads the data in the new index through the alias and never has to stop: zero downtime, high availability.

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_index", "alias": "goods_index" }},
    { "add": { "index": "my_index_new", "alias": "goods_index" }}
  ]
}

8. Switching indices transparently to the client with an alias

PUT /my_index_v1/_alias/my_index

The client works against my_index.

After the reindex completes, switch the alias from v1 to v2:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" }},
    { "add": { "index": "my_index_v2", "alias": "my_index" }}
  ]
}
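The reason the remove and add actions are sent in one request is that the client should never see the alias pointing at nothing. A minimal Python model of that behavior: the actions are applied to a copy of the alias table, and the copy replaces the original only if every action succeeds. apply_alias_actions is an invented helper, not a client API.

```python
def apply_alias_actions(aliases, actions):
    """Apply add/remove actions to a copy; the caller swaps it in on success."""
    updated = {name: set(indices) for name, indices in aliases.items()}
    for action in actions:
        kind, spec = next(iter(action.items()))
        targets = updated.setdefault(spec["alias"], set())
        if kind == "add":
            targets.add(spec["index"])
        elif kind == "remove":
            if spec["index"] not in targets:
                raise KeyError(f"{spec['alias']!r} does not point to {spec['index']!r}")
            targets.remove(spec["index"])
        else:
            raise ValueError(f"unsupported action {kind!r}")
    return updated


aliases = {"my_index": {"my_index_v1"}}
actions = [
    {"remove": {"index": "my_index_v1", "alias": "my_index"}},
    {"add": {"index": "my_index_v2", "alias": "my_index"}},
]
aliases = apply_alias_actions(aliases, actions)
print(aliases)  # prints {'my_index': {'my_index_v2'}}
```

Because a failing action raises before anything is returned, the original table stays untouched in that case, which is the all-or-nothing behavior the client relies on.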

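9. More on the inverted index

Structure of the inverted index. For each term it records:

(1) The list of documents that contain the term

(2) The number of documents that contain the term, from which IDF (inverse document frequency) is derived

(3) The number of times the term appears in each document: TF (term frequency)

(4) The position of the term within each document

(5) The length of each document: length norm

(6) The average length of all documents that contain the term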
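The six pieces of information listed above can all be derived from a simple postings structure. The sketch below builds one in Python from a handful of made-up documents. It is a teaching model only: build_inverted_index and term_stats are hypothetical names, tokenization is a plain regular expression, and nothing here reflects how Lucene lays the data out on disk.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} and record every document's length."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = re.findall(r"\w+", text.lower())
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            postings[term].setdefault(doc_id, []).append(position)
    return postings, doc_lengths


def term_stats(postings, doc_lengths, term):
    """Collect the per-term information from the list above."""
    docs = postings.get(term, {})
    average = sum(doc_lengths[d] for d in docs) / len(docs) if docs else 0.0
    return {
        "doc_list": sorted(docs),                                      # (1)
        "doc_freq": len(docs),                                         # (2)
        "term_freq": {d: len(p) for d, p in docs.items()},             # (3)
        "positions": dict(docs),                                       # (4)
        "doc_lengths": {d: doc_lengths[d] for d in docs},              # (5)
        "avg_doc_length": average,                                     # (6)
    }


docs = {1: "hello world hello", 2: "hello java", 3: "spark"}
postings, doc_lengths = build_inverted_index(docs)
print(term_stats(postings, doc_lengths, "hello"))
```

For the term "hello", the sketch reports two matching documents, a term frequency of 2 in document 1 and 1 in document 2, and an average length of 2.5 tokens over those two documents.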

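Advantages of an immutable inverted index:

(1) No locking is needed, which improves concurrency and avoids lock-related problems

(2) Because the data never changes, it stays in the OS cache as long as there is enough memory

(3) The filter cache can stay in memory for the same reason

(4) The index can be compressed, which reduces disk I/O and the memory needed to cache it

Disadvantage of an immutable inverted index:

(1) The whole index has to be rebuilt every time the data changes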

Article link: https://www.haomeiwen.com/subject/nezxwctx.html
