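Summary: Elasticsearch
Reading notes on chapter 3 of 《Elasticsearch搜索引擎构建入门与实战》
Index operations
The main index operations are create, delete, close, open, and alias.
(1) Create an index
The request type is PUT, with the following syntax: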
PUT /${index_name}
{
"settings": {
...
},
"mappings": {
...
}
}
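settings holds the index configuration, such as the number of primary shards and replica shards, and mappings describes how the data is organized. For example, the following statement creates an index with 3 primary shards, 1 replica, and two fields: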
PUT /my_label
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"ent_name": {
"type": "keyword"
},
"score": {
"type": "double"
}
}
}
}
Kibana's Index Management page then shows primaries=3 and replicas=1.
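When the request body is assembled in application code rather than typed into Kibana, it can be built as a plain dictionary. The following is a minimal sketch using only the standard library; the helper name build_create_index_body is hypothetical and not part of any Elasticsearch client.

```python
import json


def build_create_index_body(shards, replicas, fields):
    """Return the JSON body for a PUT /<index_name> request.

    `fields` maps a field name to its Elasticsearch type, e.g. {"score": "double"}.
    This helper is illustrative only; it does not talk to a cluster.
    """
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {name: {"type": es_type} for name, es_type in fields.items()}
        },
    }


# Same settings and fields as the my_label example above.
body = build_create_index_body(3, 1, {"ent_name": "keyword", "score": "double"})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of PUT /my_label with any HTTP client.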
(2) Delete an index
To delete an index, use a DELETE request:
DELETE /my_index
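(3) Close an index
Once an index is closed, ES only stores its data and cannot serve updates or searches until the index is opened again. Closing uses a POST request to the _close endpoint: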
POST /my_label/_close
(4) Open an index
Likewise, send a POST request to the _open endpoint:
POST /my_label/_open
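(5) Index aliases
One or more ES indices can be given an additional name, an alias. The relationship resembles Linux user names and group names: querying the alias (the group) searches several indices at once, instead of querying each index (each user) one by one.
As an example, first create 3 indices, then point the same alias at all three.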
PUT /my_index_1
{
"mappings": {
"properties": {
"title":{
"type": "text"
},
"city":{
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
PUT /my_index_2
{
"mappings": {
"properties": {
"title":{
"type": "text"
},
"city":{
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
PUT /my_index_3
{
"mappings": {
"properties": {
"title":{
"type": "text"
},
"city":{
"type": "keyword"
},
"price": {
"type": "double"
}
}
}
}
Then insert three documents, one per index:
POST /my_index_1/_doc/001
{
"title":"好再来餐厅",
"city": "青岛",
"price": 578.23
}
POST /my_index_3/_doc/001
{
"title":"好再来网吧",
"city": "青岛",
"price": 578.23
}
POST /my_index_2/_doc/001
{
"title":"好再来浴室",
"city": "青岛",
"price": 578.23
}
Alias all three indices, my_index_1, my_index_2, and my_index_3, to my_index_all:
POST /_aliases
{
"actions": [
{
"add": {
"index": "my_index_1",
"alias": "my_index_all"
}
},
{
"add": {
"index": "my_index_2",
"alias": "my_index_all"
}
},
{
"add": {
"index": "my_index_3",
"alias": "my_index_all"
}
}
]
}
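The actions list grows by one entry per index, so it is convenient to generate it. Below is a small sketch, assuming a hypothetical helper build_alias_actions; it only builds the request body for POST /_aliases and covers both the add and remove actions.

```python
import json


def build_alias_actions(action, indices, alias):
    """Build the body for POST /_aliases.

    `action` is "add" or "remove"; one action entry is emitted per index.
    Illustrative helper only, not part of an Elasticsearch client.
    """
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return {"actions": [{action: {"index": name, "alias": alias}} for name in indices]}


indices = ["my_index_1", "my_index_2", "my_index_3"]
add_body = build_alias_actions("add", indices, "my_index_all")
remove_body = build_alias_actions("remove", indices, "my_index_all")
print(json.dumps(add_body, indent=2))
```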
Now fetch the document whose id is 001 through the alias:
GET /my_index_all/_doc/001
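This fails: the alias points at more than one index, so a single-document operation cannot be resolved to one index.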
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
}
],
"type" : "illegal_argument_exception",
"reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
},
"status" : 400
}
Searches with other conditions do work through the alias:
POST /my_index_all/_search
{
"query":{
"match":{
"title": "好再"
}
}
}
Three hits are returned, and each hit reports the index it came from (_index):
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "my_index_1",
"_id" : "001",
"_score" : 0.5753642,
"_source" : {
"title" : "好再来餐厅",
"city" : "青岛",
"price" : 578.23
}
},
{
"_index" : "my_index_2",
"_id" : "001",
"_score" : 0.5753642,
"_source" : {
"title" : "好再来浴室",
"city" : "青岛",
"price" : 578.23
}
},
{
"_index" : "my_index_3",
"_id" : "001",
"_score" : 0.5753642,
"_source" : {
"title" : "好再来网吧",
"city" : "青岛",
"price" : 578.23
}
}
]
}
}
To remove the aliases, use the following syntax:
POST /_aliases
{
"actions":[
{ "remove":{"index": "my_index_1", "alias": "my_index_all"}},
{ "remove":{"index": "my_index_2", "alias": "my_index_all"}},
{ "remove":{"index": "my_index_3", "alias": "my_index_all"}}
]
}
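Mapping operations
A mapping is similar to a table schema in a traditional database. ES can infer data types automatically, but defining the mapping by hand is recommended.
(1) Create a mapping
The basic syntax is shown below; the mapping is created together with the index.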
PUT /${index_name}
{
"mappings": {
"properties": {
"cols1": {"type": ""}
}
}
}
Alternatively, create the index first and add the mapping afterwards with a POST request to the _mapping endpoint:
PUT /my_index_4
POST /my_index_4/_mapping
{
"properties": {
"title": {"type": "text"},
"city": {"type": "keyword"},
"price": {"type": "double"}
}
}
(2) View a mapping
Send a GET request to the _mapping endpoint:
GET /my_index_4/_mapping
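(3) Extend a mapping
The type and attributes of a field that is already defined in a mapping cannot be changed; only new fields can be added. Adding a field uses the same DSL, a POST request to the _mapping endpoint: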
POST /my_index_4/_mapping
{
"properties": {
"degree": {"type": "keyword"}
}
}
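(4) Basic data types
1. keyword type
keyword is a string type that is not tokenized. When building the index, ES puts the whole keyword string into the inverted index as a single term rather than indexing its tokenized parts. keyword is typically used for equality comparison on strings, in filtering, sorting, and aggregation, and is queried with term in the DSL.
For example, filter on a field having a given value: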
GET /my_index_4/_search
{
"query":{
"term": {
"city": {"value": "扬州"}
}
}
}
A match query that does full-text search on only part of a keyword value will not hit any document, for example:
GET /my_index_4/_search
{
"query":{
"match": {
"city": "州"
}
}
}
2. text type
A text field is tokenized, each token is added to the inverted index, and matching documents are scored at search time:
GET /my_label/_search
{
"query": {
"match": {
"title": "好来药酒"
}
}
}
The results are sorted by _score in descending order:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0577903,
"hits" : [
{
"_index" : "my_label",
"_id" : "003",
"_score" : 1.0577903,
"_source" : {
"title" : "好再来药店",
"city" : "青岛",
"price" : 578.23
}
},
{
"_index" : "my_label",
"_id" : "001",
"_score" : 0.8630463,
"_source" : {
"title" : "好再来酒店",
"city" : "青岛",
"price" : 578.23
}
},
{
"_index" : "my_label",
"_id" : "002",
"_score" : 0.36464313,
"_source" : {
"title" : "好再来饭店",
"city" : "青岛",
"price" : 578.23
}
}
]
}
}
A term query on a text field finds nothing, because the indexed terms are the tokens rather than the whole string:
POST /my_label/_search
{
"query": {
"term": {
"title": {"value": "好再来饭店" }
}
}
}
No documents are returned and max_score is null:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
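The difference between keyword and text comes down to which terms end up in the inverted index. The toy model below illustrates this in plain Python. It is a simplification under stated assumptions: a keyword value is indexed as one term, a text value is split into single characters (roughly what the standard analyzer does for Chinese text), and scoring is ignored. The function names are made up for this sketch.

```python
from collections import defaultdict


def analyze(value, field_type):
    """Toy analyzer: keyword keeps the whole string, text splits into characters."""
    if field_type == "keyword":
        return [value]
    return list(value)


def build_index(docs, field, field_type):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in analyze(doc[field], field_type):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """term: look the value up as-is, without analyzing it."""
    return sorted(index.get(value, set()))


def match_query(index, value, field_type):
    """match: analyze the query text the same way as the field, then OR the terms."""
    hits = set()
    for term in analyze(value, field_type):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {
    "001": {"city": "扬州", "title": "好再来饭店"},
    "002": {"city": "青岛", "title": "好再来酒店"},
}
city_index = build_index(docs, "city", "keyword")
title_index = build_index(docs, "title", "text")

print(term_query(city_index, "扬州"))             # ['001']
print(match_query(city_index, "州", "keyword"))   # []
print(match_query(title_index, "饭店", "text"))   # ['001', '002']
print(term_query(title_index, "好再来饭店"))       # []
```

The last two lines mirror the behaviour described above: match on a text field hits documents sharing any token, while term with the full string finds nothing because that string was never indexed as a single term.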
If a text field is mapped with index: false, the field is not indexed; it can only be displayed and cannot be used for search matching.
PUT /hotel/_doc/_mapping
{
"properties": {
"no_index_col": {"type": "text", "index": false}
}
}
Searching that field fails with an error saying it is not indexed. A keyword field with this setting cannot be searched either.
"reason": "Cannot search on field [no_index_col] since it is not indexed
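3. Numeric types
ES supports several numeric types (long, integer, short, byte, double, float, and others). Choose the smallest-range type that still meets the business requirement. For example: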
PUT /my_index_5
{
"mappings":{
"properties": {
"name": {"type": "keyword"},
"age": {"type": "integer"},
"score": {"type": "double"},
"no": {"type": "long"}
}
}
}
Insert a few documents:
POST /my_index_5/_doc/001
{
"name": "xiaogp",
"age":13,
"score": 98.5,
"no": 123456789
}
POST /my_index_5/_doc/002
{
"name": "wangfan",
"age":92,
"score": 33.5,
"no": 123456786
}
POST /my_index_5/_doc/003
{
"name": "xuguangfeng",
"age":33,
"score": 71.5,
"no": 123456788
}
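Numeric types are mainly used with term queries and range queries. For example, find documents whose score is between 60 and 100 (both bounds exclusive, since gt and lt are used):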
POST /my_index_5/_search
{
"query": {
"range": {
"score": {
"gt": 60,
"lt": 100
}
}
}
}
Two documents are returned:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index_5",
"_id" : "001",
"_score" : 1.0,
"_source" : {
"name" : "xiaogp",
"age" : 13,
"score" : 98.5,
"no" : 123456789
}
},
{
"_index" : "my_index_5",
"_id" : "003",
"_score" : 1.0,
"_source" : {
"name" : "xuguangfeng",
"age" : 33,
"score" : 71.5,
"no" : 123456788
}
}
]
}
}
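4. Boolean type
A boolean field is declared as boolean in the mapping and searched with an exact term match. The match value can be the literal true or false, or the string "true" or "false".
# Add a new field to my_index_5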
POST /my_index_5/_mapping
{
"properties": {
"is_good": {"type": "boolean"}
}
}
Re-index document 001 in my_index_5 so that it carries the new field:
POST /my_index_5/_doc/001
{
"name": "xiaogp",
"age":13,
"score": 98.5,
"no": 123456789,
"is_good": "true"
}
Search on the boolean field:
GET /my_index_5/_search
{
"query": {
"term": {
"is_good": {"value": "true"} # the double quotes around true are optional
}
}
}
5. Date type
The date/time type in ES is date. The default formats do not include yyyy-MM-dd HH:mm:ss; supporting it requires a format attribute in the mapping.
PUT /my_label
{
"mappings": {
"properties": {
"ent_name": {"type": "keyword"},
"update_date": {"type": "date"},
"score": {"type": "double"}
}
}
}
Inserting a yyyy-MM-dd value succeeds:
POST /my_label/_doc/001
{
"ent_name": "xiaogp",
"score": 23.3,
"update_date": "2021-01-01"
}
Inserting a yyyyMMdd value also succeeds:
POST /my_label/_doc/002
{
"ent_name": "xiaogp",
"score": 23.3,
"update_date": "20210109"
}
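Looking at the stored data, both writes succeeded although the two values look different, and _source shows each value exactly as it was written. Note that the default date format is strict_date_optional_time||epoch_millis, so 20210109 is most likely accepted as an epoch-milliseconds value rather than as the date 2021-01-09.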
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "001",
"_score" : 1.0,
"_source" : {
"ent_name" : "xiaogp",
"score" : 23.3,
"update_date" : "2021-01-01"
}
},
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "002",
"_score" : 1.0,
"_source" : {
"ent_name" : "xiaogp",
"score" : 23.3,
"update_date" : "20210109"
}
}
Inserting a yyyy-MM-dd HH:mm:ss value fails:
POST /my_label/_doc/003
{
"ent_name": "xiaogp",
"score": 23.3,
"update_date": "2021-01-09 11:11:11"
}
The error reports that the date value could not be parsed:
{
"type": "mapper_parsing_exception",
"reason": "failed to parse field [update_date] of type [date] in document with id '003'"
}
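To insert and display yyyy-MM-dd HH:mm:ss values, the mapping has to declare that format. Because an existing field definition cannot be changed, the index is recreated here: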
PUT /my_label
POST /my_label/_doc/_mapping
{
"properties": {
"ent_name": {"type": "keyword"},
"update_date":
{"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"score": {"type": "double"}
}
}
Inserting the document again now succeeds:
# GET /my_label/_doc/003
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "003",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"ent_name" : "xiaogp",
"score" : 23.3,
"update_date" : "2021-01-09 11:11:11"
}
}
The usual query on a date field is range, for example to find documents within a time range:
GET /my_label/_search
{
"query": {
"range": {
"update_date": {
"gte": "2021-01-01",
"lte": "2022-01-01"
}
}
}
}
The response is:
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "003",
"_score" : 1.0,
"_source" : {
"ent_name" : "xiaogp",
"score" : 33.3,
"update_date" : "2022-01-09 11:11:11"
}
}
]
}
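6. Array type
An array type does not need to be declared; only the element type is defined, for example keyword. When writing data, supply the values in JSON array form.
# Recreate the index; tag is an array field whose elements are all keyword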
PUT /my_label
POST /my_label/_doc/_mapping
{
"properties": {
"ent_name": {"type": "keyword"},
"tag": {"type": "keyword"},
"score": {"type": "double"}
}
}
Insert a document; the tag field uses JSON array syntax:
POST /my_label/_doc/001
{
"ent_name": "xiaogp",
"score": 23.3,
"tag": ["好人", "有钱", "有才"]
}
GET /my_label/_doc/001
The document comes back as:
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "001",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"ent_name" : "xiaogp",
"score" : 23.3,
"tag" : [
"好人",
"有钱",
"有才"
]
}
}
If the value is a JSON array but should be stored as a single string, it has to be escaped; in the Kibana console, triple quotes do this:
POST /my_label/_doc/003
{
"ent_name": "xiaogp",
"score": 23.3,
"tag": """["好人", "有钱", "男人"]"""
}
Kibana also shows the triple quotes when the document is read back:
# GET /my_label/_doc/003
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "003",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"ent_name" : "xiaogp",
"score" : 23.3,
"tag" : """["好人", "有钱", "男人"]"""
}
}
Use the Python client to check whether a triple-quoted string and a directly inserted array can be told apart when reading:
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(hosts="192.168.61.240", port=8200, timeout=200)
>>> es.get(index="my_label", doc_type="_doc", id='001')['_source']['tag']
['好人', '有钱', '有才']
>>> es.get(index="my_label", doc_type="_doc", id='003')['_source']['tag']
'["好人", "有钱", "男人"]'
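The triple quotes are gone on read and the value keeps the string form it was inserted with; a value inserted as an array comes back as a Python list.
Querying an array field is really a boolean (and/or/not) query over its elements. The simplest query finds documents whose array field contains a given keyword: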
GET /my_label/_search
{
"query": {
"term": {
"tag": {
"value": "有才"
}
}
}
}
With a term query, every document whose tag contains '有才' is returned. To search for several values, use terms:
GET /my_label/_search
{
"query": {
"terms": {
"tag": [
"好人",
"有才"
]
}
}
}
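terms takes an array; a document is returned if tag contains any one of the values, which is a union (OR) over the elements. Combining bool with must gives an AND query, the intersection over the elements: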
GET /my_label/_search
{
"query": {
"bool": {
"must": [
{"term": {
"tag": {
"value": "好人"
}
}},
{"term": {
"tag": {
"value": "有才"
}
}}
]
}
}
}
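The set semantics of these three queries can be checked with a small in-memory model. This is only a sketch of the matching logic described above, not how ES executes queries; document 002 and the function names are hypothetical additions for illustration.

```python
def term(doc, field, value):
    """term on an array field: true if any element equals the value."""
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]  # a scalar behaves like a one-element array
    return value in values


def terms(doc, field, candidates):
    """terms: true if the field contains any of the candidates (OR)."""
    return any(term(doc, field, value) for value in candidates)


def bool_must(doc, clauses):
    """bool + must: true only if every clause matches (AND)."""
    return all(clause(doc) for clause in clauses)


def search(docs, predicate):
    return sorted(doc_id for doc_id, doc in docs.items() if predicate(doc))


docs = {
    "001": {"tag": ["好人", "有钱", "有才"]},
    "002": {"tag": ["好人"]},                      # hypothetical extra document
    "003": {"tag": '["好人", "有钱", "男人"]'},      # stored as one string, not an array
}

print(search(docs, lambda d: term(d, "tag", "有才")))            # ['001']
print(search(docs, lambda d: terms(d, "tag", ["好人", "有才"])))  # ['001', '002']
print(search(docs, lambda d: bool_must(d, [
    lambda x: term(x, "tag", "好人"),
    lambda x: term(x, "tag", "有才"),
])))                                                              # ['001']
```

Document 003 never matches because its tag is a single keyword string; the individual words inside it are not separate terms.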
Document operations
(1) Write a single document
Writing a document is a POST request with the following syntax:
POST /${index_name}/_doc/${_id}
{
...
}
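Here the user sets the _id directly rather than using one generated by ES, and the request body is JSON. The _id can also be omitted, in which case ES generates one automatically: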
POST /${index_name}/_doc
{
...
}
For example:
POST /my_label/_doc
{
"title": "123",
"city": "234",
"price": 23.3
}
GET /my_label/_search
The _id in the result was generated by ES:
{
"_index" : "my_label",
"_type" : "_doc",
"_id" : "YIygTn8Bxh2kjPU0z9Pg",
"_score" : 1.0,
"_source" : {
"title" : "123",
"city" : "234",
"price" : 23.3
}
}
(2) Bulk-write documents
Writing several documents at once is also a POST request, for example:
POST /_bulk
{"index": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"title": "123","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"title": "777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "011"}}
{"title": "666","city": "ftyg", "price": 31.3 }
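This writes 3 documents, each taking two lines. The first line of each pair names the target index, _type, and _id. In newer versions _type can be omitted and defaults to _doc, and if _id is omitted one is generated. The response below shows that the writes succeeded: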
{
"took" : 103,
"errors" : false,
"items" : [
{
"index" : {
"_index" : "my_label",
"_type" : "_doc",
"_id" : "009",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 200
}
},
{
"index" : {
"_index" : "my_label",
"_type" : "_doc",
"_id" : "010",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 200
}
},
{
"index" : {
"_index" : "my_label",
"_type" : "_doc",
"_id" : "011",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1,
"status" : 200
}
}
]
}
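The two-lines-per-document layout is newline-delimited JSON, and the body must end with a newline. A minimal sketch for generating such a payload from Python data follows; the helper name build_bulk_body is hypothetical, _type is left out on the assumption of a version where it is optional, and the result is just a string to send to the _bulk endpoint.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize docs (a dict of _id -> source) into a _bulk request body.

    Each document becomes an action line followed by a source line.
    The body is terminated by a newline, which the bulk API requires.
    """
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index_name, "_id": doc_id}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"


docs = {
    "009": {"title": "123", "city": "234", "price": 93.3},
    "010": {"title": "777", "city": "567", "price": 123.3},
    "011": {"title": "666", "city": "ftyg", "price": 31.3},
}
payload = build_bulk_body("my_label", docs)
print(payload, end="")
```

Writing payload to a file gives the same kind of input that curl's --data-binary option expects.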
For large data sets with many lines, using curl on Linux for the bulk insert is recommended. For example, write 3 documents (6 lines) to a file:
# vim bulk_data.json
{"index": {"_index": "my_label", "_type": "_doc", "_id": "012"}}
{"title": "1233333","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "013"}}
{"title": "777777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "014"}}
{"title": "66666","city": "ftyg", "price": 31.3 }
curl -H "Content-Type: application/json" -X POST '192.168.61.240:8200/_bulk?pretty' --data-binary "@bulk_data.json"
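The effect is the same. The curl options and the request parameter used here are:
* -H: a custom header passed to the server, given as a quoted string
* -X: the HTTP request method; curl defaults to GET
* --data-binary: sends the HTTP POST data as-is; when the value is @file_name, carriage returns and newlines in the file are preserved with no conversion
* pretty: asks ES to pretty-print the JSON response
The Linux console prints the response. To suppress it, redirect the output to a file with > or use curl's -o option.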
(3) Update a single document
Updating is also a POST request; append _update to the request path, for example:
POST /my_label/_doc/010/_update
{
"doc": {
"title" : "888"
}
}
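This statement changes only the title field of the document whose _id is 010 and leaves the other fields alone. Without _update, the request overwrites document 010 entirely, leaving only the title field. Updating a nonexistent _id fails with document_missing_exception, so only existing documents can be updated. To update if present and insert if absent, use upsert: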
POST /my_label/_doc/099/_update
{
"doc": {
"title" : "888",
"city": "成都",
"price": 12.3
},
"upsert": {
"title" : "888",
"city": "成都",
"price": 12.3
}
}
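That is, if the document already exists, the doc content is applied as a partial update; if it does not exist, the upsert content is inserted as a new document.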
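The update and upsert behaviour can be modelled with a dictionary standing in for the index. This is a sketch of the semantics only, under the assumption that a partial update is a shallow merge of top-level fields; the function update and the store are hypothetical and do not call ES.

```python
def update(store, doc_id, doc, upsert=None):
    """Model of POST /<index>/_doc/<id>/_update with an optional upsert.

    Existing document: merge `doc` into it (partial update).
    Missing document: insert `upsert` if given, otherwise raise,
    mirroring document_missing_exception.
    """
    if doc_id in store:
        store[doc_id].update(doc)
        return "updated"
    if upsert is None:
        raise KeyError("document_missing_exception: " + doc_id)
    store[doc_id] = dict(upsert)
    return "created"


store = {"010": {"title": "777", "city": "567", "price": 123.3}}

# Partial update: only title changes, city and price are kept.
print(update(store, "010", {"title": "888"}))
# Missing id with upsert: the upsert body becomes the new document.
print(update(store, "099", {"title": "888"},
             upsert={"title": "888", "city": "成都", "price": 12.3}))
print(store)
```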
(4) Bulk-update documents
The bulk statement for updates is similar to the one for inserts, for example:
POST /_bulk
{"update": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"doc": {"title": "999", "city": "郑州"}}
{"update": {"_index": "my_label", "_type": "_doc", "_id": "0123"}}
{"doc": {"title": "999", "city": "郑州"}, "upsert": {"title": "999", "city": "郑州"}}
This updates two documents; the second one falls back to upsert if it does not exist.
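(5) Update documents by condition
Similar to UPDATE ... SET ... WHERE in a relational database, ES provides _update_by_query, with the following syntax:
POST /${index_name}/_update_by_query
{
"query": { // query condition
},
"script":{ // update script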
}
}
A concrete example:
POST /my_label/_update_by_query
{
"query": {
"term": {
"city": {
"value": "郑州"
}
}
},
"script": {
"source": "ctx._source['city']='苏州'",
"lang": "painless"
}
}
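This finds the documents whose city equals 郑州 and updates them all to 苏州. The script is written in painless, the default ES scripting language. If the request body has no query, every document is updated.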
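As a rough mental model, update-by-query is a filter followed by a per-document mutation. The sketch below reproduces that flow over an in-memory dictionary; the function update_by_query and the sample store are hypothetical, and a Python callable stands in for the painless script.

```python
def update_by_query(store, script, query=None):
    """Apply `script` to every document matching `query`.

    `query` is a predicate on the document source; None matches all documents,
    like omitting "query" from the request body. Returns the number updated.
    """
    updated = 0
    for source in store.values():
        if query is None or query(source):
            script(source)
            updated += 1
    return updated


def set_city_to_suzhou(source):
    # Plays the role of: ctx._source['city']='苏州'
    source["city"] = "苏州"


store = {
    "010": {"title": "999", "city": "郑州"},
    "0123": {"title": "999", "city": "郑州"},
    "011": {"title": "666", "city": "ftyg"},
}
updated = update_by_query(store, set_city_to_suzhou,
                          query=lambda source: source.get("city") == "郑州")
print(updated, store)
```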
(6) Delete a single document
Use a DELETE request with the document _id in the request path:
DELETE /my_label/_doc/010
(7) Bulk-delete documents
Bulk deletion also uses a POST request to the _bulk endpoint, for example:
POST /_bulk
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "012"}}
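(8) Delete documents by condition
Similar to DELETE FROM ... WHERE in a relational database, ES provides the _delete_by_query endpoint. Unlike _update_by_query, _delete_by_query needs only a query and no script, because the operation is always a plain delete. For example: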
POST /my_label/_doc/_delete_by_query
{
"query": {
"term": {
"city": {
"value": "苏州"
}
}
}
}