Preface
DSL stands for Domain Specific Language, i.e. a language dedicated to a particular domain.
1. Cluster-level operations
1.1 Check cluster health
GET /_cat/health?v    (?v displays the header row)
The cluster health status is one of three colors:
Green – everything is normal (cluster fully functional)
Yellow – all data is available, but some replicas are not yet allocated (cluster fully functional)
Red – some data is unavailable (cluster partially functional)
1.2 Check node status
GET /_cat/nodes?v
2. Index operations
2.1 List indices and their status
GET /_cat/indices?v
ES contains some indices by default.
field | meaning |
---|---|
health | green (cluster complete), yellow (nodes healthy, some replicas unassigned), red (node unhealthy) |
status | whether the index is usable |
index | index name |
uuid | unique index identifier |
pri | number of primary shards |
rep | number of replicas |
docs.count | document count |
docs.deleted | number of deleted documents |
store.size | total storage size |
pri.store.size | storage size of the primary shards |
2.2 Create an index
API: PUT <index_name>?pretty
PUT movie_index?pretty
This creates an index named "movie_index" with PUT. Appending pretty pretty-prints the JSON response (if any).
Index naming rules:
- Lowercase letters only
- Cannot contain \, /, *, ?, ", <, >, |, spaces, commas, or #
- Colons (:) were allowed before version 7.0, but are discouraged and no longer supported as of 7.0
- Cannot start with -, _, or +
- Cannot be . or ..
- Cannot exceed 255 bytes in length
2.3 Check the shard layout of an index
API: GET /_cat/shards/<index_name>
GET /_cat/shards/movie_index
The default is 5 primary shards with 1 replica each, so 10 shards in total: 5 primaries, each paired with one replica. Note: a primary shard and its replica are never placed on the same node.
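The shard math above can be sketched with a tiny helper (a hypothetical illustration, not an ES API):

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies = primaries * (1 + replicas per primary)."""
    return primaries * (1 + replicas_per_primary)

# The ES 6.x default of 5 primaries and 1 replica per primary yields 10 shards.
print(total_shards(5, 1))  # 10
```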
2.4 Delete an index
API: DELETE /<index_name>
DELETE /movie_index
3. Document operations
3.1 Create documents
Insert documents with IDs 1, 2 and 3 into the movie_index index.
API: PUT /<index_name>/<type_name>/<doc_id>
Note: the document id is not the same thing as the "id" attribute inside the document.
PUT /movie_index/movie/1
{ "id":100,
"name":"operation red sea",
"doubanScore":8.5,
"actorList":[
{"id":1,"name":"zhang yi"},
{"id":2,"name":"hai qing"},
{"id":3,"name":"zhang han yu"}
]
}
PUT /movie_index/movie/2
{
"id":200,
"name":"operation meigong river",
"doubanScore":8.0,
"actorList":[
{"id":3,"name":"zhang han yu"}
]
}
PUT /movie_index/movie/3
{
"id":300,
"name":"incident red sea",
"doubanScore":5.0,
"actorList":[
{"id":4,"name":"zhang san feng"}
]
}
Note that Elasticsearch does not require an index to exist before indexing a document: if the target index does not exist, it is created automatically, with the default 5 primary shards and 1 replica. Each document we create is stored on one primary shard plus its replica, which is why the response shows _shards.total: 2.
3.2 Get a document by id
API: GET /<index_name>/<type_name>/<doc_id>
GET /movie_index/movie/1?pretty
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_version" : 2,
"_seq_no" : 1,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
}
The found field is true, meaning a document with ID 1 was found; the _source field contains the full JSON document.
3.3 Query all documents
API: GET /<index_name>/_search
Kibana displays 10 hits by default; the number can be controlled with size.
GET /movie_index/_search
{
"size":10
}
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 200,
"name" : "operation meigong river",
"doubanScore" : 8.0,
"actorList" : [
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"id" : 300,
"name" : "incident red sea",
"doubanScore" : 5.0,
"actorList" : [
{
"id" : 4,
"name" : "zhang san feng"
}
]
}
}
]
}
}
took: query execution time in milliseconds
_shards.total: how many shards were searched (here, all 5)
3.4 Delete a document by id
API: DELETE /<index_name>/<type_name>/<doc_id>
DELETE /movie_index/movie/3
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_version" : 2,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 4,
"_primary_term" : 1
}
Note the difference between deleting an index and deleting a document:
- Deleting an index frees disk space immediately; there is no "mark as deleted" step.
- Deleting (or updating) a document writes the new state while the old document is only marked as deleted. Whether disk space is actually reclaimed depends on whether the old and new documents sit in the same segment file; ES's background segment merge may physically remove marked documents while merging segment files.
- A merge can also be triggered manually with POST /_forcemerge.
3.5 Replace a document
- PUT (idempotent)
When we add a document with PUT <index_name>/<type_name>/<doc_id> and the document id already exists, running the command again makes Elasticsearch replace the existing document.
PUT /movie_index/movie/3
{
"id":300,
"name":"incident red sea",
"doubanScore":5.0,
"actorList":[
{"id":4,"name":"zhang cuishan"}
]
}
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 6,
"_primary_term" : 1
}
Document id 3 already exists, so its content is replaced.
- POST (not idempotent)
When creating a document, the ID part is optional. If omitted, Elasticsearch generates a random ID and uses it to refer to the document.
POST /movie_index/movie/
{
  "id":300,
  "name":"incident red sea",
  "doubanScore":5.0,
  "actorList":[
    {"id":4,"name":"zhang cuishan"}
  ]
}
{
  "_index" : "movie_index",
  "_type" : "movie",
  "_id" : "jyVMMHUBFYRAUn5_l-Ap",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 7,
  "_primary_term" : 1
}
3.6 Update a document field by id
Besides creating and replacing documents, ES can update an individual field of a document.
Note: Elasticsearch does not actually update in place under the hood; it deletes the old document and indexes a new one.
API:
POST /<index_name>/<type_name>/<doc_id>/_update?pretty
{
  "doc": { "<field_name>": "<new_value>" }
}
(doc is fixed syntax here.)
Requirement: change the name field of document ID 3 to "wudang":
POST /movie_index/movie/3/_update?pretty
{
"doc": {"name":"wudang"}
}
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_version" : 3,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 8,
"_primary_term" : 1
}
3.7 Update documents by query (optional)
POST /movie_index/_update_by_query
{
"query": {
"match":{
"actorList.id":1
}
},
"script": {
"lang": "painless",
"source":"for(int i=0;i<ctx._source.actorList.length;i++){if(ctx._source.actorList[i].id==3){ctx._source.actorList[i].name='tttt'}}"
}
}
{
"took" : 118,
"timed_out" : false,
"total" : 1,
"updated" : 1,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
3.8 Remove a field from a document (optional)
POST /movie_index/movie/1/_update
{
"script" : "ctx._source.remove('name')"
}
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_version" : 4,
"_seq_no" : 3,
"_primary_term" : 1,
"found" : true,
"_source" : {
"doubanScore" : 8.5,
"actorList" : [
{
"name" : "zhang yi",
"id" : 1
},
{
"name" : "hai qing",
"id" : 2
},
{
"name" : "tttt",
"id" : 3
}
],
"id" : 100
}
}
3.9 Delete documents by query (optional)
POST /movie_index/_delete_by_query
{
"query": {
"match_all": {}
}
}
{
"took" : 25,
"timed_out" : false,
"total" : 4,
"deleted" : 4,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
3.10 Bulk operations
Besides operating on single documents, Elasticsearch can also create, update and delete documents in batches via the _bulk API.
API: POST /<index_name>/<type_name>/_bulk?pretty    (_bulk means a bulk operation)
Note: Kibana requires each JSON object of a bulk request to sit on a single line.
Requirement 1: create two documents in the index with one bulk request
POST /movie_index/movie/_bulk
{"index":{"_id":66}}
{"id":300,"name":"incident red sea","doubanScore":5.0,"actorList":[{"id":4,"name":"zhang cuishan"}]}
{"index":{"_id":88}}
{"id":300,"name":"incident red sea","doubanScore":5.0,"actorList":[{"id":4,"name":"zhang cuishan"}]}
{
"took" : 5,
"errors" : false,
"items" : [
{
"index" : {
"_index" : "movie_index",
"_type" : "movie",
"_id" : "66",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 5,
"_primary_term" : 1,
"status" : 201
}
},
{
"index" : {
"_index" : "movie_index",
"_type" : "movie",
"_id" : "88",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201
}
}
]
}
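The "one JSON object per line" requirement exists because _bulk takes a newline-delimited (NDJSON) body. Below is a minimal sketch of how a client might serialize that payload; build_bulk_body is a hypothetical helper, not part of any official ES client:

```python
import json

def build_bulk_body(actions):
    """Serialize (action, source) pairs into the newline-delimited
    JSON body that the _bulk endpoint expects."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = build_bulk_body([
    ({"index": {"_id": 66}}, {"id": 300, "name": "incident red sea"}),
    ({"delete": {"_id": 88}}, None),
])
print(body)
```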
Requirement 2: in one bulk request, first update the first document (ID 66), then delete the second document (ID 88)
POST /movie_index/movie/_bulk
{"update":{"_id":"66"}}
{"doc": { "name": "wudangshanshang" } }
{"delete":{"_id":"88"}}
{
"took" : 8,
"errors" : false,
"items" : [
{
"update" : {
"_index" : "movie_index",
"_type" : "movie",
"_id" : "66",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 6,
"_primary_term" : 1,
"status" : 200
}
},
{
"delete" : {
"_index" : "movie_index",
"_type" : "movie",
"_id" : "88",
"_version" : 2,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 200
}
}
]
}
4. Search operations
4.1 Two ways to pass search parameters
- URI search: pass parameters in the query string
GET /<index_name>/_search?q=*&pretty
For example: GET /movie_index/_search?q=_id:66
This style is not well suited to complex queries; just know it exists.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
- Request body search: pass parameters in the request body
GET /movie_index/_search
{
"query": {
"match_all": {}
}
}
4.2 Query by condition (all documents)
GET movie_index/movie/_search
{
"query":{
"match_all": {}
}
}
4.3 Full-text (analyzed) query — requires the analyzed text type
Before testing, restore movie_index to its initial 3 documents.
GET movie_index/movie/_search
{
"query":{
"match": {"name":"operation red sea"}
}
}
In ES the name field is analyzed (tokenized) and stored as an inverted index. The query string is analyzed the same way and matched against each document's name tokens, so all 3 documents are hits, each with a different score.
Note: ES stores string data with two types, text and keyword:
text: analyzed
keyword: not analyzed
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 0.8630463,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_score" : 0.5753642,
"_source" : {
"id" : 300,
"name" : "incident red sea",
"doubanScore" : 5.0,
"actorList" : [
{
"id" : 4,
"name" : "zhang san feng"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "2",
"_score" : 0.2876821,
"_source" : {
"id" : 200,
"name" : "operation meigong river",
"doubanScore" : 8.0,
"actorList" : [
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
}
]
}
}
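The inverted-index matching described above can be modeled with a toy sketch (this is only an illustration of the idea; Lucene's actual index structures are far more sophisticated):

```python
from collections import defaultdict

# The three movie names from this tutorial's data set.
docs = {
    1: "operation red sea",
    2: "operation meigong river",
    3: "incident red sea",
}

# Build a toy inverted index: token -> set of document ids.
index = defaultdict(set)
for doc_id, name in docs.items():
    for token in name.split():  # for this data, analysis is just a whitespace split
        index[token].add(doc_id)

# A match query tokenizes the input and ORs the tokens together,
# which is why "operation red sea" hits all three documents.
query_tokens = "operation red sea".split()
hits = set().union(*(index[t] for t in query_tokens))
print(sorted(hits))  # [1, 2, 3]
```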
4.4 Full-text query on a sub-field of a nested object
GET movie_index/movie/_search
{
"query":{
"match": {"actorList.name":"zhang han yu"}
}
}
Returns 3 results
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.970927,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 0.970927,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "2",
"_score" : 0.8630463,
"_source" : {
"id" : 200,
"name" : "operation meigong river",
"doubanScore" : 8.0,
"actorList" : [
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_score" : 0.2876821,
"_source" : {
"id" : 300,
"name" : "incident red sea",
"doubanScore" : 5.0,
"actorList" : [
{
"id" : 4,
"name" : "zhang san feng"
}
]
}
}
]
}
}
4.5 Phrase query (similar to LIKE %phrase%)
A phrase query does not match individual tokens independently; the whole phrase must occur in the field, with its words adjacent and in order.
Requirement: find the movies whose actor names contain the phrase zhang han yu.
GET movie_index/movie/_search
{
"query":{
"match_phrase": {"actorList.name":"zhang han yu"}
}
}
Returns 2 results: the documents whose actor names contain zhang han yu.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "2",
"_score" : 0.8630463,
"_source" : {
"id" : 200,
"name" : "operation meigong river",
"doubanScore" : 8.0,
"actorList" : [
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 0.8630463,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
}
]
}
}
4.6 Exact match with term — requires the keyword type
GET movie_index/movie/_search
{
"query":{
"term":{
"actorList.name.keyword":"zhang han yu"
}
}
}
Returns 2 results: the documents with an actor name exactly equal to zhang han yu.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "2",
"_score" : 0.2876821,
"_source" : {
"id" : 200,
"name" : "operation meigong river",
"doubanScore" : 8.0,
"actorList" : [
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
}
]
}
}
4.7 fuzzy query (tolerant matching)
When no term matches exactly, ES can still score and return terms that are very close to the query word (based on edit distance). This costs extra performance, and it does not work particularly well for Chinese.
GET movie_index/movie/_search
{
"query":{
"fuzzy": {"name":"rad"}
}
}
Returns 2 results: both incident red sea and operation red sea match.
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.19178805,
"hits" : [
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "1",
"_score" : 0.19178805,
"_source" : {
"id" : 100,
"name" : "operation red sea",
"doubanScore" : 8.5,
"actorList" : [
{
"id" : 1,
"name" : "zhang yi"
},
{
"id" : 2,
"name" : "hai qing"
},
{
"id" : 3,
"name" : "zhang han yu"
}
]
}
},
{
"_index" : "movie_index",
"_type" : "movie",
"_id" : "3",
"_score" : 0.19178805,
"_source" : {
"id" : 300,
"name" : "incident red sea",
"doubanScore" : 5.0,
"actorList" : [
{
"id" : 4,
"name" : "zhang san feng"
}
]
}
}
]
}
}
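The "closeness" that lets "rad" match "red" above is Levenshtein edit distance. A minimal sketch of the metric itself (Lucene actually evaluates it with a Levenshtein automaton for speed; this is just the underlying definition):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(edit_distance("rad", "red"))  # 1 — within fuzzy tolerance, so "red" matches
```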
4.8 Filtering — query first, then filter (post_filter)
GET movie_index/movie/_search
{
"query":{
"match": {"name":"red"}
},
"post_filter":{
"term": {
"actorList.id": 3
}
}
}
4.9 Filtering — query and filter together (recommended)
GET movie_index/movie/_search
{
"query": {
"bool": {
"must": [
{"match": {
"name": "red"
}}
],
"filter": [
{"term": { "actorList.id": "1"}},
{"term": {"actorList.id": "3"}}
]
}
}
}
4.10 Filtering — by range
GET movie_index/movie/_search
{
"query": {
"range": {
"doubanScore": {
"gte": 6,
"lte": 8.5
}
}
}
}
About the range operators:
operator | meaning |
---|---|
gt | greater than |
lt | less than |
gte | greater than or equal |
lte | less than or equal |
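The semantics of the operators above can be mirrored in a hypothetical helper that turns an ES-style range clause into a Python predicate (illustration only, not an ES API):

```python
# Map each range operator to its comparison.
OPS = {
    "gt":  lambda v, bound: v > bound,
    "lt":  lambda v, bound: v < bound,
    "gte": lambda v, bound: v >= bound,
    "lte": lambda v, bound: v <= bound,
}

def matches_range(value, clause):
    """True when value satisfies every operator in the range clause."""
    return all(OPS[op](value, bound) for op, bound in clause.items())

clause = {"gte": 6, "lte": 8.5}   # same clause as the doubanScore query above
scores = [5.0, 8.0, 8.5]
print([s for s in scores if matches_range(s, clause)])  # [8.0, 8.5]
```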
4.11 Sorting
GET movie_index/movie/_search
{
"query":{
"match": {"name":"red sea"}
},
"sort":
{
"doubanScore": {
"order": "desc"
}
}
}
4.12 Pagination
The from parameter (0-based) specifies the document offset to start from.
The size parameter specifies how many documents to return.
These two parameters are very useful for paginating search results.
Note that from defaults to 0 when not specified.
GET movie_index/movie/_search
{
"query": { "match_all": {} },
"from": 1,
"size": 1
}
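Translating a user-facing page number into from/size is a one-liner; a hypothetical sketch:

```python
def page_params(page: int, page_size: int) -> dict:
    """Translate a 1-based page number into ES from/size parameters."""
    return {"from": (page - 1) * page_size, "size": page_size}

print(page_params(1, 10))  # {'from': 0, 'size': 10}
print(page_params(3, 10))  # {'from': 20, 'size': 10}
```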
4.13 Selecting the returned fields
GET movie_index/movie/_search
{
"query": { "match_all": {} },
"_source": ["name", "doubanScore"]
}
Only the name and doubanScore fields are returned.
4.14 Highlighting
GET movie_index/movie/_search
{
"query":{
"match": {"name":"red sea"}
},
"highlight": {
"fields": {"name":{} }
}
}
The matched terms are highlighted in the response.
4.15 Aggregations
Aggregations provide grouping and statistics over the data, similar to GROUP BY and the aggregate functions in SQL. Elasticsearch can return search hits and aggregation results in the same response, which is both powerful and efficient.
Requirement 1: how many movies has each actor appeared in?
GET movie_index/movie/_search
{
"aggs": {
"myAGG": {
"terms": {
"field": "actorList.name.keyword"
}
}
}
}
aggs: declares an aggregation
myAGG: the name given to the aggregation
terms: a bucketing aggregation, equivalent to GROUP BY
field: the field to group on
Requirement 2: what is the average score of each actor's movies, sorted by score?
GET movie_index/movie/_search
{
"aggs": {
"groupby_actor_id": {
"terms": {
"field": "actorList.name.keyword" ,
"order": {
"avg_score": "desc"
}
},
"aggs": {
"avg_score":{
"avg": {
"field": "doubanScore"
}
}
}
}
}
}
.keyword is a sub-field of a string field that stores an unanalyzed copy of the value. Some operations, such as filters and aggregations, only work on unanalyzed values, which is why the field name must carry the .keyword suffix.
5. Analysis (tokenization)
5.1 Default analysis of English text
GET _analyze
{
"text":"hello world"
}
The text is split into words on whitespace.
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
5.2 Default analysis of Chinese text
GET _analyze
{
"text":"小米手机"
}
By default, the text is split into individual Chinese characters.
{
"tokens" : [
{
"token" : "小",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "米",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "手",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "机",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}
]
}
5.3 Chinese analyzers
The queries above show that the Chinese analysis built into ES simply splits Chinese text into individual characters, with no concept of words at all.
In practice, however, users search by words. If text is segmented into words, queries match user intent much more closely, and searches also run faster.
Comparison of common open-source analyzers (we use the IK analyzer):
analyzer | strengths | weaknesses |
---|---|---|
Smart Chinese Analysis | official plugin | very poor Chinese segmentation quality |
IKAnalyzer | simple to use; supports custom and remote dictionaries | dictionary must be maintained manually; no part-of-speech tagging |
结巴分词 (jieba) | new-word discovery | no part-of-speech tagging |
Ansj中文分词 | good segmentation accuracy; supports part-of-speech tagging | smaller dictionary than HanLP; steeper learning curve |
HanLP | most complete dictionary; very rich feature set | getting the best segmentation takes tuning; steeper learning curve |
5.4 Installing and using the IK analyzer
- Download
https://github.com/medcl/elasticsearch-analysis-ik
- Upload the archive to /opt/software
- Unzip the archive
unzip elasticsearch-analysis-ik-6.6.0.zip -d /opt/module/elasticsearch/plugins/ik
Notes:
use unzip to extract
-d specifies the target directory
the plugin must go under ES's plugins directory, in its own subdirectory
- Inspect the files under /opt/module/elasticsearch/plugins/ik/conf; the dictionaries are simply word lists kept in these files
- Distribute to the other nodes
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik root@node04:/opt/module/elasticsearch/plugins/ik
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik root@node05:/opt/module/elasticsearch/plugins/ik
- Restart ES
es-cluster.sh stop
es-cluster.sh start
- Test
ik_smart
GET movie_index/_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}
]
}
ik_max_word
GET movie_index/_analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
}
]
}
5.5 Custom dictionary — local file
Sometimes the dictionary does not cover project-specific jargon or new internet slang, so it needs to be extended.
Steps:
- Point the plugin at a local custom dictionary
Edit IKAnalyzer.cfg.xml under /opt/module/elasticsearch/plugins/ik/config/
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- local custom dictionary -->
  <entry key="ext_dict">./myword.txt</entry>
  <!-- local custom stop-word dictionary -->
  <entry key="ext_stopwords"></entry>
  <!-- remote custom dictionary -->
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
  <!-- remote custom stop-word dictionary -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
- Create myword.txt in /opt/module/elasticsearch/plugins/ik/config/
[root@node03 config]# vim myword.txt
蓝瘦
蓝瘦香菇
- Distribute the config file and myword.txt
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml root@node04:/opt/module/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml root@node05:/opt/module/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik/config/myword.txt root@node04:/opt/module/elasticsearch/plugins/ik/config/myword.txt
[root@node03 elasticsearch]# scp -r /opt/module/elasticsearch/plugins/ik/config/myword.txt root@node05:/opt/module/elasticsearch/plugins/ik/config/myword.txt
- Restart ES
es-cluster.sh stop
es-cluster.sh start
- Test the analyzer
GET movie_index/_analyze
{
"analyzer": "ik_smart",
"text": "蓝瘦香菇"
}
5.6 Custom dictionary — remote
A remote dictionary is usually served over HTTP as in the flow below; here we simulate the setup with nginx.
(figure: remote custom dictionary setup)
- Edit IKAnalyzer.cfg.xml under /opt/module/elasticsearch/plugins/ik/config/
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- local custom dictionary -->
  <!--<entry key="ext_dict"> </entry>-->
  <!-- local custom stop-word dictionary -->
  <!--<entry key="ext_stopwords"></entry>-->
  <!-- remote custom dictionary -->
  <entry key="remote_ext_dict">http://node03/fenci/myword.txt</entry>
  <!-- remote custom stop-word dictionary -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
Note: comment out the local dictionary entries.
- Distribute the config file
- Configure a static-resource location in nginx.conf
pwd
/opt/module/nginx/conf
[atguigu@node03 conf]$ vim nginx.conf
location /fenci{
root es;
}
- Create an es/fenci directory under /opt/module/nginx/, and create myword.txt inside es/fenci
pwd
/opt/module/nginx/es/fenci
vim myword.txt
蓝瘦
蓝瘦香菇
- Start nginx
/opt/module/nginx/sbin/nginx
- Restart ES and check that nginx is reachable
es-cluster.sh stop
es-cluster.sh start
- Test the analyzer
After a dictionary update, ES applies the new words only to newly indexed data; historical data is not re-analyzed. To re-analyze historical data, run:
POST movies_index_chn/_update_by_query?conflicts=proceed
6. About mappings
A Type can be understood as a table in a relational database; so how is the data type of each field defined?
The data types of the fields in a Type are defined by the mapping. If no mapping is set when the index is created, the system infers the field types from the format of a document, roughly as follows:
- true/false → boolean
- 1020 → long
- 20.1 → float
- "2018-02-01" → date
- "hello world" → text + keyword
By default only text is analyzed; keyword is the unanalyzed string type. Besides being inferred automatically, a mapping can also be defined manually, but only for newly added fields that hold no data yet; once a field contains data, its mapping can no longer be changed.
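The inference rules above can be sketched as a small function (a hypothetical simplification; real dynamic mapping has more rules, such as configurable date formats and numeric detection):

```python
import re

def infer_type(value):
    """Toy version of ES dynamic-mapping type inference."""
    if isinstance(value, bool):       # must check bool before int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            return "date"
        return "text + keyword"
    raise TypeError(f"unhandled sample value: {value!r}")

print(infer_type(True))           # boolean
print(infer_type(1020))           # long
print(infer_type(20.1))           # float
print(infer_type("2018-02-01"))   # date
print(infer_type("hello world"))  # text + keyword
```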
6.1 Index with Chinese analysis — automatic mapping
- Create the documents directly
The index does not exist yet, so creating the documents creates it automatically, and the mapping is defined automatically as well.
PUT /movie_chn_1/movie/1
{ "id":1,
"name":"红海行动",
"doubanScore":8.5,
"actorList":[
{"id":1,"name":"张译"},
{"id":2,"name":"海清"},
{"id":3,"name":"张涵予"}
]
}
PUT /movie_chn_1/movie/2
{
"id":2,
"name":"湄公河行动",
"doubanScore":8.0,
"actorList":[
{"id":3,"name":"张涵予"}
]
}
PUT /movie_chn_1/movie/3
{
"id":3,
"name":"红海事件",
"doubanScore":5.0,
"actorList":[
{"id":4,"name":"张三丰"}
]
}
- Test query
GET /movie_chn_1/movie/_search
{
  "query": {
    "match": {
      "name": "海行"
    }
  }
}
{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "movie_chn_1",
        "_type" : "movie",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "id" : 1,
          "name" : "红海行动",
          "doubanScore" : 8.5,
          "actorList" : [
            { "id" : 1, "name" : "张译" },
            { "id" : 2, "name" : "海清" },
            { "id" : 3, "name" : "张涵予" }
          ]
        }
      },
      {
        "_index" : "movie_chn_1",
        "_type" : "movie",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 2,
          "name" : "湄公河行动",
          "doubanScore" : 8.0,
          "actorList" : [
            { "id" : 3, "name" : "张涵予" }
          ]
        }
      },
      {
        "_index" : "movie_chn_1",
        "_type" : "movie",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 3,
          "name" : "红海事件",
          "doubanScore" : 5.0,
          "actorList" : [
            { "id" : 4, "name" : "张三丰" }
          ]
        }
      }
    ]
  }
}
- Conclusion
The query for "海行" hit all three documents because no analyzer was specified when the index was defined; the default analyzer splits Chinese into individual characters.
6.2 Index with Chinese analysis — manual mapping
- Define the index with an explicit mapping
PUT movie_chn_2
{
  "mappings": {
    "movie":{
      "properties": {
        "id":{
          "type": "long"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart"
        },
        "doubanScore":{
          "type": "double"
        },
        "actorList":{
          "properties": {
            "id":{
              "type":"long"
            },
            "name":{
              "type":"keyword"
            }
          }
        }
      }
    }
  }
}
- Put documents into the index
PUT /movie_chn_2/movie/1
{ "id":1,
"name":"红海行动",
"doubanScore":8.5,
"actorList":[
{"id":1,"name":"张译"},
{"id":2,"name":"海清"},
{"id":3,"name":"张涵予"}
]
}
PUT /movie_chn_2/movie/2
{
"id":2,
"name":"湄公河行动",
"doubanScore":8.0,
"actorList":[
{"id":3,"name":"张涵予"}
]
}
PUT /movie_chn_2/movie/3
{
"id":3,
"name":"红海事件",
"doubanScore":5.0,
"actorList":[
{"id":4,"name":"张三丰"}
]
}
- Inspect the manually defined mapping
GET movie_chn_2/_mapping
{
"movie_chn_2" : {
"mappings" : {
"movie" : {
"properties" : {
"actorList" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "keyword"
}
}
},
"doubanScore" : {
"type" : "double"
},
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer" : "ik_smart"
}
}
}
}
}
}
- Conclusion
Running the same "海行" query against this index returns no hits, because the index was created with the ik analyzer, which never produces the token 海行.
6.3 Copying index data (reindex)
Powerful as Elasticsearch is, it cannot modify an existing mapping dynamically, so a structural change sometimes forces us to recreate the index.
For this, Elasticsearch provides the _reindex command, which copies a snapshot of one index's data into another index. By default a document with the same _id overwrites the existing one (this normally only matters when copying two indices into one). Use POST _reindex to perform the copy:
POST _reindex
{
"source": {
"index": "my_index_name"
},
"dest": {
"index": "my_index_name_new"
}
}
7. Index aliases (_aliases)
An index alias is like a shortcut or symlink: it can point to one or more indices, and it can be used with any API that expects an index name.
7.1 Create an index alias
- Declare it when creating the index
PUT <index_name>
{
  "aliases": {
    "<alias_name>": {}
  }
}
#Create the index with a manual mapping and an alias
PUT movie_chn_3
{
"aliases": {
"movie_chn_3_aliase": {}
},
"mappings": {
"movie":{
"properties": {
"id":{
"type": "long"
},
"name":{
"type": "text",
"analyzer": "ik_smart"
},
"doubanScore":{
"type": "double"
},
"actorList":{
"properties": {
"id":{
"type":"long"
},
"name":{
"type":"keyword"
}
}
}
}
}
}
}
- Add an alias to an existing index
POST _aliases
{
  "actions": [
    { "add":{ "index": "<index_name>", "alias": "<alias_name>" }}
  ]
}
#Add an alias to movie_chn_3
POST _aliases
{
"actions": [
{ "add":{ "index": "movie_chn_3", "alias": "movie_chn_3_a2" }}
]
}
7.2 List aliases
GET _cat/aliases?v
alias index filter routing.index routing.search
movie_chn_3_a2 movie_chn_3 - - -
movie_chn_3_aliase movie_chn_3 - - -
.kibana .kibana_1 - - -
7.3 Query through an alias
No different from querying a regular index:
GET <alias_name>/_search
7.4 Remove an alias from an index
POST _aliases
{
  "actions": [
    { "remove": { "index": "<index_name>", "alias": "<alias_name>" }}
  ]
}
POST _aliases
{
"actions": [
{ "remove": { "index": "movie_chn_3", "alias": "movie_chn_3_aliase" }}
]
}
7.5 Use cases
- Group multiple indices (e.g. last_three_months)
POST _aliases
{
"actions": [
{ "add": { "index": "movie_chn_1", "alias": "movie_chn_query" }},
{ "add": { "index": "movie_chn_2", "alias": "movie_chn_query" }}
]
}
GET movie_chn_query/_search
- Create a view over a subset of an index
Equivalent to attaching filter conditions to the index, narrowing the query scope
POST _aliases
{
"actions": [
{
"add":
{
"index": "movie_chn_1",
"alias": "movie_chn_1_sub_query",
"filter": {
"term": { "actorList.id": "4"}
}
}
}
]
}
GET movie_chn_1_sub_query/_search
- Switch seamlessly from one index to another on a running cluster
POST /_aliases
{
"actions": [
{ "remove": { "index": "movie_chn_1", "alias": "movie_chn_query" }},
{ "remove": { "index": "movie_chn_2", "alias": "movie_chn_query" }},
{ "add": { "index": "movie_chn_3", "alias": "movie_chn_query" }}
]
}
The whole switch is atomic, so there is no window with missing or duplicated data.
8. Index templates
8.1 Create an index template
PUT _template/template_movie2020
{
"index_patterns": ["movie_test*"],
"settings": {
"number_of_shards": 1
},
"aliases" : {
"{index}-query": {},
"movie_test-query":{}
},
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "keyword"
},
"movie_name": {
"type": "text",
"analyzer": "ik_smart"
}
}
}
}
}
"index_patterns": ["movie_test*"] means: whenever data is written to an index whose name starts with movie_test, and the index does not exist yet, ES creates it automatically from this template.
In "aliases", {index} expands to the name of the index actually being created. Two aliases are therefore created: one derived from the concrete index name, and one fixed, global alias.
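The pattern matching behind index_patterns is plain wildcard matching over index names, which can be sketched with Python's fnmatch (an illustration, not ES's internal implementation):

```python
from fnmatch import fnmatch

# "movie_test*" matches any index name starting with "movie_test".
pattern = "movie_test*"
for name in ["movie_test_202011", "movie_test", "movie_index"]:
    print(name, fnmatch(name, pattern))
```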
8.2 Testing
- Write a document to the index
POST movie_test_202011/_doc
{
  "id":"333",
  "movie_name":"zhang3"
}
- Check the index's mapping; it was created from our index template
GET movie_test_202011-query/_mapping
- Query the data via the fixed alias from the template
GET movie_test-query/_search
8.3 List the templates in the system
GET _cat/templates
8.4 Show a template's details
GET _template/template_movie2020
or
GET _template/template_movie*
8.5 Use cases
- Splitting indices
Splitting an index means dividing one business index into several indices by time interval.
For example, order_info becomes order_info_20200101, order_info_20200102, …
This has two benefits:
- Flexibility for structural changes
ES does not allow an existing field's structure to be modified. But in practice an index's structure and settings inevitably evolve; with split indices, the change only needs to apply to the next interval's index, while earlier indices keep their old shape. This provides a degree of flexibility. To achieve it, simply re-create the template on the day the change is needed.
- Narrower query scope
Queries rarely span the full time range, so splitting indices physically reduces the amount of data scanned, which is also a performance optimization.
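Generating the per-interval index names described above is straightforward; a hypothetical helper:

```python
from datetime import date, timedelta

def dated_index(base: str, day: date) -> str:
    """Build the per-day index name used by the splitting scheme above."""
    return f"{base}_{day.strftime('%Y%m%d')}"

start = date(2020, 1, 1)
names = [dated_index("order_info", start + timedelta(days=i)) for i in range(3)]
print(names)  # ['order_info_20200101', 'order_info_20200102', 'order_info_20200103']
```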
8.6 Caveat
With index templates, an index is usually created when its first document is inserted. If the cluster holds a great many shards, index creation can be slow; if that latency is unacceptable, skip the template and use a scheduled script to create the next day's index one day in advance.