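ElasticSearch Full-Text Search (Part 1)

Author: So_ProbuING | Published 2020-02-24 19:30

ElasticSearch Full-Text Search

Introduction

Elastic website: https://www.elastic.co/cn/
Elastic offers a complete product line:

ElasticSearch
Kibana
Logstash

ElasticSearch

Elasticsearch website: https://www.elastic.co/cn/products/elasticsearch
ElasticSearch is a distributed, RESTful search and data analytics engine that handles a growing range of use cases.
Key features of ElasticSearch:

Distributed: nodes form a cluster on their own, with no manual cluster setup required
RESTful: every API follows REST principles, which makes it easy to pick up
Near-real-time search: data updates become searchable in Elasticsearch almost immediately

Version

At the time of writing, the latest ElasticSearch version is 6.2.4
It requires JDK 1.8 or later

Installation and configuration (Windows)

Extract the ElasticSearch archive

Edit the configuration file (config/elasticsearch.yml)

Change the paths where index data and log data are stored
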

# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: E:\projectdata\es\data
#
# Path to log files:
#
path.logs: E:\projectdata\es\log

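Run bin\elasticsearch.bat

Once it is running, the bound ports appear in the console output

9300: port for communication between cluster nodes, over TCP
9200: port for client access, over HTTP
We can open it directly in a browser:
http://127.0.0.1:9200

Installing Kibana

What is Kibana

Kibana is a Node.js-based tool for analyzing and visualizing the data in ElasticSearch indices. It uses ElasticSearch's aggregation features to generate all kinds of charts

Installation

Kibana depends on Node, so first check the Node version

node -v

Extract the archive and edit config/kibana.yml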

elasticsearch.url: "http://127.0.0.1:9200"

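Console

The Kibana console interface

Installing the IK analyzer

Extract elasticsearch-analysis-ik-6.2.4.zip, copy the extracted folder into elasticsearch-6.2.4\plugins, and rename the folder to ik

Restart ElasticSearch and the IK analyzer will be loaded

Test

Enter the following in the Kibana console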

GET _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人"
}

Result

{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "中国人",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "中国",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "国人",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
}
]
}

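API

ElasticSearch provides a REST-style API, that is, an HTTP interface, and it also provides client APIs for various languages

REST-style API

Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index.html

Index operations

Basic concepts

ElasticSearch is also a full-text search engine built on Lucene, and at its core it also stores data. Many of its concepts have a rough counterpart in MySQL

ElasticSearch	MySQL
Index (indices)	Database
Type (type)	Table
Document (Document)	Row
Field (Field)	Column

Creating an index

Syntax

ElasticSearch uses a REST-style API, so every API call is simply an HTTP request.

Request method: PUT
Request path: /<index name>
Request body: JSON

{
  "settings": {
    "<setting name>": "<setting value>"
  }
}

Viewing an index

Syntax
GET /<index name>

Deleting an index

Syntax
DELETE /<index name>

Types and mapping operations

Once we have an index, we have the equivalent of a database.
A table in a database corresponds to a type in an index.
The column definitions and constraints of a table correspond to field mappings.

Creating a field mapping

Syntax

PUT /<index name>/_mapping/<type name>
{
  "properties": {
    "<field name>": {
      "type": "<type>",
      "index": true,
      "store": true,
      "analyzer": "<analyzer>"
    }
  }
}

type: the data type, for example text, long, short, date, integer, or object
index: whether the field is indexed; defaults to true
store: whether the field is stored separately; defaults to false
analyzer: the analyzer to use; ik_max_word selects the IK analyzer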

PUT heima/_mapping/goods
{
  "properties": {
    "title":{
      "type": "text"
      , "analyzer": "ik_max_word"
    },
    "images":{
      "type": "keyword",
      "index": false
    },
    "price":{
      "type": "float"
    }
  }
}

Response

{
  "acknowledged": true
}

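Viewing mappings

Syntax
View the mappings of all types in an index:
GET /<index name>/_mapping
To view the mapping of a single type, append the type name to the path:

GET /<index name>/_mapping/<type name>

Example
View all mappings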

GET /heima/_mapping

Response

{
  "heima": {
    "mappings": {
      "goods": {
        "properties": {
          "images": {
            "type": "keyword",
            "index": false
          },
          "price": {
            "type": "float"
          },
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    }
  }
}

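Mapping properties in detail

type

String types
- text: analyzed (split into terms); cannot be used in aggregations by default
- keyword: not analyzed; the value is matched as a whole, and it can be used in aggregations
Numerical: numeric types
- Basic types: long, integer, short, byte, double, float, half_float
- Fixed-precision floating-point type: scaled_float
  - Requires a scaling factor such as 10 or 100. Elasticsearch multiplies the real value by this factor before storing it and divides it back out when reading
Date: date type
Elasticsearch can format a date as a string for storage, but we usually store it as a millisecond timestamp in a long field, which saves space
Array: array type
- When matching, the document matches if any one element matches
- When sorting, ascending order uses the smallest value in the array and descending order uses the largest
Object: object type

{
  "name": "Jack",
  "age": 21,
  "girl": {
    "name": "Rose",
    "age": 21
  }
}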
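
The scaled_float idea described in the type list (multiply by a factor when storing, divide when reading) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Elasticsearch's internal code; the helper names to_stored and from_stored, the price 26.99, and the factor 100 are assumptions made for the example.

```python
def to_stored(value: float, scaling_factor: int) -> int:
    """Scale a decimal value up to the whole number that would be stored."""
    return round(value * scaling_factor)


def from_stored(stored: int, scaling_factor: int) -> float:
    """Scale a stored whole number back down to the decimal value."""
    return stored / scaling_factor


stored = to_stored(26.99, 100)
print(stored)                    # 2699
print(from_stored(stored, 100))  # 26.99
```

With a factor of 100, a price keeps two decimal places while being held as a whole number, which is why the factor has to match the precision the field needs.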

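index
index controls whether a field is indexed

true: the field is indexed and can be used for searching and filtering
false: the field is not indexed and cannot be searched
The default is true, so without any configuration every field is indexed
For a field that should not be indexed, set index to false explicitly

store
Whether to store the field value separately
In Lucene and Solr, if a field's store is set to false, the value of that field does not appear in the returned documents
In Elasticsearch, the field values are still returned in results even when store is false
When Elasticsearch indexes a document, it keeps a copy of the original document in a property called _source, and we can filter _source to choose which fields to show
Setting store to true stores one more copy of the data outside _source, which is largely redundant, so we normally leave store at its default of false

boost
Weight. A weight can be assigned when data is added; the higher the weight, the higher the score and the higher the ranking

Creating an index and type in one step

We can also specify the types of an index at the moment the index is created. Basic syntax:

PUT /<index name>
{
  "settings": {
    "<index setting name>": "<index setting value>"
  },
  "mappings": {
    "<type name>": {
      "properties": {
        "<field name>": {
          "<mapping property name>": "<mapping property value>"
        }
      }
    }
  }
}

Example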

PUT /heima
{
  "settings": {},
  "mappings": {
    "goods": {
      "properties": {
        "title":{
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

Result

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "heima"
}

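Document operations

A document is a piece of data under a type in an index. It is indexed according to the mapping rules so that it can be searched later, and it corresponds to a row in a database

Adding a document

Adding a document with an auto-generated id

A POST request adds a document to an existing index

Syntax

POST /<index name>/<type name>
{
  "key": "value"
}

Example

POST /heima/goods/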
{
"title":"小米手机",
"images":"http://image.leyou.com/12479122.jpg",
"price":2699.00
}

Response:

{
"_index": "heima",
"_type": "goods",
"_id": "r9c1KGMBIhaxtY5rlRKv",
"_version": 1,
"result": "created",
"_shards": {
"total": 3,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 2
}

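The response contains an _id field, which uniquely identifies this document; later create, read, update, and delete operations all rely on it. Here the id is r9c1KGMBIhaxtY5rlRKv, generated automatically by ElasticSearch.

Adding a document with a custom id

POST /<index name>/<type name>/<id>
{
...
}

Viewing a document

GET /heima/goods/r9c1KGMBIhaxtY5rlRKv

Modifying data

Changing the request method of the add request to PUT turns it into a modification. A modification must specify the id

If a document with that id exists, it is replaced
If no document with that id exists, it is created

Deleting data

Deletion uses a DELETE request and likewise works by id

Syntax

DELETE /<index name>/<type name>/<id>

ElasticSearch's smart detection

ElasticSearch does not require any mapping to be set on an index in advance: it infers the type of each field from the incoming data and adds the mapping dynamically.
Add a document through Kibana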

POST /heima/goods/3
{
"title":"超大米手机",
"images":"http://image.leyou.com/12479122.jpg",
"price":3299.00,
"stock": 200,
"saleable":true,
"subTitle":"哈哈"
}

Response

{
  "_index": "heima",
  "_type": "goods",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

View the index mappings

{
  "heima": {
    "mappings": {
      "goods": {
        "properties": {
          "images": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "price": {
            "type": "float"
          },
          "saleable": {
            "type": "boolean"
          },
          "stock": {
            "type": "long"
          },
          "subTitle": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "title": {
            "type": "text",
            "analyzer": "ik_max_word"
          }
        }
      }
    }
  }
}

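Dynamic mapping templates

Dynamic template syntax

A dynamic template has three parts:

Template name
Match condition: any not-yet-mapped field that satisfies the condition is mapped by this rule
Mapping rule: the mapping applied once the match succeeds
Here we automatically map every unmapped string field to the keyword type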

PUT heima3
{
"mappings": {
"goods": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
}
},
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword"
}
}
}
]
}
}
}

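title field: mapped as text with the specified analyzer
Other fields: any field detected as string is mapped as keyword

Queries

Basic queries

Basic syntax

GET /<index name>/_search
{
  "query": {
    "<query type>": {
      "<query condition>": "<condition value>"
    }
  }
}

Here query represents a query object, which can hold different kinds of queries

Query type
- for example: match_all, match, term, range
Query condition: how the condition is written depends on the query type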

Match all (match_all)

Example

GET /heima/_search
{
    "query":{
      "match_all":{}
    }
}

Result

{
  "took": 59,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "超大米手机",
          "images": "http://image.leyou.com/12479122.jpg",
          "price": 3299,
          "stock": 200,
          "saleable": true,
          "subTitle": "哈哈"
        }
      }
    ]
  }
}

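took: time the query took, in milliseconds
timed_out: whether the query timed out
_shards: shard information
hits: overview object for the search results
- total: total number of matching documents
- max_score: the highest document score among all results
- hits: array of result documents, one element per matching document
  - _index: the index
  - _type: the document type
  - _id: the document id
  - _score: the document score
  - _source: the original document data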
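
To show how these response fields are typically consumed, here is a minimal Python sketch that walks hits.hits and collects each document's _source together with its _id. The function name extract_sources is hypothetical, and the response dictionary is a trimmed, hand-copied version of the match_all result, not the output of a live cluster.

```python
from typing import Any, Dict, List


def extract_sources(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the _source of every hit, tagged with its document id."""
    docs = []
    for hit in response["hits"]["hits"]:
        doc = dict(hit["_source"])  # copy so the response is not mutated
        doc["id"] = hit["_id"]
        docs.append(doc)
    return docs


# Trimmed copy of a match_all response (an assumption for illustration).
response = {
    "took": 59,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "heima",
                "_type": "goods",
                "_id": "3",
                "_score": 1,
                "_source": {"title": "超大米手机", "price": 3299},
            }
        ],
    },
}

print(response["hits"]["total"])  # 1
print(extract_sources(response))
```

Note that the outer hits object carries the totals, while the inner hits array carries the documents; mixing up the two levels is a common slip when parsing search responses.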

Match query (match)

A match query analyzes the query text into terms and then searches for them; by default, multiple terms are combined with OR

GET /heima/_search
{
"query":{
"match":{
"title":"小米电视"
            }
        }
}

AND relation
When we need a more precise match, we can change this relation to AND

GET /heima/_search
{
"query":{
"match":{
"title":{"query":"小米电视","operator":"and"}
}
}
}

Term query

A term query is used for exact-value matching. The exact values can be numbers, dates, booleans, or strings that are not analyzed

GET /heima/_search
{
"query":{
"term":{
"price":5000
}
}
}

Query result

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "小米手机9",
          "images": "http://image.leyou.com/12479138.jpg",
          "price": 5000
        }
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "大米手机",
          "images": "http://image.leyou.com/12479112.jpg",
          "price": 5000
        }
      }
    ]
  }
}

Boolean combination (bool)

bool combines other queries through must (AND), must_not (NOT), and should (OR)

GET /heima/_search
{
"query":{
"bool":{
"must": { "match": { "title": "大米" }},
"must_not": { "match": { "title": "电视" }},
"should": { "match": { "title": "手机" }}
}
}
}

Result

{
"took": 10,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "heima",
"_type": "goods",
"_id": "2",
"_score": 0.5753642,
"_source": {
"title": "大米手机",
"images": "http://image.leyou.com/12479122.jpg",
"price": 2899
}
}
]
}
}

Range query (range)

A range query finds numbers or dates that fall within a given interval

GET /heima/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 1000,
        "lt": 2800
      }
    }
  }
}

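Operators accepted by the range query

Operator	Description
gt	greater than
gte	greater than or equal to
lt	less than
lte	less than or equal to

Fuzzy query (fuzzy)

Add a product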

POST /heima/goods/5
{
    "title":"apple手机",
    "images":"http://image.leyou.com/1231231.jpg",
    "price":6899
}

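Fuzzy query: the misspelled term appla is one edit away from apple, so the product added above still matches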

GET /heima/_search
{
    "query":{
        "fuzzy":{
            "title":"appla"
          }
      }
}

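Result filtering

By default, elasticsearch returns every field stored in _source for each document in the search results
To get only some of the fields, we can add a _source filter

Example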

GET /heima/_search
{
  "_source": [
    "title",
    "price"
  ],
  "query": {
    "term": {
      "price": 5000
    }
  }
}

Returned result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "2",
        "_score": 1,
        "_source": {
          "price": 5000,
          "title": "小米手机9"
        }
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "3",
        "_score": 1,
        "_source": {
          "price": 5000,
          "title": "大米手机"
        }
      }
    ]
  }
}

Specifying includes and excludes

We can use:

includes: to specify the fields to show
excludes: to specify the fields to leave out

Example: show only title and price

GET /heima/_search
{
  "_source": {
    "includes": [
      "title",
      "price"
    ]
  },
  "query": {
    "term": {
      "price": 5000
    }
  }
}

Example: leave out images

GET /heima/_search
{
  "_source": {
    "excludes": [
      "images"
    ]
  },
  "query": {
    "term": {
      "price": 5000
    }
  }
}

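Filtering (filter)

Filtering inside a conditional query
Every query clause affects document scoring and ranking. If we need to filter the query results without letting the filter condition affect the score, the filter condition should not be written as a query condition; use filter instead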

GET /heima/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "小米手机"
        }
      },
      "filter": {
        "range": {
          "price": {
            "gt": 2000,
            "lt": 5000
          }
        }
      }
    }
  }
}

Sorting

Single-field sorting

sort lets us sort by different fields, and order specifies the sort direction

GET /heima/_search
{
  "query": {
    "match": {
      "title": "小米手机"
    }
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

Returned result

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "5",
        "_score": null,
        "_source": {
          "title": "apple手机",
          "images": "http://image.leyou.com/1231231.jpg",
          "price": 6899
        },
        "sort": [
          6899
        ]
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "2",
        "_score": null,
        "_source": {
          "title": "小米手机9",
          "images": "http://image.leyou.com/12479138.jpg",
          "price": 5000
        },
        "sort": [
          5000
        ]
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "3",
        "_score": null,
        "_source": {
          "title": "大米手机",
          "images": "http://image.leyou.com/12479112.jpg",
          "price": 5000
        },
        "sort": [
          5000
        ]
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "1",
        "_score": null,
        "_source": {
          "title": "小米电视4A",
          "images": "http://image.leyou.com/12479122.jpg",
          "price": 3899
        },
        "sort": [
          3899
        ]
      }
    ]
  }
}

Multi-field sorting

Suppose we want to query using both price and _score, with the matching results sorted first by price and then by relevance score

GET /heima/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "小米手机"
        }
      },
      "filter": {
        "range": {
          "price": {
            "gt": 1,
            "lt": 300000
          }
        }
      }
    }
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

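Pagination

Pagination in elasticsearch closely resembles MySQL's LIMIT: both take two values

from: the offset of the first result to return, starting at 0
size: the number of results per page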

GET /heima/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "asc"
      }
    }
  ],
  "from": 3,
  "size": 3
}

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "5",
        "_score": null,
        "_source": {
          "title": "apple手机",
          "images": "http://image.leyou.com/1231231.jpg",
          "price": 6899
        },
        "sort": [
          6899
        ]
      }
    ]
  }
}
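
Applications usually think in page numbers rather than offsets, so the from/size pair used in this section is normally derived from a page number. The Python sketch below shows that conversion under the assumption that pages are numbered from 1; the function names page_to_from_size and hits_on_page are hypothetical helpers, not part of any Elasticsearch client.

```python
def page_to_from_size(page: int, page_size: int) -> dict:
    """Convert a 1-based page number into the from/size pair of a search body."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must both be at least 1")
    return {"from": (page - 1) * page_size, "size": page_size}


def hits_on_page(total: int, page: int, page_size: int) -> int:
    """Number of hits a page holds, given the total number of matches."""
    start = (page - 1) * page_size
    return max(0, min(page_size, total - start))


print(page_to_from_size(2, 3))  # {'from': 3, 'size': 3}
print(hits_on_page(4, 2, 3))    # 1
```

Page 2 with a page size of 3 gives from 3 and size 3, the same values as the request in this section, and with 4 matching documents that page holds a single hit, which agrees with the result shown.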

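Highlighting

How highlighting works

The server searches the data and obtains the search results
Every occurrence of the search keywords in the results is wrapped in an agreed tag
The front-end page defines the CSS for that tag in advance, which produces the highlight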

GET /heima/_search
{
  "query": {
    "match": {
      "title": "手机"
    }
  },
  "highlight": {
    "pre_tags": "<em>",
    "post_tags": "</em>",
    "fields": {
      "title": {}
    }
  }
}

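Add a highlight property alongside the match query

pre_tags: the opening tag
post_tags: the closing tag
fields: the fields to highlight
- title: declares that the title field should be highlighted; field-specific settings can follow, or the object can be left empty
Result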

{
  "took": 68,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "title": "apple手机",
          "images": "http://image.leyou.com/1231231.jpg",
          "price": 6899
        },
        "highlight": {
          "title": [
            "apple<em>手机</em>"
          ]
        }
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "title": "小米手机9",
          "images": "http://image.leyou.com/12479138.jpg",
          "price": 5000
        },
        "highlight": {
          "title": [
            "小米<em>手机</em>9"
          ]
        }
      },
      {
        "_index": "heima",
        "_type": "goods",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "大米手机",
          "images": "http://image.leyou.com/12479112.jpg",
          "price": 5000
        },
        "highlight": {
          "title": [
            "大米<em>手机</em>"
          ]
        }
      }
    ]
  }
}

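Aggregations

Aggregations make it very convenient to compute statistics on and analyze data, for example:

Which products are the most popular
The average, highest, and lowest price of these phones
How these phones sell month by month
Such statistics are much more convenient to implement than with SQL in a database, and the queries are very fast, close to real-time search

Basic concepts

Aggregations in ElasticSearch come in several kinds. The two most common are buckets and metrics

Buckets (bucket)
A bucket groups data in some way; each group of data is called a bucket in ES
ElasticSearch provides many ways to divide data into buckets:

Date Histogram Aggregation: groups by date steps; for example, with a step of one week, each week automatically becomes one group
Histogram Aggregation: groups by numeric steps, similar to the date version; it needs the step size (interval)
Terms Aggregation: groups by term; documents with exactly the same term fall into one group
Range Aggregation: groups by numeric or date ranges; specify the start and end of each range and the data is grouped by range
...

Metrics (metrics)
After grouping, we usually run an aggregate computation over the data in each group, such as average, maximum, minimum, or sum. In ES these are called metrics
Some commonly used metric aggregations:

Avg Aggregation: average
Max Aggregation: maximum
Min Aggregation: minimum
Percentiles Aggregation: percentiles
Stats Aggregation: returns avg, max, min, sum, count, and so on in one go
Sum Aggregation: sum
Top hits Aggregation: the top few matching documents
Value Count Aggregation: number of values
...

Create the index: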

PUT /car
{
  "mappings": {
    "orders": {
      "properties": {
        "color": {
          "type": "keyword"
        },
        "make": {
          "type": "keyword"
        }
      }
    }
  }
}

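In ES, fields used for aggregation, sorting, or filtering are handled specially: they must not be analyzed, so they have to use keyword or a numeric type.
Import the data:

POST /car/orders/_bulk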
{ "index": {}}
{ "price" : 10000, "color" : "红", "make" : "本田", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "红", "make" : "本田", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "绿", "make" : "福特", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "蓝", "make" : "丰田", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "绿", "make" : "丰田", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "红", "make" : "本田", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "红", "make" : "宝马", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "蓝", "make" : "福特", "sold" : "2014-02-12" }

Aggregating into buckets

After importing the data, we first divide the cars into buckets by their color

GET /car/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      }
    }
  }
}

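size: number of hits to return; it is set to 0 here because we only need the aggregation results, not the documents
aggs: declares an aggregation; short for aggregations
- popular_colors: a name for this aggregation; any name will do
- terms: the aggregation type; terms divides buckets by term (here, the color we specified)
  - field: the field used to divide the buckets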

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红",
          "doc_count": 4
        },
        {
          "key": "绿",
          "doc_count": 2
        },
        {
          "key": "蓝",
          "doc_count": 2
        }
      ]
    }
  }
}

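Metrics inside buckets

The previous example tells us the number of documents in each bucket, but applications usually need more sophisticated metrics, such as the average price of cars of each color
So we need to tell Elasticsearch which field to use and which metric to compute. This information is nested inside the bucket, and the metric is computed over the documents in that bucket
We now add an average-price metric to the aggregation above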

GET /car/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

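aggs: we add a new aggs inside the previous aggregation, which shows that a metric is also an aggregation
avg_price: the name of the aggregation
avg: the metric type
field: the field the metric is computed on
Result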

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红",
          "doc_count": 4,
          "avg_price": {
            "value": 32500
          }
        },
        {
          "key": "绿",
          "doc_count": 2,
          "avg_price": {
            "value": 21000
          }
        },
        {
          "key": "蓝",
          "doc_count": 2,
          "avg_price": {
            "value": 20000
          }
        }
      ]
    }
  }
}
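
As a sanity check on these numbers, the same grouping and averaging can be reproduced locally in plain Python. This sketch does not talk to Elasticsearch; it only re-runs the arithmetic over the eight cars imported earlier (price and color only), and the function name average_price_by_color is a hypothetical helper.

```python
from collections import defaultdict

# Price and color of the eight cars from the bulk import above.
CARS = [
    {"price": 10000, "color": "红"},
    {"price": 20000, "color": "红"},
    {"price": 30000, "color": "绿"},
    {"price": 15000, "color": "蓝"},
    {"price": 12000, "color": "绿"},
    {"price": 20000, "color": "红"},
    {"price": 80000, "color": "红"},
    {"price": 25000, "color": "蓝"},
]


def average_price_by_color(cars):
    """Group cars by color (the bucket) and average the price (the metric)."""
    totals = defaultdict(lambda: [0, 0])  # color -> [price sum, doc count]
    for car in cars:
        bucket = totals[car["color"]]
        bucket[0] += car["price"]
        bucket[1] += 1
    return {color: s / n for color, (s, n) in totals.items()}


print(average_price_by_color(CARS))
# {'红': 32500.0, '绿': 21000.0, '蓝': 20000.0}
```

The outer grouping plays the role of the terms bucket and the inner average plays the role of the avg metric, which is the nesting the request expresses with aggs inside aggs.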

Buckets nested in buckets

To see which manufacturers the cars of each color come from, we bucket again by the make field

GET /car/_search
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "maker": {
          "terms": {
            "field": "make"
          }
        }
      }
    }
  }
}

Returned result

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "popular_colors": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "红",
          "doc_count": 4,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "本田",
                "doc_count": 3
              },
              {
                "key": "宝马",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 32500
          }
        },
        {
          "key": "绿",
          "doc_count": 2,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "丰田",
                "doc_count": 1
              },
              {
                "key": "福特",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 21000
          }
        },
        {
          "key": "蓝",
          "doc_count": 2,
          "maker": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "丰田",
                "doc_count": 1
              },
              {
                "key": "福特",
                "doc_count": 1
              }
            ]
          },
          "avg_price": {
            "value": 20000
          }
        }
      ]
    }
  }
}

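Other ways to divide buckets

Date Histogram Aggregation: groups by date steps; for example, with a step of one week, each week automatically becomes one group
Histogram Aggregation: groups by numeric steps, similar to the date version
Terms Aggregation: groups by term; documents with exactly the same term fall into one group
Range Aggregation: groups by numeric or date ranges; specify the start and end of each range and the data is grouped by range

Stepped buckets: Histogram

How it works
histogram groups a numeric field by a fixed step size; a step value must be specified to set the width of each step
If a product costs 450, which step does it fall into? The formula is:

bucket_key = Math.floor((value - offset) / interval) * interval + offset

value: the value of the current document
offset: the starting offset, 0 by default
interval: the step size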
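
The formula can be tried out directly. The Python sketch below is a plain transcription of it; the interval of 200 used for the 450 example is an assumption chosen for illustration (the text does not state one), while the second part uses the car prices and the interval of 5000 from this section.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Histogram bucket key: round value down to the start of its step."""
    return math.floor((value - offset) / interval) * interval + offset


# A price of 450 with an assumed interval of 200 lands in the 400 bucket.
print(bucket_key(450, 200))  # 400

# Prices of the eight cars imported earlier, bucketed with interval 5000.
PRICES = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
counts = Counter(bucket_key(p, 5000) for p in PRICES)
print(sorted(counts.items()))
# [(10000, 2), (15000, 1), (20000, 2), (25000, 1), (30000, 1), (80000, 1)]
```

The local count only lists non-empty buckets. A histogram aggregation response may additionally contain empty buckets between the smallest and largest key, each with a doc_count of 0.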

GET /car/_search
{
  "size": 0,
  "aggs": {
    "price": {
      "histogram": {
        "field": "price",
        "interval": 5000
      }
    }
  }
}

Result

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price": {
      "buckets": [
        {
          "key": 10000,
          "doc_count": 2
        },
        {
          "key": 15000,
          "doc_count": 1
        },
        {
          "key": 20000,
          "doc_count": 2
        },
        {
          "key": 25000,
          "doc_count": 1
        },
        {
          "key": 30000,
          "doc_count": 1
        },
        {
          "key": 35000,
          "doc_count": 0
        },
        {
          "key": 40000,
          "doc_count": 0
        },
        {
          "key": 45000,
          "doc_count": 0
        },
        {
          "key": 50000,
          "doc_count": 0
        },
        {
          "key": 55000,
          "doc_count": 0
        },
        {
          "key": 60000,
          "doc_count": 0
        },
        {
          "key": 65000,
          "doc_count": 0
        },
        {
          "key": 70000,
          "doc_count": 0
        },
        {
          "key": 75000,
          "doc_count": 0
        },
        {
          "key": 80000,
          "doc_count": 1
        }
      ]
    }
  }
}

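Range buckets: range

Range bucketing resembles stepped bucketing in that it also groups numbers into intervals, but with range we specify the start and end of each group ourselves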
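
As a rough illustration of range bucketing, the Python sketch below builds a range aggregation body and counts locally how the car prices from this article would fall into each range. The boundaries 20000 and 40000, the aggregation name price_ranges, and both function names are assumptions made for the example. The local count treats from as inclusive and to as exclusive, which is how the range aggregation defines its buckets.

```python
import json


def range_aggregation(field, boundaries, name="price_ranges"):
    """Build a search body with a range aggregation from sorted boundaries."""
    ranges = [{"to": boundaries[0]}]
    for lower, upper in zip(boundaries, boundaries[1:]):
        ranges.append({"from": lower, "to": upper})
    ranges.append({"from": boundaries[-1]})
    return {
        "size": 0,
        "aggs": {name: {"range": {"field": field, "ranges": ranges}}},
    }


def count_in_ranges(values, ranges):
    """Count values per range; from is inclusive and to is exclusive."""
    counts = []
    for r in ranges:
        lower = r.get("from", float("-inf"))
        upper = r.get("to", float("inf"))
        counts.append(sum(1 for v in values if lower <= v < upper))
    return counts


body = range_aggregation("price", [20000, 40000])
print(json.dumps(body, indent=2))

PRICES = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
ranges = body["aggs"]["price_ranges"]["range"]["ranges"]
print(count_in_ranges(PRICES, ranges))  # [3, 4, 1]
```

The printed body is what would be sent with GET /car/_search. Unlike histogram, the ranges do not have to be the same width, so the open-ended first and last ranges are perfectly valid.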
