elasticsearch

Author: Aiibai | Published: 2019-05-13 12:02


https://www.ibm.com/developerworks/cn/java/j-lo-lucene1/index.html

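Elasticsearch is a highly scalable, open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data in near real time.

Basic Concepts

Near Realtime (NRT)
"Real time" here means that the delay between indexing a document and that document becoming searchable is short, normally about 1 second. This differs from a traditional database, where data is queryable as soon as the transaction completes.
Cluster
A cluster is identified by its name, which defaults to elasticsearch and can be changed with the cluster.name setting. A node uses this name to find the cluster it should join.
Node
A node stores data and takes part in the cluster's indexing and search. A node can be given a name; the default is a random UUID. A cluster has four kinds of nodes: master nodes, data nodes, client nodes (load-balancing nodes), and tribe nodes. Each kind is described in detail later.
Index
An index is a collection of documents with somewhat similar characteristics. A cluster can define many indices.
Type
A type is a logical grouping of the documents within an index, and an index used to be able to hold many different types. That capability was removed in 6.0.0.
Document
A document is the basic unit that can be searched, expressed in JSON. Although a document physically resides in an index, it must in fact be assigned to a type within that index.
Shards & Replicas
When an index is created it can be split into multiple shards. Each shard holds part of the index's data and can be stored on a different node as an independent storage unit. Sharding exists mainly for horizontal scalability and for better performance and throughput, and it is transparent to the user. A replica is a copy of a shard. The number of replicas can be set when the index is created and changed dynamically afterwards. A replica is never stored on the same node as the shard it copies. Replicas mainly improve availability and throughput.
By default, Elasticsearch creates 5 shards per index and one replica of each shard. Replicas need at least two nodes: on a single node they are not allocated, and the index status becomes Yellow.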
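To override the defaults, shard and replica counts can be passed in the body of the create-index request. The following is a minimal Python sketch that only builds that body, assuming the standard index settings number_of_shards and number_of_replicas; the helper name index_settings and the values 3 and 2 are hypothetical.

```python
import json


def index_settings(primaries: int, replicas: int) -> dict:
    """Build a create-index request body with explicit shard/replica counts."""
    if primaries < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,    # fixed once the index exists
                "number_of_replicas": replicas,   # can be changed later
            }
        }
    }


# Hypothetical values: 3 primary shards, 2 replicas of each.
body = index_settings(3, 2)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a PUT request to the index name, in the same way as the curl examples later in this article.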

Installation

System requirements
JDK 1.8 or later
Install

$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.0.1.tar.gz
$ tar -xvf elasticsearch-6.0.1.tar.gz
$ cd elasticsearch-6.0.1/bin
$ ./elasticsearch

Verify

curl -X GET "localhost:9200/_cat/health?v"

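Exploring the Cluster

Elasticsearch provides a rich and powerful REST API. With it you can:

Check the health, status, and statistics of the cluster, nodes, and indices.
Administer the cluster, nodes, index data, and metadata.
Perform CRUD operations.
Run advanced searches: paging, sorting, filtering, scripting, aggregations, and more.

Any tool that can send HTTP requests will work; here we use curl or the Console in Kibana. The cluster used here runs on a single server and has four nodes, one of which is a load-balancing node used for the Kibana connection.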

Cluster Health

curl -X GET "localhost:9200/_cat/health?v"

epoch      timestamp cluster   status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1557718511 11:35:11  cluster_1 green           4         3     16   8    0    0        0             0                  -                100.0%

curl -X GET "localhost:9200/_cat/nodes?v"

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.0.67           42          98   1    0.10    0.17     0.27 mdi       *      node-3
192.168.0.67           39          98   1    0.10    0.17     0.27 mdi       -      node-2
192.168.0.67           34          98   1    0.10    0.17     0.27 mdi       -      node-1
192.168.0.67           26          98   1    0.10    0.17     0.27 i         -      node-load-balancing

Status values:
Green: everything is OK.
Yellow: all data is available, but some replicas are not allocated.
Red: some data is unavailable.
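The column output of the health check can be turned into a dictionary so that a script can react to the status. A minimal Python sketch follows, assuming the two-line header/row layout shown above; the names HEALTH_OUTPUT, STATUS_MEANING, and parse_health are hypothetical, and the text is the output copied from this article rather than a live response.

```python
# Output of `_cat/health?v`, copied from the example above.
HEALTH_OUTPUT = """\
epoch      timestamp cluster   status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1557718511 11:35:11  cluster_1 green           4         3     16   8    0    0        0             0                  -                100.0%
"""

STATUS_MEANING = {
    "green": "everything is OK",
    "yellow": "all data is available, but some replicas are not allocated",
    "red": "some data is unavailable",
}


def parse_health(text: str) -> dict:
    """Pair each header column with the value in the single data row."""
    header, row = text.strip().splitlines()
    return dict(zip(header.split(), row.split()))


health = parse_health(HEALTH_OUTPUT)
print(health["cluster"], health["status"], "-", STATUS_MEANING[health["status"]])
```

Splitting on whitespace works here because none of the values contains a space.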

List All Indices

curl -X GET "localhost:9200/_cat/indices?v"

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .kibana                         Izf_5dzjS0G2bt2dhj-qSg   1   1          1            0        8kb            4kb
green  open   .monitoring-kibana-6-2019.05.13 _67gjv8uQW6OTE_13GSIgQ   1   1        204            0    523.1kb        284.4kb
green  open   .monitoring-es-6-2019.05.13     P0ihAKABRBiz1VHO8w-pFg   1   1       2852            0        3mb          1.5mb
green  open   engine                          DEdxPI9MSGWTdZz_aYBMgQ   5   1         21            0    158.4kb         79.2kb

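The health output earlier reported 16 shards in total, and this listing shows where they come from: the pri column adds up to 8 primary shards, and rep is the number of replicas of each primary (1 here), so 8 × (1 + 1) = 16.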
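The shard total can be checked by hand from the pri and rep columns. A small Python sketch, with the values copied from the listing above (the variable names are hypothetical):

```python
# (index name, primary shards, replicas per primary), copied from the listing above.
INDICES = [
    (".kibana", 1, 1),
    (".monitoring-kibana-6-2019.05.13", 1, 1),
    (".monitoring-es-6-2019.05.13", 1, 1),
    ("engine", 5, 1),
]

# Primary shards only: this is the "pri" figure in the health output.
primaries = sum(pri for _, pri, _ in INDICES)

# Each primary plus its replica copies: this is the "shards" figure.
total_shards = sum(pri * (1 + rep) for _, pri, rep in INDICES)

print(primaries, total_shards)
```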

Create an Index

curl -X PUT "localhost:9200/customer?pretty"
curl -X GET "localhost:9200/_cat/indices?v"

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   customer                        IdDeGz5fQg6K_U7D_ZaWGQ   5   1          0            0      1.3kb           690b

No data has been inserted yet, so docs.count is 0.

Index and Query a Document
Let's add some data to the customer index. If the target index does not exist, it is created automatically.

curl -X PUT "localhost:9200/customer/doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "name": "John Doe"
}'
//response
{
  "_index": "customer",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

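In the URL path .../doc/1, doc is the type and 1 is the document ID. The ID is optional: if it is omitted (using POST instead of PUT), a random ID is generated automatically.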
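The path layout described above can be summed up in a small helper. This is a Python sketch under the assumption of the 6.x typed endpoints used in this article; the function name index_request is hypothetical and the code only builds the method and path, it sends nothing.

```python
from typing import Optional, Tuple


def index_request(index: str, doc_type: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the HTTP method and path for indexing one document.

    With an explicit ID the document is PUT at /index/type/id.
    Without one, POST to /index/type lets Elasticsearch generate the ID.
    """
    if doc_id is None:
        return "POST", f"/{index}/{doc_type}"
    return "PUT", f"/{index}/{doc_type}/{doc_id}"


print(index_request("customer", "doc", "1"))
print(index_request("customer", "doc"))
```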

Now retrieve the document we just indexed:

curl -X GET "localhost:9200/customer/doc/1?pretty"
//response
{
  "_index": "customer",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "John Doe"
  }
}

The _source field holds the document data.

Delete an Index

curl -X DELETE "localhost:9200/customer?pretty"

Modifying Data

Indexing/Replacing Documents

curl -X PUT "localhost:9200/customer/doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "name": "Jane Doe"
}
'
//response
{
  "_index": "customer",
  "_type": "doc",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

We ran this same command earlier. If the ID already exists, the command replaces the document; if it does not, the command inserts a new one.

Updating Documents
Besides replacing a document, we can also update it. This is not a true in-place update: Elasticsearch deletes the old document and indexes a new one with the update applied.

POST /customer/doc/1/_update?pretty
{
  "doc": {  "age": 20 }
}
//document after the update
{
  "_index": "customer",
  "_type": "doc",
  "_id": "1",
  "_version": 3,
  "_score": 1,
  "_source": {
    "name": "Jane Doe",
    "age": 20
  }
}

Had we used a replace here instead, the result would be the following (no name field):

{
  "_index": "customer",
  "_type": "doc",
  "_id": "1",
  "_version": 3,
  "_score": 1,
  "_source": {
    "age": 20
  }
}
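The difference between replacing and updating can be modelled with plain dictionaries. The Python sketch below is only a local model of the two behaviours, not Elasticsearch code; the function names replace and partial_update are hypothetical.

```python
import copy


def replace(existing: dict, new_source: dict) -> dict:
    """Index/replace: the new source completely supersedes the old one."""
    return copy.deepcopy(new_source)


def partial_update(existing: dict, partial: dict) -> dict:
    """Update with a partial doc: fields are merged into the old source."""
    merged = copy.deepcopy(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


old = {"name": "Jane Doe"}
print(partial_update(old, {"age": 20}))  # name is kept, age is added
print(replace(old, {"age": 20}))         # name is gone
```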

An update can also run a script:

POST /customer/doc/1/_update?pretty
{
  "script" : "ctx._source.age += 5"
}

ctx._source refers to the source of the document being updated, here the one with id=1.

Delete Document

DELETE /customer/doc/2?pretty

More complex delete conditions can also be supplied; see the Delete By Query API section. To remove every document in an index, deleting the whole index is more efficient.

Batch Processing
Elasticsearch provides the _bulk API for batch operations, which reduces network round trips.
The bulk request below performs two index (replace) operations.

curl -X POST "localhost:9200/customer/doc/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'

The next request updates the document with ID 1 and deletes the document with ID 2.

curl -X POST "localhost:9200/customer/doc/_bulk?pretty" -H 'Content-Type: application/json' -d'
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'

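A bulk request does not stop when one of its operations fails. All remaining operations are still executed, and the response reports the status of each one.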
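Because failures are reported per operation, a client has to build the newline-delimited body and then inspect each item in the response. The Python sketch below illustrates both steps without contacting a server. The helper names bulk_body and failed_items are hypothetical, and sample_response is a hand-written, trimmed stand-in for a real response, so its exact contents are an assumption.

```python
import json


def bulk_body(operations) -> str:
    """Build a _bulk body from (action, metadata, source_or_None) tuples."""
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if source is not None:          # delete has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # the body must end with a newline


def failed_items(response: dict) -> list:
    """Collect (action, id, status) for every item that carries an error."""
    failures = []
    for item in response.get("items", []):
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("_id"), result.get("status")))
    return failures


body = bulk_body([
    ("update", {"_id": "1"}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
    ("delete", {"_id": "2"}, None),
])
print(body, end="")

# Hypothetical, trimmed response: one success and one failed update.
sample_response = {
    "errors": True,
    "items": [
        {"delete": {"_id": "2", "status": 200}},
        {"update": {"_id": "9", "status": 404, "error": {"type": "document_missing_exception"}}},
    ],
}
print(failed_items(sample_response))
```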

Exploring Your Data

Sample Dataset
Below is a document structure, with data, that is closer to a real-world use case.

{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email": "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO"
}

The sample data is available here: https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true . You can also generate your own at www.json-generator.com/

Loading the Sample Dataset

curl -H "Content-Type: application/json" -XPOST 'localhost:9200/bank/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
curl 'localhost:9200/_cat/indices?v'
//response
health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   bank                            u2jDexzNRPe_cH_UtVUozQ   5   1       1000            0    949.4kb        474.7kb

The Search API
There are two ways to run a search: pass the search parameters in the URL, or put them in the request body.
Search all documents in bank, sorted by account_number in ascending order.

curl  -X GET "localhost:9200/bank/_search?q=*&sort=account_number:asc&pretty"
//response
{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": null,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "0",
        "_score": null,
        "_source": {
          "account_number": 0,
          "balance": 16623,
          "firstname": "Bradshaw",
          "lastname": "Mckenzie",
          "age": 29,
          "gender": "F",
          "address": "244 Columbus Place",
          "employer": "Euron",
          "email": "bradshawmckenzie@euron.com",
          "city": "Hobucken",
          "state": "CO"
        },
        "sort": [
          0
        ]
      }
    ]
  }
}

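The fields in the response mean:

took: time taken to execute the search, in milliseconds
timed_out: whether the search timed out
_shards: how many shards were searched
hits: the search results
hits.total: number of matching documents
hits.hits: array of results (the first ten by default)
hits.sort: the sort key of each result; absent when no sort is specified
hits._score and max_score: ignore these for now; they are covered later.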
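The fields listed above can be pulled out of a parsed response with a few dictionary lookups. The Python sketch below works on a trimmed copy of the response shown earlier; the function name summarize is hypothetical. It also accepts the object form of hits.total that Elasticsearch 7 and later return, which goes beyond the 6.x output in this article.

```python
def summarize(response: dict) -> dict:
    """Extract the commonly used fields from a search response."""
    hits = response["hits"]
    total = hits["total"]
    if isinstance(total, dict):
        # 7.x and later: {"value": ..., "relation": ...} instead of a number.
        total = total["value"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "shards_searched": response["_shards"]["total"],
        "total": total,
        "ids": [hit["_id"] for hit in hits["hits"]],
        "sort_keys": [hit.get("sort") for hit in hits["hits"]],
    }


# Trimmed copy of the response shown above.
sample = {
    "took": 26,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 1000,
        "max_score": None,
        "hits": [{"_index": "bank", "_id": "0", "_score": None, "sort": [0]}],
    },
}
summary = summarize(sample)
print(summary)
```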

The same search expressed with a request body:

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}
'

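The REST API is stateless. Once the results are returned, Elasticsearch is done with the request and keeps no server-side state or open cursor for it. This differs from SQL, where you may receive part of the result set first and then go back to the server to fetch the rest.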

Introducing the Query Language
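Elasticsearch provides a JSON-style domain-specific language for running queries, covered in detail in the Query DSL section. Below we use some of its simpler syntax.

Sort by balance in descending order and return the first ten documents.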

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 10,
  "sort": { "balance": { "order": "desc" } }
}
'
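Because the server keeps no cursor between requests, each page of results is fetched with its own independent request that states where to start. A minimal Python sketch of building such request bodies with from and size follows; the function name page_body and the zero-based page numbering are assumptions made for illustration.

```python
def page_body(page: int, size: int = 10) -> dict:
    """Build the request body for one page of results (pages start at 0)."""
    if page < 0 or size < 1:
        raise ValueError("page must be >= 0 and size must be >= 1")
    return {
        "query": {"match_all": {}},
        "from": page * size,   # number of hits to skip
        "size": size,          # number of hits to return
        "sort": {"balance": {"order": "desc"}},
    }


# Three consecutive pages are three unrelated requests.
for page in range(3):
    body = page_body(page)
    print(body["from"], body["size"])
```

Each body would be sent to bank/_search exactly like the curl example above.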

Executing Searches
By default every field of a document is returned. Use _source to list the fields to return, much like the column list of a SQL SELECT.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
'

match query
Return the document whose account_number is 20.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "account_number": 20 } }
}
'

Return documents whose address contains the term mill.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "address": "mill" } }
}
'

Return documents whose address contains the term mill or the term lane.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "address": "mill lane" } }
}
'

Return documents whose address contains the phrase mill lane.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
'
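The difference between match and match_phrase can be illustrated locally. The Python sketch below is a deliberately simplified model: it lowercases and splits on non-alphanumeric characters, which only approximates what a real analyzer does, and it ignores scoring entirely. The function names and the sample addresses are hypothetical.

```python
import re


def tokens(text: str) -> list:
    """Very rough stand-in for an analyzer: lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(field_text: str, query: str) -> bool:
    """match: true if ANY query term appears in the field."""
    field_terms = set(tokens(field_text))
    return any(term in field_terms for term in tokens(query))


def match_phrase(field_text: str, query: str) -> bool:
    """match_phrase: true only if the terms appear consecutively, in order."""
    field_terms, phrase = tokens(field_text), tokens(query)
    n = len(phrase)
    return n > 0 and any(
        field_terms[i:i + n] == phrase for i in range(len(field_terms) - n + 1)
    )


# Illustrative addresses, not taken from the sample dataset.
ADDRESSES = ["198 Mill Lane", "990 Mill Road", "171 Putnam Avenue", "4 Lane Mill Court"]

for address in ADDRESSES:
    print(address, match(address, "mill lane"), match_phrase(address, "mill lane"))
```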

bool query
A bool query composes smaller queries into a larger one using boolean logic.
Return documents whose address contains both the term mill and the term lane.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'

Return documents whose address contains either the term mill or the term lane.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'

Return documents whose address contains neither the term mill nor the term lane.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'

must, should, and must_not clauses can also be combined in a single bool query.
Return documents whose age is 40 and whose state is not ID.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
'
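When bool queries are assembled in code, a small builder keeps the nesting manageable. The Python sketch below only constructs the request body as a dictionary; the helper names match and bool_query are hypothetical and are not part of any client library.

```python
import json


def match(field: str, value) -> dict:
    """Build a single match clause."""
    return {"match": {field: value}}


def bool_query(must=(), should=(), must_not=(), filter_=()) -> dict:
    """Combine clauses into a bool query, omitting empty sections."""
    sections = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {
        "query": {
            "bool": {name: list(clauses) for name, clauses in sections.items() if clauses}
        }
    }


# Age is 40 and state is not ID.
body = bool_query(must=[match("age", 40)], must_not=[match("state", "ID")])
print(json.dumps(body, indent=2))
```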

Executing Filters
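Earlier sections showed a _score field in search results. The score is a numeric value that measures how well a document matches the query. Not every query needs it, though, in particular when the query is only used to filter the document set. Elasticsearch detects these cases and optimizes query execution so that no useless scores are computed. A filter clause restricts which documents match without changing how the score is computed; it does not mean that scoring is skipped for the rest of the query. For the difference between queries and filters, see: https://www.elastic.co/guide/cn/elasticsearch/guide/current/_queries_and_filters.html

The example below filters with a range query, which is generally used on numeric and date fields.
Return all documents whose balance is greater than or equal to 20000 and less than or equal to 30000.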

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
'
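Both bounds in this filter are inclusive, which is easy to get wrong at the edges. The Python sketch below builds the same request body and mirrors the condition as a local predicate; the function names range_filter_body and in_range are hypothetical, and the predicate is only a model of the filter, not how Elasticsearch evaluates it.

```python
def range_filter_body(field: str, gte, lte) -> dict:
    """Build a bool query that matches everything, filtered by an inclusive range."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "filter": {"range": {field: {"gte": gte, "lte": lte}}},
            }
        }
    }


def in_range(doc: dict, field: str, gte, lte) -> bool:
    """Local equivalent of the range filter: gte <= value <= lte."""
    return gte <= doc[field] <= lte


body = range_filter_body("balance", 20000, 30000)

# Hypothetical documents chosen to sit on and around the boundaries.
for balance in (19999, 20000, 30000, 30001):
    print(balance, in_range({"balance": balance}, "balance", 20000, 30000))
```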

Executing Aggregations
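Aggregations are similar to GROUP BY in SQL.
A single Elasticsearch request can return both search hits and aggregation results, which reduces network round trips.

Group by state and order the groups by document count, descending; by default only the top 10 groups are returned. size is 0 here because we only want the aggregation results, not the search hits.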

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}
'

state.keyword is explained later; the name group_by_state is arbitrary.
The query above is roughly equivalent to:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10

Result

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets": [
        {
          "key": "ID",
          "doc_count": 27
        },
        {
          "key": "TX",
          "doc_count": 27
        },
        {
          "key": "AL",
          "doc_count": 25
        },
        {
          "key": "MD",
          "doc_count": 25
        },
        {
          "key": "TN",
          "doc_count": 23
        },
        {
          "key": "MA",
          "doc_count": 21
        },
        {
          "key": "NC",
          "doc_count": 21
        },
        {
          "key": "ND",
          "doc_count": 21
        },
        {
          "key": "ME",
          "doc_count": 20
        },
        {
          "key": "MO",
          "doc_count": 20
        }
      ]
    }
  }
}
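The figure sum_other_doc_count in this response is the number of documents that fell outside the returned buckets, and it can be checked against the bucket counts. The Python sketch below does that check with values copied from the response above, then shows a local analogue of the grouping using collections.Counter on a few made-up state values (the sample list is hypothetical).

```python
from collections import Counter

# (state, doc_count) pairs copied from the response above.
TOP_BUCKETS = [
    ("ID", 27), ("TX", 27), ("AL", 25), ("MD", 25), ("TN", 23),
    ("MA", 21), ("NC", 21), ("ND", 21), ("ME", 20), ("MO", 20),
]
TOTAL_HITS = 1000

in_buckets = sum(count for _, count in TOP_BUCKETS)
others = TOTAL_HITS - in_buckets   # documents in states that were not returned
print(in_buckets, others)

# Local analogue of a terms aggregation: group, count, order by count descending.
sample_states = ["TX", "ID", "TX", "AL", "TX", "ID"]   # hypothetical data
top_two = Counter(sample_states).most_common(2)
print(top_two)
```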

Group by state and compute the average balance of each group.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
'

The query above is roughly equivalent to:

SELECT state, COUNT(*), AVG(balance) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10

Result

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets": [
        {
          "key": "ID",
          "doc_count": 27,
          "average_balance": {
            "value": 24368.777777777777
          }
        },
        {
          "key": "TX",
          "doc_count": 27,
          "average_balance": {
            "value": 27462.925925925927
          }
        },
        {
          "key": "AL",
          "doc_count": 25,
          "average_balance": {
            "value": 25739.56
          }
        },
        {
          "key": "MD",
          "doc_count": 25,
          "average_balance": {
            "value": 24963.52
          }
        },
        {
          "key": "TN",
          "doc_count": 23,
          "average_balance": {
            "value": 29796.782608695652
          }
        },
        {
          "key": "MA",
          "doc_count": 21,
          "average_balance": {
            "value": 29726.47619047619
          }
        },
        {
          "key": "NC",
          "doc_count": 21,
          "average_balance": {
            "value": 26785.428571428572
          }
        },
        {
          "key": "ND",
          "doc_count": 21,
          "average_balance": {
            "value": 26303.333333333332
          }
        },
        {
          "key": "ME",
          "doc_count": 20,
          "average_balance": {
            "value": 19575.05
          }
        },
        {
          "key": "MO",
          "doc_count": 20,
          "average_balance": {
            "value": 24151.8
          }
        }
      ]
    }
  }
}
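The buckets arrive ordered by document count, so finding the states with the highest or lowest average balance takes one more step on the client. A minimal Python sketch follows, using the values from the response above; the function name by_average is hypothetical.

```python
# (state, doc_count, average_balance) copied from the response above.
BUCKETS = [
    ("ID", 27, 24368.777777777777),
    ("TX", 27, 27462.925925925927),
    ("AL", 25, 25739.56),
    ("MD", 25, 24963.52),
    ("TN", 23, 29796.782608695652),
    ("MA", 21, 29726.47619047619),
    ("NC", 21, 26785.428571428572),
    ("ND", 21, 26303.333333333332),
    ("ME", 20, 19575.05),
    ("MO", 20, 24151.8),
]


def by_average(buckets, descending: bool = True) -> list:
    """Re-order buckets by their average balance instead of by doc_count."""
    return sorted(buckets, key=lambda bucket: bucket[2], reverse=descending)


ranked = by_average(BUCKETS)
print(ranked[0][0], ranked[-1][0])
```

The terms aggregation can also sort on a sub-aggregation server-side through its order option, which avoids re-sorting on the client.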

The next query groups by age bracket (20-29, 30-39, 40-49), then by gender, and computes the average balance of each group.

curl -X GET "localhost:9200/bank/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
'

The result looks like this:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_age": {
      "buckets": [
        {
          "key": "20.0-30.0",
          "from": 20,
          "to": 30,
          "doc_count": 451,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "M",
                "doc_count": 232,
                "average_balance": {
                  "value": 27374.05172413793
                }
              },
              {
                "key": "F",
                "doc_count": 219,
                "average_balance": {
                  "value": 25341.260273972603
                }
              }
            ]
          }
        },
        {
          "key": "30.0-40.0",
          "from": 30,
          "to": 40,
          "doc_count": 504,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "F",
                "doc_count": 253,
                "average_balance": {
                  "value": 25670.869565217392
                }
              },
              {
                "key": "M",
                "doc_count": 251,
                "average_balance": {
                  "value": 24288.239043824702
                }
              }
            ]
          }
        },
        {
          "key": "40.0-50.0",
          "from": 40,
          "to": 50,
          "doc_count": 45,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "M",
                "doc_count": 24,
                "average_balance": {
                  "value": 26474.958333333332
                }
              },
              {
                "key": "F",
                "doc_count": 21,
                "average_balance": {
                  "value": 27992.571428571428
                }
              }
            ]
          }
        }
      ]
    }
  }
}
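Nested buckets can be flattened into rows that resemble the output of a SQL GROUP BY on two columns. The Python sketch below walks a trimmed copy of the aggregations section shown above; the function name flatten is hypothetical.

```python
# Trimmed copy of the "aggregations" section of the response above.
AGGS = {"group_by_age": {"buckets": [
    {"key": "20.0-30.0", "doc_count": 451, "group_by_gender": {"buckets": [
        {"key": "M", "doc_count": 232, "average_balance": {"value": 27374.05172413793}},
        {"key": "F", "doc_count": 219, "average_balance": {"value": 25341.260273972603}},
    ]}},
    {"key": "30.0-40.0", "doc_count": 504, "group_by_gender": {"buckets": [
        {"key": "F", "doc_count": 253, "average_balance": {"value": 25670.869565217392}},
        {"key": "M", "doc_count": 251, "average_balance": {"value": 24288.239043824702}},
    ]}},
    {"key": "40.0-50.0", "doc_count": 45, "group_by_gender": {"buckets": [
        {"key": "M", "doc_count": 24, "average_balance": {"value": 26474.958333333332}},
        {"key": "F", "doc_count": 21, "average_balance": {"value": 27992.571428571428}},
    ]}},
]}}


def flatten(aggs: dict) -> list:
    """Turn age -> gender buckets into (age_range, gender, count, avg) rows."""
    rows = []
    for age_bucket in aggs["group_by_age"]["buckets"]:
        for gender_bucket in age_bucket["group_by_gender"]["buckets"]:
            rows.append((
                age_bucket["key"],
                gender_bucket["key"],
                gender_bucket["doc_count"],
                gender_bucket["average_balance"]["value"],
            ))
    return rows


rows = flatten(AGGS)
for row in rows:
    print(row)
```

In this response the gender counts inside each age bracket add up to the bracket's own doc_count, and the three brackets together cover all 1000 documents.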

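This is not quite the same as listing several fields after GROUP BY in SQL: the result is a nested hierarchy of buckets rather than flat rows. More aggregations are covered in the aggregations reference guide.

That concludes this introduction to basic Elasticsearch operations: what it is for and how to use it. Later we will dig deeper into Elasticsearch, its internals, and how to use it well.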
