ES nodes offline for too long

Author: MasonChan | Published: 2020-10-31 02:36

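When most of the nodes in an ES cluster stay offline for too long, you can end up in a painful situation: the nodes come back, but the data does not.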

This is where the /_cluster/reroute API comes in.

reroute

REF: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/cluster-reroute.html

Checking why allocation failed

GET /_cluster/allocation/explain
{

}

curl -XGET "http://localhost:9200/_cluster/allocation/explain?pretty"
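Without a request body, the explain API reports on an unassigned shard that Elasticsearch picks itself. To ask about one specific shard, the body carries `index`, `shard` and `primary` together. The following is a minimal sketch of building such a request; the function name `build_explain_request` and the host default are assumptions made for illustration, not part of the original post.

```python
import json
from typing import Optional, Tuple


def build_explain_request(
    host: str = "http://localhost:9200",
    index: Optional[str] = None,
    shard: Optional[int] = None,
    primary: Optional[bool] = None,
) -> Tuple[str, Optional[bytes]]:
    """Return (url, body) for GET /_cluster/allocation/explain.

    With no arguments the body is None, which lets Elasticsearch explain
    an unassigned shard of its own choosing. Otherwise all three of
    index, shard and primary have to be supplied together.
    """
    url = host.rstrip("/") + "/_cluster/allocation/explain"
    given = [value is not None for value in (index, shard, primary)]
    if not any(given):
        return url, None
    if not all(given):
        raise ValueError("index, shard and primary must be given together")
    body = {"index": index, "shard": shard, "primary": primary}
    return url, json.dumps(body).encode("utf-8")
```

For example, `build_explain_request(index="test", shard=39, primary=False)` targets the replica of shard 39 of the `test` index used later in this post.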

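When a network fault, a long GC pause, or a similar failure keeps both the primary and the replica copies of a shard offline for too long, and the allocation attempts for those copies exceed the limit of 5 (the default of index.allocation.max_retries), the shard cannot recover on its own and needs a manual forced reroute.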
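Whether a shard has hit this retry limit can be read from the allocation explain response. The sketch below assumes the response carries `current_state` and `unassigned_info.failed_allocation_attempts`; the helper name `retries_exhausted` and the sample dictionary are illustrative only and were not captured from a real cluster.

```python
from typing import Any, Dict

# Default value of the index.allocation.max_retries index setting.
DEFAULT_MAX_RETRIES = 5


def retries_exhausted(
    explain: Dict[str, Any], max_retries: int = DEFAULT_MAX_RETRIES
) -> bool:
    """Return True if the explained shard is unassigned and has used up
    its automatic allocation retries.

    Allocation is blocked once the number of failed attempts reaches
    max_retries, so the comparison is >= rather than >.
    """
    info = explain.get("unassigned_info") or {}
    attempts = int(info.get("failed_allocation_attempts", 0))
    return explain.get("current_state") == "unassigned" and attempts >= max_retries


# Illustrative response fragment (hypothetical values).
sample = {
    "index": "test",
    "shard": 39,
    "primary": False,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "failed_allocation_attempts": 5,
    },
}
needs_manual_reroute = retries_exhausted(sample)
```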

There are four main scenarios:

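The primary and replica copies were offline for too long simply because nodes went down: once the nodes are back, one manual forced reroute is enough, and no data reconciliation is needed.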

POST /_cluster/reroute?retry_failed=true
{

}

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'
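The same call can be scripted. This is a minimal sketch using only the Python standard library; the function names `reroute_url` and `post_reroute` are assumptions made for illustration. It also exposes the reroute API's `dry_run` query parameter, which asks Elasticsearch to compute the result without applying it.

```python
import json
import urllib.request
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def reroute_url(
    host: str = "http://localhost:9200",
    retry_failed: bool = True,
    dry_run: bool = False,
) -> str:
    """Build the /_cluster/reroute URL with the requested query flags."""
    params: Dict[str, str] = {}
    if retry_failed:
        params["retry_failed"] = "true"
    if dry_run:
        params["dry_run"] = "true"
    url = host.rstrip("/") + "/_cluster/reroute"
    return url + ("?" + urlencode(params) if params else "")


def post_reroute(
    url: str, body: Optional[Dict[str, Any]] = None, timeout: float = 30.0
) -> Dict[str, Any]:
    """POST to the reroute endpoint and return the parsed JSON response.

    Not called here: it needs a reachable cluster.
    """
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, method="POST", headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `post_reroute(reroute_url(dry_run=True))` first is one way to preview the effect before running it for real.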

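The primary has been lost, and the replicas have been offline for too long but their data is intact: bring any one of the replicas back online; no reconciliation is needed. Note that Elasticsearch only accepts allocate_replica while the shard's primary is active; if no primary is assigned, the command is rejected and the stale-primary command from the next scenario applies instead.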

POST /_cluster/reroute?retry_failed=true
{
    "commands" : [
        {
          "allocate_replica" : {
                "index" : "test",
                "shard" : 39,
                "node" : "host1"
          }
        }
    ]
}
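Because `allocate_replica` is only accepted when the shard's primary is already active, it can help to check that first. The sketch below assumes rows in the shape returned by `GET /_cat/shards?format=json` (keys `index`, `shard`, `prirep`, `state`); the helper names and the sample rows are hypothetical.

```python
from typing import Any, Dict, List


def primary_is_started(
    cat_rows: List[Dict[str, Any]], index: str, shard: int
) -> bool:
    """Return True if the primary of index/shard is in state STARTED.

    The _cat API reports the shard number as a string, so both sides
    are compared as strings.
    """
    for row in cat_rows:
        if (
            row.get("index") == index
            and str(row.get("shard")) == str(shard)
            and row.get("prirep") == "p"
        ):
            return row.get("state") == "STARTED"
    return False


def allocate_replica_body(index: str, shard: int, node: str) -> Dict[str, Any]:
    """Build the reroute body that places an unassigned replica on node."""
    return {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ]
    }


# Illustrative rows (hypothetical values, not taken from a real cluster).
rows = [
    {"index": "test", "shard": "39", "prirep": "p", "state": "STARTED", "node": "host2"},
    {"index": "test", "shard": "39", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
body = allocate_replica_body("test", 39, "host1") if primary_is_started(rows, "test", 39) else None
```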

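The primary has been lost and only one replica survives, whose data may be incomplete: to restore service as quickly as possible (accepting that some data may be lost), promote that replica to primary and reconcile the data later.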

POST /_cluster/reroute?retry_failed=true
{
    "commands" : [
        {
          "allocate_stale_primary" : {
                "index" : "test",
                "shard" : 8,
                "node" : "WI1231AOQNmHdgo6TH-dZQ",
                "accept_data_loss" : true
          }
        }
    ]
}
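To find out which node still holds a copy of the shard, the index shard stores API (`GET /test/_shard_stores`) can be queried. The sketch below assumes the response shape as I recall it from the 7.x documentation, where each store entry has a node-ID key next to `allocation_id` and `allocation`; verify it against your cluster's actual output. The helper names and the sample response are hypothetical.

```python
from typing import Any, Dict, List


def nodes_with_copies(
    shard_stores: Dict[str, Any], index: str, shard: int
) -> List[Dict[str, str]]:
    """List nodes holding an on-disk copy of index/shard that opened cleanly."""
    stores = (
        shard_stores.get("indices", {})
        .get(index, {})
        .get("shards", {})
        .get(str(shard), {})
        .get("stores", [])
    )
    found: List[Dict[str, str]] = []
    for store in stores:
        if "store_exception" in store:
            continue  # this copy could not be opened, so skip it
        for key, value in store.items():
            # The node entry is the nested object that carries a "name".
            if isinstance(value, dict) and "name" in value:
                found.append(
                    {
                        "node_id": key,
                        "name": value["name"],
                        "allocation": store.get("allocation", ""),
                    }
                )
    return found


def allocate_stale_primary_body(index: str, shard: int, node: str) -> Dict[str, Any]:
    """Build the reroute body that promotes the stale copy on node to primary."""
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Illustrative response fragment (hypothetical values).
sample_stores = {
    "indices": {
        "test": {
            "shards": {
                "8": {
                    "stores": [
                        {
                            "WI1231AOQNmHdgo6TH-dZQ": {"name": "host1"},
                            "allocation_id": "example-allocation-id",
                            "allocation": "unused",
                        }
                    ]
                }
            }
        }
    }
}
candidates = nodes_with_copies(sample_stores, "test", 8)
```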

Neither the primary nor any replica exists: create an empty shard so that writes can continue normally, and reconcile the data later.

POST /_cluster/reroute?retry_failed=true
{
    "commands" : [
        {
          "allocate_empty_primary" : {
                "index" : "test",
                "shard" : 8,
                "node" : "WI1231AOQNmHdgo6TH-dZQ",
                "accept_data_loss" : true
          }
        }
    ]
}
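Since `allocate_empty_primary` discards whatever data the shard had, including a copy on a node that rejoins the cluster later, a script issuing it may want an explicit safeguard. The sketch below shows one possible guard: the caller must retype the target as `index/shard` before the body is built. The function name and the confirmation convention are assumptions for illustration, not an Elasticsearch feature.

```python
from typing import Any, Dict


def allocate_empty_primary_body(
    index: str, shard: int, node: str, confirm: str
) -> Dict[str, Any]:
    """Build the reroute body for an empty primary, guarded by a confirmation.

    confirm must equal "<index>/<shard>"; anything else raises ValueError
    so the destructive command is never built by accident.
    """
    expected = f"{index}/{shard}"
    if confirm != expected:
        raise ValueError(f"confirmation must be {expected!r}, got {confirm!r}")
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


empty_body = allocate_empty_primary_body(
    "test", 8, "WI1231AOQNmHdgo6TH-dZQ", confirm="test/8"
)
```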