Debugging unassigned shards in Elasticsearch

Author: YG_9013 | Published 2017-09-05 22:47

This post is a supplement to http://www.jianshu.com/p/443cf6ce87d5.

Check the cluster status

 curl -XGET 'http://unknow.com/_cat/health?v&pretty'

The cluster status is red, and there are unassigned shards.

Find out which shards are unassigned

curl -XGET 'http://unknow.com/_cat/shards?v&pretty' | grep UNASSIGNED
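
When the output needs to be processed further rather than just read, the grep step can be sketched in Python. This is a minimal sketch that assumes the usual `_cat/shards?v` column order (index, shard, prirep, state, ...); the sample text is hypothetical and not taken from the cluster in this post.

```python
def unassigned_shards(cat_shards_text):
    """Return (index, shard, prirep) for every UNASSIGNED row of _cat/shards output."""
    rows = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # Assumed columns: index shard prirep state [docs store ip node].
        # The trailing columns are empty for unassigned shards, so only the
        # first four are read. The header row is skipped because its fourth
        # field is "state", not "UNASSIGNED".
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            rows.append((fields[0], int(fields[1]), fields[2]))
    return rows


# Hypothetical sample output, for illustration only.
SAMPLE = """\
index    shard prirep state      docs  store ip        node
myindex  0     p      UNASSIGNED
myindex  0     r      UNASSIGNED
myindex  1     p      STARTED    1200  100gb 10.0.0.1  node-1
"""

if __name__ == "__main__":
    for index, shard, prirep in unassigned_shards(SAMPLE):
        kind = "primary" if prirep == "p" else "replica"
        print(f"{index} shard {shard} ({kind}) is unassigned")
```

The (index, shard, primary) triples are exactly what the allocation explain request needs as input.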

Why are there unassigned shards?

The command that asks ES why a shard is unassigned:

curl 'noahes.isec.oa.com/_cluster/allocation/explain?pretty' -d '{"index":"index-name","shard":0,"primary":true}'

The explanation gives the cause: [no space left on device].

Check storage usage on each node

curl -XGET 'http://unknow.com/_cat/allocation?v&pretty'

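This shows disk usage per node. Oddly, no node's storage was full; the highest was below 70%. So why the error?
Checking the individual disks on each node later showed that one disk had reached 87% usage, above the ES default limit (the low watermark defaults to 85%). During a merge, usage on that disk went over the limit, which is what left the shard unassigned.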
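
The gap between the node-level figure and the per-disk figure can be illustrated with a small calculation. This is a sketch under the assumption that the node-level percentage is an aggregate over all of the node's data disks; the disk paths and sizes below are hypothetical, chosen so that one disk sits at 87% as in this incident.

```python
def usage_percent(total, free):
    """Percentage of space used, given total and free space in the same unit."""
    return 100.0 * (total - free) / total


def node_and_disk_usage(disks):
    """disks maps a data path to (total, free).

    Returns (node_level_percent, {path: per_disk_percent}).
    """
    total = sum(t for t, _ in disks.values())
    free = sum(f for _, f in disks.values())
    per_disk = {path: usage_percent(t, f) for path, (t, f) in disks.items()}
    return usage_percent(total, free), per_disk


# Hypothetical node with four 250 GB data disks (values in GB).
DISKS = {
    "/data1": (250, 32.5),   # the one hot disk
    "/data2": (250, 100),
    "/data3": (250, 100),
    "/data4": (250, 100),
}

if __name__ == "__main__":
    node_percent, per_disk = node_and_disk_usage(DISKS)
    print(f"node level: {node_percent:.2f}%")
    for path, percent in per_disk.items():
        print(f"{path}: {percent:.2f}%")
```

With these made-up numbers the node as a whole stays under 70% while `/data1` is at 87%, which is why a per-node view alone can hide the problem.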

Why did storage exceed the disk limit? Doesn't ES rebalance automatically?

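The reason is that the index was configured with too few shards. The shard listing shows that a single shard is over 100 GB while the disk is only 250 GB. If two such shards sit on the same disk, segment merging is bound to run into trouble. When the data volume was small, ES placed two shards on this disk; as the data grew, the shards grew with it until the problem appeared.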
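
The arithmetic behind this can be written out. A merge writes the new segment before the old ones are deleted, so it temporarily needs extra disk space on top of the shard's steady-state size. The sketch below uses the round figures from the text (two 100 GB shards on a 250 GB disk); the `merge_extra_gb` values are hypothetical, since the post does not say how large the merges were.

```python
def peak_usage_percent(shard_sizes_gb, disk_gb, merge_extra_gb):
    """Disk usage in percent while a merge temporarily holds merge_extra_gb extra."""
    return 100.0 * (sum(shard_sizes_gb) + merge_extra_gb) / disk_gb


def headroom_gb(shard_sizes_gb, disk_gb, limit_percent):
    """Extra space a merge may use before the disk crosses limit_percent."""
    return disk_gb * limit_percent / 100.0 - sum(shard_sizes_gb)


if __name__ == "__main__":
    shards = [100, 100]   # two shards of about 100 GB each
    disk = 250            # disk size in GB
    print(peak_usage_percent(shards, disk, 0))    # steady state
    print(peak_usage_percent(shards, disk, 25))   # hypothetical 25 GB merge
    print(headroom_gb(shards, disk, 90))          # room left below 90%
```

Under these assumptions the disk is already at 80% at rest, and only about 25 GB of temporary merge output separates it from the 90% mark, which is small compared with shards of this size.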

Solution

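Use the reroute API provided by ES to allocate the shard manually. Note that allocate_stale_primary forces the primary to be allocated from a stale copy on the given node and can lose data, which is why it requires accept_data_loss.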

curl -XPOST 'http://unknow.com/_cluster/reroute?retry_failed=true&pretty' -d '
{
  "commands" : [ {
    "allocate_stale_primary" : {
        "index" : "myindex",
        "shard" :0,
        "node" : "node-ip",
        "accept_data_loss" : true
    }
  }]
}'

To have ES retry the failed allocations on its own, run:

  curl -XPOST 'http://unknow.com/_cluster/reroute?retry_failed'

How to keep the problem from recurring?

First move the large shards to nodes with more free space, then increase the number of shards.
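
Moving a shard to a roomier node goes through the same reroute API, using its `move` command. The sketch below only builds the request body; it does not contact a cluster. The index and node names are hypothetical placeholders.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build one reroute 'move' command for a started shard."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def reroute_body(commands):
    """Wrap a list of commands in the body expected by POST /_cluster/reroute."""
    return json.dumps({"commands": commands}, indent=2)


if __name__ == "__main__":
    # "myindex", "node-a" and "node-b" are placeholders, not real names.
    print(reroute_body([move_command("myindex", 0, "node-a", "node-b")]))
```

The printed JSON can be passed to curl with `-d` in the same way as the allocate_stale_primary request earlier in this post.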

Why did the retry_failed command leave the unassigned shard unassigned when run in the morning, while the same command got it assigned in the afternoon?

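The ES documentation on Cluster Level Shard Allocation and Disk-Based Shard Allocation is worth reading for the details. Disk-Based Shard Allocation mentions the setting cluster.routing.allocation.disk.include_relocations: when it is true (the default), ES counts shards that are currently relocating to a node when it computes that node's disk usage. If disk usage exceeds the threshold configured in cluster.routing.allocation.disk.watermark.high, ES relocates shards away from the node automatically. In the morning the cluster had not yet moved, or was still moving, the shards that pushed the disk over the limit, so retry_failed failed. By the afternoon those shards had been relocated, so retry_failed succeeded.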

cluster.routing.allocation.disk.watermark.high
Controls the high watermark. It defaults to 90%, meaning ES will attempt to relocate shards to another node if the node disk usage rises above 90%. It can also be set to an absolute byte value (similar to the low watermark) to relocate shards once less than the configured amount of space is available on the node.
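
The two forms of the setting described above, a percentage or an absolute byte value, can be sketched as a check like the following. This is a simplified illustration, not ES's actual implementation: it handles only values such as `90%` or `25gb`, and the unit table assumes binary multiples.

```python
import re

# Assumed unit multipliers (binary multiples), for illustration only.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3,
          "tb": 1024**4, "pb": 1024**5}


def exceeds_high_watermark(watermark, total_bytes, free_bytes):
    """Return True if a disk is past the given high watermark.

    A percentage is compared against used space; an absolute byte value is
    the minimum amount of free space that must remain.
    """
    value = watermark.strip().lower()
    if value.endswith("%"):
        used_percent = 100.0 * (total_bytes - free_bytes) / total_bytes
        return used_percent > float(value[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(b|kb|mb|gb|tb|pb)", value)
    if match is None:
        raise ValueError(f"unsupported watermark value: {watermark!r}")
    min_free = float(match.group(1)) * _UNITS[match.group(2)]
    return free_bytes < min_free


if __name__ == "__main__":
    GB = 1024**3
    # Hypothetical 250 GB disk with 20 GB free: 92% used.
    print(exceeds_high_watermark("90%", 250 * GB, 20 * GB))
    print(exceeds_high_watermark("25gb", 250 * GB, 20 * GB))
```

Note the inverted sense of the two forms: a higher percentage is more permissive, while a higher byte value is stricter because it demands more free space.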

Article link: https://www.haomeiwen.com/subject/jwbpjxtx.html
