Getting Started with Draino

Author: 王勇1024 | Published: 2022-07-15 18:36


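1. Introduction

Draino automatically drains Kubernetes nodes based on labels and node conditions. A node that matches all of the specified labels and any one of the specified node conditions is cordoned immediately, and its pods are drained once the drain-buffer time has elapsed.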
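The matching rule (all labels, any condition) can be sketched in a few lines of Python. This is an illustrative model only, not Draino's actual Go implementation; the `Node` class and the `is_eligible` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Kubernetes node."""
    name: str
    labels: dict = field(default_factory=dict)
    # Condition type -> status string, e.g. {"ec2-host-retirement": "True"}
    conditions: dict = field(default_factory=dict)


def is_eligible(node, required_labels, watched_conditions):
    """Return True if the node has ALL required labels and ANY watched condition is True."""
    has_all_labels = all(
        node.labels.get(key) == value for key, value in required_labels.items()
    )
    has_any_condition = any(
        node.conditions.get(cond) == "True" for cond in watched_conditions
    )
    return has_all_labels and has_any_condition


node = Node(
    name="node-demo",
    labels={"foo": "bar"},
    conditions={"Ready": "True", "ec2-host-retirement": "True"},
)
# Label matches and one of the two watched conditions is set.
print(is_eligible(node, {"foo": "bar"}, ["ec2-host-retirement", "KernelDeadlock"]))
# Label value differs, so the node is not eligible.
print(is_eligible(node, {"foo": "baz"}, ["ec2-host-retirement"]))
```

The first call prints True and the second prints False, because a single mismatched label is enough to exclude a node even when a watched condition is set.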

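Draino is typically used together with the Node Problem Detector (NPD) and the Cluster Autoscaler. NPD detects node health by monitoring node logs or running a script; when it detects a problem on a node, it sets a node condition on that node. The Cluster Autoscaler can be configured to delete underutilized nodes. Combined with Draino, the two enable automatic remediation in some scenarios:

NPD detects a permanent problem on a node and sets the corresponding node condition on it.
Draino notices the node condition and immediately cordons the node, so that no new pods are scheduled onto the faulty node, then schedules a drain of the node.
Once the faulty node has been drained, the Cluster Autoscaler considers it underutilized and, after waiting for a while, scales it down.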

2. Usage

Startup command

$ docker run planetlabs/draino /draino --help
usage: draino [<flags>] <node-conditions>...

Automatically cordons and drains nodes that match the supplied conditions.

Flags:
      --help                     Show context-sensitive help (also try --help-long and --help-man).
  -d, --debug                    Run with debug logging.
      --listen=":10002"          Address at which to expose /metrics and /healthz.
      --kubeconfig=KUBECONFIG    Path to kubeconfig file. Leave unset to use in-cluster config.
      --master=MASTER            Address of Kubernetes API server. Leave unset to use in-cluster config.
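      --dry-run                  Emit an event without cordoning or draining matching nodes.
      --max-grace-period=8m0s    Maximum time evicted pods will be given to terminate gracefully.
      --eviction-headroom=30s    Additional time to wait after a pod's termination grace period for it to have been deleted.
      --drain-buffer=10m0s       Minimum time between starting each drain. Nodes are always cordoned immediately.
      --node-label="foo=bar"     (Deprecated) Only nodes with this label will be eligible for cordoning and draining. May be specified multiple times.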
      --node-label-expr="metadata.labels.foo == 'bar'"
                                 This is an expr string https://github.com/antonmedv/expr that must return true or false. See `nodefilters_test.go` for examples
      --namespace="kube-system"  Namespace used to create leader election lock object.
      --leader-election-lease-duration=15s
                                 Lease duration for leader election.
      --leader-election-renew-deadline=10s
                                 Leader election renew deadline.
      --leader-election-retry-period=2s
                                 Leader election retry period.
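      --skip-drain               Whether to skip draining nodes after cordoning.
      --evict-daemonset-pods     Evict pods that were created by an extant DaemonSet.
      --evict-emptydir-pods      Evict pods with local storage, i.e. with emptyDir volumes.
      --evict-unreplicated-pods  Evict pods that were not created by a replication controller.
      --protected-pod-annotation=KEY[=VALUE] ...
                                 Protect pods with this annotation from eviction. May be specified multiple times.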

Args:
  <node-conditions>  Nodes for which any of these conditions are true will be cordoned and drained.

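Labels and label expressions

Draino can filter the set of eligible nodes with the --node-label and --node-label-expr flags. --node-label can only AND together the labels it is given. To express more complex matching rules, the newer --node-label-expr flag supports mixing OR, AND, and NOT logic. See https://github.com/antonmedv/expr for details.

--node-label-expr example: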

(metadata.labels.region == 'us-west-1' && metadata.labels.app == 'nginx') || (metadata.labels.region == 'us-west-2' && metadata.labels.app == 'nginx')
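To make the boolean logic of this expression concrete, here is the same predicate written as plain Python. It only illustrates the logic: Draino evaluates the expression with the expr library, not Python, and `matches_expr` is a hypothetical helper that takes the node's labels as a dict.

```python
def matches_expr(labels):
    """Python equivalent of the example --node-label-expr above."""
    region = labels.get("region")
    app = labels.get("app")
    return (region == "us-west-1" and app == "nginx") or (
        region == "us-west-2" and app == "nginx"
    )


print(matches_expr({"region": "us-west-1", "app": "nginx"}))     # True
print(matches_expr({"region": "us-west-2", "app": "redis"}))     # False
print(matches_expr({"region": "eu-central-1", "app": "nginx"}))  # False
```

Because both branches require app to be nginx, the expression amounts to "app is nginx and region is either us-west-1 or us-west-2", which is something --node-label alone cannot express.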

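3. Considerations

Keep the following points in mind before deploying Draino:

Run Draino in --dry-run mode first to verify that it would drain the right nodes. In dry-run mode Draino only emits logs, metrics, and events; it does not actually cordon or drain nodes.
Draino immediately cordons any node that matches its configured labels and node conditions, but after starting a drain it waits for a period (set with --drain-buffer, 10 minutes by default) before draining the next node. That is, if two nodes trigger a node condition at the same time, one is drained immediately and the other is drained 10 minutes later.
Draino considers a drain to have failed if any pod eviction triggered by that drain fails. For example, if 2 of 5 pod evictions fail, Draino considers the drain failed, but it still evicts the other 3 pods.
Pods that cannot be evicted by the cluster-autoscaler are not evicted by Draino either.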
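The drain-buffer spacing and the failure rule described above can be modelled roughly as follows. This is a simplified sketch, not Draino's scheduler: `drain_schedule` and `drain_failed` are hypothetical helpers, and the order in which Draino picks simultaneously matching nodes is an assumption here.

```python
from datetime import datetime, timedelta


def drain_schedule(node_names, first_drain_at, drain_buffer=timedelta(minutes=10)):
    """Space drain start times out by drain_buffer, in the order given (assumed order)."""
    return {
        name: first_drain_at + index * drain_buffer
        for index, name in enumerate(node_names)
    }


def drain_failed(eviction_results):
    """A drain counts as failed if at least one eviction failed."""
    return not all(eviction_results)


start = datetime(2020, 3, 20, 15, 0, 0)
for name, when in drain_schedule(["node-a", "node-b"], start).items():
    print(name, when.isoformat())

# Five pods, two evictions fail: the drain is reported as failed,
# even though the other three pods were evicted.
print(drain_failed([True, True, True, False, False]))
```

With the default 10-minute buffer, node-a is drained at 15:00:00 and node-b at 15:10:00, and the last line prints True.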

4. Deployment

Draino is built automatically from the master branch and pushed to Docker Hub. Images are tagged planetlabs/draino:$(git rev-parse --short HEAD).

Draino can be deployed with the example Kubernetes deployment manifest.

5. Monitoring

Metrics

Draino exposes a simple health check endpoint at /healthz and a Prometheus metrics endpoint at /metrics. The following metrics are reported:

$ kubectl -n kube-system exec -it ${DRAINO_POD} -- apk add curl
$ kubectl -n kube-system exec -it ${DRAINO_POD} -- curl http://localhost:10002/metrics
# HELP draino_cordoned_nodes_total Number of nodes cordoned.
# TYPE draino_cordoned_nodes_total counter
draino_cordoned_nodes_total{result="succeeded"} 2
draino_cordoned_nodes_total{result="failed"} 1
# HELP draino_drained_nodes_total Number of nodes drained.
# TYPE draino_drained_nodes_total counter
draino_drained_nodes_total{result="succeeded"} 1
draino_drained_nodes_total{result="failed"} 1
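As a rough illustration of how these counters might be consumed, the sketch below parses the sample output above and computes the share of failed drains. The helper names `parse_counters` and `failure_ratio` are hypothetical, and the regular expression only handles the single-label lines shown here, not the full Prometheus text format.

```python
import re

SAMPLE = """\
# HELP draino_cordoned_nodes_total Number of nodes cordoned.
# TYPE draino_cordoned_nodes_total counter
draino_cordoned_nodes_total{result="succeeded"} 2
draino_cordoned_nodes_total{result="failed"} 1
# HELP draino_drained_nodes_total Number of nodes drained.
# TYPE draino_drained_nodes_total counter
draino_drained_nodes_total{result="succeeded"} 1
draino_drained_nodes_total{result="failed"} 1
"""

# Matches lines such as: draino_drained_nodes_total{result="failed"} 1
LINE = re.compile(r'^(\w+)\{result="(\w+)"\}\s+(\d+(?:\.\d+)?)$')


def parse_counters(text):
    """Return {(metric_name, result): value} for the result-labelled counters."""
    counters = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, result, value = match.groups()
            counters[(name, result)] = float(value)
    return counters


def failure_ratio(counters, metric):
    """Fraction of operations for `metric` that failed (0.0 if none recorded)."""
    failed = counters.get((metric, "failed"), 0.0)
    succeeded = counters.get((metric, "succeeded"), 0.0)
    total = failed + succeeded
    return failed / total if total else 0.0


counters = parse_counters(SAMPLE)
print(failure_ratio(counters, "draino_drained_nodes_total"))
```

For the sample above this prints 0.5, since one of the two recorded drains failed. In a real setup the same ratio would more naturally be computed with a Prometheus query than by parsing text by hand.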

Events

Draino generates an event at each key step of the drain process. The example below ends with DrainFailed. When every step succeeds, the final event is DrainSucceeded.

> kubectl get events -n default | grep -E '(^LAST|draino)'

LAST SEEN   FIRST SEEN   COUNT   NAME                                               KIND TYPE      REASON             SOURCE MESSAGE
5m          5m           1       node-demo.15fe0c35f0b4bd10    Node Warning   CordonStarting     draino Cordoning node
5m          5m           1       node-demo.15fe0c35fe3386d8    Node Warning   CordonSucceeded    draino Cordoned node
5m          5m           1       node-demo.15fe0c360bd516f8    Node Warning   DrainScheduled     draino Will drain node after 2020-03-20T16:19:14.91905+01:00
5m          5m           1       node-demo.15fe0c3852986fe8    Node Warning   DrainStarting      draino Draining node
4m          4m           1       node-demo.15fe0c48d010ecb0    Node Warning   DrainFailed        draino Draining failed: timed out waiting for evictions to complete: timed out

Conditions

When a drain is scheduled, Draino adds a condition of type DrainScheduled to the status of the target node. This condition records when the drain starts and when it ends.

> kubectl describe node {node-name}
......
Unschedulable:      true
Conditions:
  Type                  Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                  ------  -----------------                 ------------------                ------                       -------
  OutOfDisk             False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure        False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure          False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure           False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                 True    Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:02:09 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
  ec2-host-retirement   True    Fri, 20 Mar 2020 15:23:26 +0100   Fri, 20 Mar 2020 15:23:26 +0100   NodeProblemDetector          Condition added with tooling
  DrainScheduled        True    Fri, 20 Mar 2020 15:50:50 +0100   Fri, 20 Mar 2020 15:23:26 +0100   Draino                       Drain activity scheduled 2020-03-20T15:50:34+01:00

Later, once the drain has finished, Draino appends the result to the condition so you can tell whether it succeeded or failed:

> kubectl describe node {node-name}
......
Unschedulable:      true
Conditions:
  Type                  Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                  ------  -----------------                 ------------------                ------                       -------
  OutOfDisk             False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure        False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure          False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure           False   Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:01:59 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                 True    Fri, 20 Mar 2020 15:52:41 +0100   Fri, 20 Mar 2020 14:02:09 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
  ec2-host-retirement   True    Fri, 20 Mar 2020 15:23:26 +0100   Fri, 20 Mar 2020 15:23:26 +0100   NodeProblemDetector          Condition added with tooling
  DrainScheduled        True    Fri, 20 Mar 2020 15:50:50 +0100   Fri, 20 Mar 2020 15:23:26 +0100   Draino                       Drain activity scheduled 2020-03-20T15:50:34+01:00 | Completed: 2020-03-20T15:50:50+01:00
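If you want to read these timestamps programmatically, a minimal sketch might look like this. It assumes the message layout is exactly the one shown in the sample output above; the format used for a failed drain is not shown in this article, so `parse_drain_message` (a hypothetical helper) simply returns None for the completion time in any other case.

```python
from datetime import datetime

MESSAGE = (
    "Drain activity scheduled 2020-03-20T15:50:34+01:00"
    " | Completed: 2020-03-20T15:50:50+01:00"
)


def parse_drain_message(message):
    """Split the condition message into scheduled time and optional completion time."""
    head, _, tail = message.partition(" | ")
    # The scheduled timestamp is the last space-separated token of the first part.
    scheduled = datetime.fromisoformat(head.rsplit(" ", 1)[1])
    completed = None
    if tail.startswith("Completed: "):
        completed = datetime.fromisoformat(tail[len("Completed: "):])
    return scheduled, completed


scheduled, completed = parse_drain_message(MESSAGE)
print((completed - scheduled).total_seconds())
```

For the sample message this prints 16.0, the number of seconds between the scheduled time and the recorded completion.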

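6. Retrying a drain

A drain sometimes fails because of a restrictive Pod Disruption Budget or some other cause outside Draino. The target node then stays cordoned and the drain condition is marked Failed. To retry the drain on that node, add the annotation draino/drain-retry: true to it, and Draino will attempt the drain again.

Note: if the retry fails, the draino/drain-retry: true annotation on the node is neither modified nor removed, so the drain will be retried again.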

kubectl annotate node {node-name} draino/drain-retry=true

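7. Modes

Dry Run: in this mode, when Draino matches a faulty node it only emits events and does not cordon or drain the node. Enable it with the --dry-run flag.

Cordon Only: in this mode, when Draino matches a faulty node it only cordons the node and does not drain its pods. Enable it with the --skip-drain flag.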
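The effect of the two flags can be summarised as a small decision function. This is a sketch of the behaviour described above, not Draino's code; `planned_actions` is a hypothetical name, and treating --dry-run as taking precedence when both flags are set is an assumption.

```python
def planned_actions(dry_run=False, skip_drain=False):
    """Return the actions taken for a matching node under the given flags."""
    actions = ["emit events"]
    if dry_run:
        # Dry Run: report only, never touch the node (assumed to win over skip_drain).
        return actions
    actions.append("cordon")
    if not skip_drain:
        # Default mode: cordon, then drain.
        actions.append("drain")
    return actions


print(planned_actions())                  # ['emit events', 'cordon', 'drain']
print(planned_actions(skip_drain=True))   # ['emit events', 'cordon']
print(planned_actions(dry_run=True))      # ['emit events']
```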
