OpenShift 4 Disaster Recovery - Multi-master cluster with one ma


Author: 陈光辉_6c9f | Published: 2020-04-21 21:22

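    Failure scenario

    • In a disconnected (offline) OpenShift 4 cluster with multiple masters, one master node has failed (the machine is unavailable).
    • The cluster remains usable in this state.
      • To return the cluster to full high availability, remove the failed node and then add a master node back.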
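
The reason the cluster stays usable yet is no longer fully highly available comes down to etcd's majority rule. The following is a minimal sketch of that arithmetic; the function names `quorum` and `failures_tolerated` are illustrative and not part of any OpenShift tooling.

```shell
#!/bin/sh
# Majority (quorum) needed by an etcd cluster of N members: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many more members can fail before quorum is lost, given the
# configured cluster size and the number of members still healthy.
failures_tolerated() {
  size=$1
  healthy=$2
  echo $(( healthy - $(quorum "$size") ))
}

quorum 3                 # prints 2: majority for a 3-member cluster
failures_tolerated 3 3   # prints 1: all three masters healthy
failures_tolerated 3 2   # prints 0: one master down, as in this scenario
```

With one of three masters down, etcd still has its majority of 2, but any further failure would lose quorum, which is why the failed member should be replaced promptly.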

    Current cluster state

    1. Check node status
    • The failed node is already in the NotReady state.
    [root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
    NAME                                 STATUS     ROLES    AGE   VERSION
    master-0.ocp4-cluster1.guachen.ocp   Ready      master   21d   v1.16.2
    master-1.ocp4-cluster1.guachen.ocp   Ready      master   21d   v1.16.2
    master-2.ocp4-cluster1.guachen.ocp   NotReady   master   21d   v1.16.2
    
    [root@kr8s-ocp-tools ~]# oc get pod -A|grep -Ev "Running|Completed"
    NAMESPACE                                               NAME                                                              READY   STATUS        RESTARTS   AGE
    openshift-machine-config-operator                       etcd-quorum-guard-58696fdc97-422jn                                1/1     Terminating   0          144m
    openshift-machine-config-operator                       etcd-quorum-guard-58696fdc97-nsnnp                                0/1     Pending       0          6m13s
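
When the node list is long, the failed node can be picked out mechanically. This is a small sketch that filters the table printed by `oc get nodes`; the helper name `not_ready_nodes` is an assumption made for illustration, and the sample input is the listing shown above.

```shell
#!/bin/sh
# Read `oc get nodes` table output on stdin and print every node whose
# STATUS column is not exactly "Ready" (the header row is skipped).
# Note: a cordoned node ("Ready,SchedulingDisabled") is reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input copied from the listing above.
not_ready_nodes <<'EOF'
NAME                                 STATUS     ROLES    AGE   VERSION
master-0.ocp4-cluster1.guachen.ocp   Ready      master   21d   v1.16.2
master-1.ocp4-cluster1.guachen.ocp   Ready      master   21d   v1.16.2
master-2.ocp4-cluster1.guachen.ocp   NotReady   master   21d   v1.16.2
EOF

# Intended use against a live cluster (not run here):
#   oc get nodes -l node-role.kubernetes.io/master | not_ready_nodes
```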
    
    2. Check etcd cluster-health
    • Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp.
      • The etcd cluster is currently degraded: only two members are available.
    [root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
    [core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    sh-4.2# export ETCDCTL_API=2
    sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
      --ca-file=/etc/ssl/etcd/ca.crt     \
      --cert-file=$(find /etc/ssl/ -name '*peer*crt')     \
      --key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
    member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
    failed to check the health of member d8cb362c01859289 on https://10.72.44.174:2379: Get https://10.72.44.174:2379/health: dial tcp 10.72.44.174:2379: connect: no route to host
    member d8cb362c01859289 is unreachable: [https://10.72.44.174:2379] are all unreachable
    member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
    cluster is degraded
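
The member ID needed for the later removal step can be read straight from this output. The sketch below extracts it with `awk`; the helper name `unreachable_members` is hypothetical, and the parsing assumes the etcdctl v2 `cluster-health` line format shown above.

```shell
#!/bin/sh
# Read etcdctl (API v2) `cluster-health` output on stdin and print the ID
# of every member reported as "member <id> is unreachable: ...".
unreachable_members() {
  awk '$1 == "member" && $3 == "is" && $4 == "unreachable:" { print $2 }'
}

# Sample input copied from the output above.
unreachable_members <<'EOF'
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
failed to check the health of member d8cb362c01859289 on https://10.72.44.174:2379: Get https://10.72.44.174:2379/health: dial tcp 10.72.44.174:2379: connect: no route to host
member d8cb362c01859289 is unreachable: [https://10.72.44.174:2379] are all unreachable
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is degraded
EOF
```

For the sample above this prints `d8cb362c01859289`.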
    

    Recovery procedure

    1. Delete the failed node

    [root@kr8s-ocp-tools ~]# oc delete node master-2.ocp4-cluster1.guachen.ocp
    node "master-2.ocp4-cluster1.guachen.ocp" deleted
    [root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
    NAME                                 STATUS   ROLES    AGE   VERSION
    master-0.ocp4-cluster1.guachen.ocp   Ready    master   22d   v1.16.2
    master-1.ocp4-cluster1.guachen.ocp   Ready    master   22d   v1.16.2
    

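    2. Remove the failed etcd member

    • Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp.
    • Pass `member remove` the ID of the member that cluster-health reports as unreachable in your cluster.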
    [root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
    [core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    sh-4.2# export ETCDCTL_API=2
    sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
      --ca-file=/etc/ssl/etcd/ca.crt     \
      --cert-file=$(find /etc/ssl/ -name '*peer*crt')     \
      --key-file=$(find /etc/ssl/ -name '*peer*key') member remove 3d95fa872c4a2282
    Removed member 3d95fa872c4a2282 from cluster
    sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
      --ca-file=/etc/ssl/etcd/ca.crt     \
      --cert-file=$(find /etc/ssl/ -name '*peer*crt')     \
      --key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
    member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
    member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
    cluster is healthy
    

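    3. Add a new node as a master to restore the full high-availability cluster

    • In a disconnected cluster, a node is added the same way as during the initial deployment: boot an RHCOS node with the master ign (Ignition) file.
      • If the ign file used for that node at deployment time still exists, it can be reused; otherwise regenerate it the same way as during deployment.
      • See the cluster deployment steps for details.
    • Approve the CSRs generated by the newly added node; there are 4 of them.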
    [root@kr8s-ocp-tools ~]# oc get csr -o name | xargs oc adm certificate approve
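
The command above approves every CSR in the cluster, including ones that are already approved. A more selective variant is sketched below: it picks only the requests whose CONDITION column is `Pending`. The helper name `pending_csrs` and the sample rows (CSR names, ages) are made up for illustration; only the column layout of `oc get csr` is assumed.

```shell
#!/bin/sh
# Read `oc get csr` table output on stdin and print the names of requests
# whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Hypothetical sample input, for illustration only.
pending_csrs <<'EOF'
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-aaaaa   12m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-bbbbb   2m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-ccccc   1m    system:node:master-2.ocp4-cluster1.guachen.ocp                              Pending
EOF

# Intended use on a host with cluster-admin access (not run here):
#   oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

A new node's requests may appear in more than one round, so the approval may need to be repeated until no Pending entries remain.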
    

    4. Restore etcd membership to a full etcd cluster
    a. Deploy the etcd-signer Pod

    • Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp.

    i. Log in to the OpenShift cluster

    [root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp
    # A user with cluster-admin privileges is required
    [core@master-0 ~]$ oc login https://localhost:6443
    Authentication required for https://localhost:6443 (openshift)
    Username: admin
    Password: 
    Login successful.
    

    ii. Get the pull specification of the kube-etcd-signer-server image

    export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)
    

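    The command above returns a quay.io pull specification. In a disconnected environment it must be replaced with one from the local registry:

    export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull <your-local-registry>:5000/ocp4/openshift4:<your-version>-kube-etcd-signer-server |awk '{print $7}')
    ### For example, in my environment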
    export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-etcd-signer-server |awk '{print $7}')
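
Several images in this procedure follow the same naming pattern in the mirror, so the reference can be built by a small helper instead of being typed by hand each time. This is only a sketch: the function name `local_image` is hypothetical, and the `ocp4/openshift4` repository path and `<version>-<component>` tag layout are taken from this article's environment and may differ in yours.

```shell
#!/bin/sh
# Build "<registry>/ocp4/openshift4:<version>-<component>" for a release
# image mirrored into the local registry (layout as used in this article).
local_image() {
  registry=$1 version=$2 component=$3
  printf '%s/ocp4/openshift4:%s-%s\n' "$registry" "$version" "$component"
}

# Print the references for the components this procedure pulls.
for c in kube-etcd-signer-server machine-config-operator kube-client-agent etcd; do
  local_image kr8s-ocp-tools:5000 4.3.8 "$c"
done

# Intended use on a master node (not run here):
#   export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull \
#     "$(local_image kr8s-ocp-tools:5000 4.3.8 kube-etcd-signer-server)" | awk '{print $7}')
```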
    

    iii. Generate the kube-etcd-cert-signer.yaml file

    [core@master-0 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh master-0.ocp4-cluster1.guachen.ocp 
    

    iv. Create the etcd-signer Pod

    oc create -f assets/manifests/kube-etcd-cert-signer.yaml
    

    b. Join the newly added master node to the etcd cluster

    • Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp.
      i. Log in to the OpenShift cluster
    [root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp
    [core@master-2 ~]$ oc login https://localhost:6443
    Authentication required for https://localhost:6443 (openshift)
    Username: admin
    Password: 
    Login successful.
    

    ii. Set the environment variables needed to restore the etcd cluster (required by the etcd-member-recover.sh script)

    export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)
    export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)
    

    The commands above resolve the values from quay.io. In a disconnected environment they must be replaced with ones from the local registry:

    # Substitute your own local registry and version
    [core@master-2 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-machine-config-operator |awk '{print $7}')
    [core@master-2 ~]$ export KUBE_CLIENT_AGENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-client-agent |awk '{print $7}')
    

    iii. Edit openshift-recovery-tools so that the etcd image it uses points to the local registry

    # Substitute your own local registry and version
    [core@master-2 ~]$ export ETCDIMG=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd |awk '{print $7}')
    [core@master-2 ~]$ sudo -E sed -i "s?local etcdimg=.*?local etcdimg=\"$ETCDIMG\"?g" /usr/local/bin/openshift-recovery-tools
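
Before editing the real script in place, the substitution can be tried on a scratch copy. The sketch below runs the same `sed` expression against a temporary file; the file contents are a hypothetical stand-in, since the actual line inside openshift-recovery-tools is not shown in this article, and `sed -i` is assumed to be GNU sed.

```shell
#!/bin/sh
# Work on a scratch file so the real script is untouched.
tmp=$(mktemp)

# Hypothetical stand-in for the relevant line in openshift-recovery-tools.
printf '%s\n' 'run_etcd() {' '  local etcdimg=quay.io/example/etcd@sha256:0000' '}' > "$tmp"

ETCDIMG='kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd'

# Same expression as above. "?" is used as the s-command delimiter, so the
# "/" characters in the image reference need no escaping.
sed -i "s?local etcdimg=.*?local etcdimg=\"$ETCDIMG\"?g" "$tmp"

# Show the rewritten line, then clean up.
result=$(grep 'local etcdimg=' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Running `grep 'local etcdimg=' /usr/local/bin/openshift-recovery-tools` after the real edit gives the same kind of confirmation.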
    

    iv. Run the etcd membership recovery script, etcd-member-recover.sh

    sudo -E /usr/local/bin/etcd-member-recover.sh $IP etcd-member-$hostname
    
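    • IP is the IP address of a master node that was healthy before the recovery, for example 10.72.44.172 for master-0.ocp4-cluster1.guachen.ocp.
    • hostname is the hostname of the node whose etcd member is being restored, for example master-2.ocp4-cluster1.guachen.ocp.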
    [core@master-2 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.72.44.172 etcd-member-master-2.ocp4-cluster1.guachen.ocp
    4320daf71e2d45927d66c6a74f46faa6a1bfe7cabb708d81344255fdc289b5bb
    etcdctl version: 3.3.17
    API version: 3.3
    Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
    Backing up /etc/etcd/etcd.conf to ./assets/backup/
    Trying to backup etcd client certs..
    etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-9 backing up to ./assets/backup/
    Stopping etcd..
    Waiting for etcd-member to stop
    Waiting for etcd-member to stop
    Waiting for etcd-member to stop
    Waiting for etcd-member to stop
    Local etcd snapshot file not found, backup skipped..
    Backing up etcd certificates..
    Removing etcd certs..
    Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template
    Populating template ./assets/tmp/etcd-generate-certs.stage1
    Populating template ./assets/tmp/etcd-generate-certs.stage2
    Starting etcd client cert recovery agent..
    Waiting for certs to generate... (1/60)
    Waiting for certs to generate... (2/60)
    Waiting for certs to generate... (3/60)
    Waiting for certs to generate... (4/60)
    Stopping cert recover..
    Waiting for generate-certs to stop
    Patching etcd-member manifest..
    Updating etcd membership..
    Removing etcd data_dir /var/lib/etcd..
    Member 3c6458d18aa43907 added to cluster a792367fd9b198cc
    
    ETCD_NAME="etcd-member-master-2.ocp4-cluster1.guachen.ocp"
    ETCD_INITIAL_CLUSTER="etcd-member-master-2.ocp4-cluster1.guachen.ocp=https://etcd-2.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-1.ocp4-cluster1.guachen.ocp=https://etcd-1.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-0.ocp4-cluster1.guachen.ocp=https://etcd-0.ocp4-cluster1.guachen.ocp:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-2.ocp4-cluster1.guachen.ocp:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    Starting etcd..
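
The `ETCD_INITIAL_CLUSTER` value printed by the script is one long comma-separated string, which makes it easy to miss a missing peer. As a quick sanity check, it can be split into one entry per line; the helper name `initial_cluster_members` is an assumption made for this sketch.

```shell
#!/bin/sh
# Split an ETCD_INITIAL_CLUSTER value into one "name=peer-url" entry per line.
initial_cluster_members() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Value copied from the script output above.
ETCD_INITIAL_CLUSTER="etcd-member-master-2.ocp4-cluster1.guachen.ocp=https://etcd-2.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-1.ocp4-cluster1.guachen.ocp=https://etcd-1.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-0.ocp4-cluster1.guachen.ocp=https://etcd-0.ocp4-cluster1.guachen.ocp:2380"

# List the entries, then count them (three are expected here).
initial_cluster_members "$ETCD_INITIAL_CLUSTER"
initial_cluster_members "$ETCD_INITIAL_CLUSTER" | grep -c .
```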
    

    Verify the result

    1. Check the node and etcd pod status
    [root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master
    NAME                                 STATUS   ROLES    AGE   VERSION
    master-0.ocp4-cluster1.guachen.ocp   Ready    master   22d   v1.16.2
    master-1.ocp4-cluster1.guachen.ocp   Ready    master   22d   v1.16.2
    master-2.ocp4-cluster1.guachen.ocp   Ready    master   13m   v1.16.2
    
    [root@kr8s-ocp-tools ~]# oc -n openshift-etcd get pod -owide
    NAME                                             READY   STATUS    RESTARTS   AGE   IP             NODE                                 NOMINATED NODE   READINESS GATES
    etcd-member-master-0.ocp4-cluster1.guachen.ocp   2/2     Running   2          22d   10.72.44.172   master-0.ocp4-cluster1.guachen.ocp   <none>           <none>
    etcd-member-master-1.ocp4-cluster1.guachen.ocp   2/2     Running   2          22d   10.72.44.173   master-1.ocp4-cluster1.guachen.ocp   <none>           <none>
    etcd-member-master-2.ocp4-cluster1.guachen.ocp   2/2     Running   0          68s   10.72.44.174   master-2.ocp4-cluster1.guachen.ocp   <none>           <none>
    
    2. Check etcd cluster-health
    • Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp.
    [root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp
    [core@master-2 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    sh-4.2# export ETCDCTL_API=2
    sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \
      --ca-file=/etc/ssl/etcd/ca.crt     \
      --cert-file=$(find /etc/ssl/ -name '*peer*crt')     \
      --key-file=$(find /etc/ssl/ -name '*peer*key') cluster-health
    member 3c6458d18aa43907 is healthy: got healthy result from https://10.72.44.174:2379
    member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
    member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
    cluster is healthy
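
The same check can be expressed as a pass/fail test, which is convenient if the verification is scripted. The sketch below succeeds only when the output contains the expected number of healthy members and the final `cluster is healthy` line; the function name `check_cluster_health` is hypothetical, and the parsing assumes the etcdctl v2 output format shown above.

```shell
#!/bin/sh
# Read etcdctl (API v2) `cluster-health` output on stdin. Exit 0 only if
# exactly $1 members are reported healthy and the summary line says the
# cluster is healthy; exit non-zero otherwise.
check_cluster_health() {
  awk -v want="$1" '
    / is healthy: /                                   { healthy++ }
    /^[[:space:]]*cluster is healthy[[:space:]]*$/    { ok = 1 }
    END { exit !(ok && healthy == want) }
  '
}

# Sample input copied from the output above.
check_cluster_health 3 <<'EOF' && echo "etcd OK: 3 healthy members"
member 3c6458d18aa43907 is healthy: got healthy result from https://10.72.44.174:2379
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is healthy
EOF
```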
    

    The etcd cluster now has 3 members and the cluster status is healthy.

    3. After the recovery is complete, delete the etcd-signer pod
    [root@kr8s-ocp-tools ~]# oc delete pod -n openshift-config etcd-signer
    


    Article link: https://www.haomeiwen.com/subject/mdfmihtx.html