Upgrading an etcd cluster from v3.2.0 to v3.3.0


Author: 夏天的味道_c711 | Published 2020-04-20 17:25

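    Overview:

    Kubernetes environment: OpenShift 3.11

    A gRPC-go bug in etcd v3.2.x corrupted the data files, so etcd could not start. (N-1)/2 of the cluster's nodes had failed, which left the cluster in an abnormal state.

    With the cluster recovered, the next step is to upgrade the version. (Note: for the recovery procedure, see my other document:)

    Error message: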

    Description of problem:
    Etcd can't start. "open wal error: wal: file not found" is found in logs 
    

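    Requirements for upgrading an etcd cluster

    • Upgrade one minor version at a time

    Upgrade only one minor version at a time (see the official Kubernetes reference documentation).

    For example, we cannot upgrade directly from 2.1.x to 2.3.x. Within patch releases it is possible to upgrade and downgrade between arbitrary versions. Starting a cluster on any intermediate minor release, waiting until the cluster is healthy, and then shutting the cluster down performs the migration. For example, to upgrade from 2.1.x to 2.3.y, it is enough to start etcd in version 2.2.z, wait until it is healthy, stop it, and then start version 2.3.y.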
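
    The one-minor-version rule can be checked mechanically before planning an upgrade path. The following is a minimal sketch, not part of etcd: is_single_minor_step is a hypothetical helper name, and it assumes plain MAJOR.MINOR.PATCH version strings with an optional leading "v".

```shell
#!/bin/sh
# Hypothetical helper (not an etcd tool): succeed if moving from version $1
# to version $2 stays in the same major version and raises the minor
# version by at most one. Patch-only changes (step 0) are allowed.
is_single_minor_step() {
    from=${1#v}; to=${2#v}                 # drop an optional leading "v"
    from_major=${from%%.*}; to_major=${to%%.*}
    from_rest=${from#*.};   to_rest=${to#*.}
    from_minor=${from_rest%%.*}; to_minor=${to_rest%%.*}
    [ "$from_major" -eq "$to_major" ] || return 1
    step=$((to_minor - from_minor))
    [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

is_single_minor_step 3.2.22 3.3.20 && echo "3.2.22 -> 3.3.20: allowed"
is_single_minor_step 2.1.0 2.3.0 || echo "2.1.0 -> 2.3.0: go through 2.2.z first"
```

    The upgrade in this article (3.2.22 to 3.3.20) is a single minor step, so no intermediate version is needed.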
    

    Check the etcd version and the cluster version

    # curl -k \
    --cert /etc/etcd/server.crt \
    --key /etc/etcd/server.key \
    https://10.x.x.x:2379/version
    

    The output shows etcd server version v3.2.22 and cluster version 3.2:

    {"etcdserver":"3.2.22","etcdcluster":"3.2.0"}
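
    When this check is scripted, the two fields can be pulled out of the JSON with sed, which avoids depending on jq being installed on the master. This is only a sketch: version_field is a hypothetical helper name, and it assumes the compact one-line JSON shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one string field from the
# /version JSON. $1 = JSON text, $2 = field name.
version_field() {
    printf '%s\n' "$1" | sed -n "s/.*\"$2\":\"\([^\"]*\)\".*/\1/p"
}

# Sample response copied from the output above.
json='{"etcdserver":"3.2.22","etcdcluster":"3.2.0"}'
echo "server:  $(version_field "$json" etcdserver)"
echo "cluster: $(version_field "$json" etcdcluster)"
```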
    

    Check the cluster health

    etcdctl \
    --ca-file=/etc/etcd/ca.crt \
    --cert-file=/etc/etcd/server.crt \
    --key-file=/etc/etcd/server.key  \
    --endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
    cluster-health
    
    
    member 18ffd2676eb9c81a is healthy: got healthy result from https://10.x.x.x:2379
    member 2a20a3ab7455a879 is healthy: got healthy result from https://10.x.x.x:2379
    member 5249f7cdef6c5a61 is healthy: got healthy result from https://10.x.x.x:2379
    cluster is healthy
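
    In a scripted upgrade it helps to turn this output into a pass/fail result. The sketch below reads cluster-health output on stdin and succeeds only if the final verdict is "cluster is healthy" and the expected number of members report healthy. cluster_healthy is a hypothetical helper name, and the line formats are assumed to match the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: $1 = expected number of healthy members.
# Reads the output of `etcdctl cluster-health` (v2 API) on stdin.
cluster_healthy() {
    expected=$1
    out=$(cat)
    # Count "member <id> is healthy" lines; grep -c prints 0 when none match.
    healthy=$(printf '%s\n' "$out" | grep -c 'member [0-9a-f]* is healthy' || true)
    printf '%s\n' "$out" | grep -q 'cluster is healthy' &&
        [ "$healthy" -eq "$expected" ]
}

# Usage (same flags as the command above):
#   etcdctl --ca-file=... --cert-file=... --key-file=... --endpoints=... \
#       cluster-health | cluster_healthy 3 || echo "cluster not healthy" >&2
```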
    
    

    Check the endpoint status

    # ETCDCTL_API=3 etcdctl \
    --cacert=/etc/etcd/ca.crt \
    --cert=/etc/etcd/server.crt \
    --key=/etc/etcd/server.key  \
    --endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
    --write-out=table \
    endpoint status 
    
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | https://10.x.x.x:2379 | 2a20a3ab7455a879 |  3.2.22 |  481 MB |     false |         4 |     986949 |
    | https://10.x.x.x:2379 | 18ffd2676eb9c81a |  3.2.22 |  481 MB |      true |         4 |     986949 |
    | https://10.x.x.x:2379 | 5249f7cdef6c5a61 |  3.2.22 |  481 MB |     false |         4 |     986949 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
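
    Two things are worth reading off this table before and after the upgrade: whether every member reports the same version, and whether exactly one member is the leader. A minimal awk sketch of that check follows; summarize_status is a hypothetical helper name, and the column order is assumed to be the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads `endpoint status --write-out=table` output on
# stdin and prints how many distinct versions and how many leaders it saw.
summarize_status() {
    awk -F'|' '
        # Data rows have 9 pipe-separated fields; skip borders and the header.
        NF >= 8 && $2 !~ /ENDPOINT/ {
            gsub(/ /, "", $4)          # VERSION column
            gsub(/ /, "", $6)          # IS LEADER column
            versions[$4] = 1
            if ($6 == "true") leaders++
        }
        END {
            n = 0
            for (v in versions) n++
            printf "versions=%d leaders=%d\n", n, leaders
        }'
}

# Usage:
#   ETCDCTL_API=3 etcdctl ... --write-out=table endpoint status | summarize_status
```

    For the table above this reports one version and one leader. During a rolling change, more than one version would show up here until every member has been restarted.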
    
    

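    Official etcd upgrade documentation

    Back up the data

    • 1. Back up the database
    • 2. Back up the data directory

    Back up the database

    Create a directory to hold the backup

    mkdir /var/lib/etcdbackup/
    

    Take the database backup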

    ETCDCTL_API=3 /bin/etcdctl   \
    --endpoints="https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379"    \
    --cacert=/etc/etcd/ca.crt    \
    --cert=/etc/etcd/server.crt   \
    --key=/etc/etcd/server.key   \
    snapshot save   /var/lib/etcdbackup/`hostname`-etcd_`date +%Y%m%d%H%M`.db
    

    Check the backup

    # ls -lh /var/lib/etcdbackup/
    total 459M
    -rw-r--r--. 1 root root 459M Apr 15 14:11 szpbs-okd-prd-master1-etcd_202004151411.db
    
    
    # ETCDCTL_API=3 /bin/etcdctl   \
    --endpoints="https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379"  \
    --cacert=/etc/etcd/ca.crt    \
    --cert=/etc/etcd/server.crt   \
    --key=/etc/etcd/server.key   \
    --write-out=table \
    snapshot status /var/lib/etcdbackup/szpbs-okd-prd-master1-etcd_202004151411.db
    
    
    +----------+----------+------------+------------+
    |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
    +----------+----------+------------+------------+
    | ae62eefd | 52199066 |      22295 |     481 MB |
    +----------+----------+------------+------------+
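
    Before touching the cluster, a script can refuse to continue when the snapshot file is missing or empty. This is a small sketch under that assumption; check_snapshot_file is a hypothetical helper name, and it only checks the file on disk, not the snapshot's integrity (that is what snapshot status is for).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the snapshot file exists and is
# non-empty; print its size in bytes.
check_snapshot_file() {
    if [ ! -s "$1" ]; then
        echo "snapshot $1 is missing or empty" >&2
        return 1
    fi
    echo "snapshot $1: $(wc -c < "$1") bytes"
}

# Usage:
#   check_snapshot_file /var/lib/etcdbackup/szpbs-okd-prd-master1-etcd_202004151411.db || exit 1
```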
    

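    Back up the data directory

    Make a copy of the old etcd data directory (so that a rollback can restore it directly)

    # cp -r /var/lib/etcd /var/lib/etcd-bak-`date +%Y%m%d%H%M`
    
    Rather than deleting the directory outright, move it to /tmp
    # mv /var/lib/etcd /tmp/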
    

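    Stop the etcd service on every node in the cluster.

    My environment is OpenShift 3.11, so the etcd container is stopped as follows (to start it again, just move the file back).

    Run on all three nodes: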

    mkdir /etc/origin/node/pods-stopped/
    
    mv   /etc/origin/node/pods/etcd.yaml   /etc/origin/node/pods-stopped/
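
    The stop and start steps can be wrapped in two small functions so they are repeatable. This is a sketch, not an OpenShift tool: stop_etcd_pod and start_etcd_pod are hypothetical names, and the PODS_DIR and STOPPED_DIR variables are assumptions added so the functions can be tried against a scratch directory first. mkdir -p is used so a second run does not fail when the directory already exists.

```shell
#!/bin/sh
# Paths default to the ones used in this article; override them for a dry run.
PODS_DIR=${PODS_DIR:-/etc/origin/node/pods}
STOPPED_DIR=${STOPPED_DIR:-/etc/origin/node/pods-stopped}

# Moving the static pod manifest out of the pods directory makes the node
# stop the etcd container.
stop_etcd_pod() {
    mkdir -p "$STOPPED_DIR" &&
        mv "$PODS_DIR/etcd.yaml" "$STOPPED_DIR/"
}

# Moving the manifest back makes the node start the container again.
start_etcd_pod() {
    mv "$STOPPED_DIR/etcd.yaml" "$PODS_DIR/"
}
```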
    

    Change the image version in the etcd pod manifest.

    # cat /etc/origin/node/pods-stopped/etcd.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
      labels:
        openshift.io/component: etcd
        openshift.io/control-plane: 'true'
      name: master-etcd
      namespace: kube-system
    spec:
      containers:
      - args:
        - '#!/bin/sh
    
          set -o allexport
    
          source /etc/etcd/etcd.conf
    
          exec etcd
    
          '
        command:
        - /bin/sh
        - -c
        image: quay.io/coreos/etcd:v3.3.20  # change the image tag here
        livenessProbe:
          exec:
            command:
            - etcdctl
            - --cert-file
            - /etc/etcd/peer.crt
            - --key-file
            - /etc/etcd/peer.key
            - --ca-file
            - /etc/etcd/ca.crt
            - --endpoints
            - https://10.x.x.x:2379
            - cluster-health
          initialDelaySeconds: 45
        name: etcd
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/etcd/
          name: master-config
          readOnly: true
        - mountPath: /var/lib/etcd/
          name: master-data
        - mountPath: /etc/localtime
          name: host-localtime
        workingDir: /var/lib/etcd
      hostNetwork: true
      priorityClassName: system-node-critical
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /etc/etcd/
        name: master-config
      - hostPath:
          path: /var/lib/etcd
        name: master-data
      - hostPath:
          path: /etc/localtime
        name: host-localtime
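
    With three nodes to edit, the tag change can be done with sed instead of by hand. The following is a minimal sketch that assumes the image line looks exactly like the one in the manifest above (quay.io/coreos/etcd followed by a tag); set_etcd_image_tag is a hypothetical helper name. It writes the result back with cat so the manifest keeps its original owner and permissions.

```shell
#!/bin/sh
# Hypothetical helper: $1 = manifest path, $2 = new image tag (e.g. v3.3.20).
set_etcd_image_tag() {
    tmp=$(mktemp) || return 1
    # Replace only the tag; anything after it on the line is left alone.
    sed "s|\(image: quay.io/coreos/etcd:\)[^ ]*|\1$2|" "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage, while the manifest is parked in pods-stopped:
#   set_etcd_image_tag /etc/origin/node/pods-stopped/etcd.yaml v3.3.20
```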
    

    Start the etcd service on every node in the cluster with this same configuration.

    mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
    

    The version has now been upgraded to 3.3

    # curl -k  --cert /etc/etcd/server.crt --key /etc/etcd/server.key https://10.x.x.x:2379/version
    
    {"etcdserver":"3.3.20","etcdcluster":"3.3.0"}
    
    2020-04-15 09:36:33.699963 I | etcdserver/api: enabled capabilities for version 3.2
    2020-04-15 09:36:33.700255 N | etcdserver/membership: updated the cluster version from 3.2 to 3.3
    2020-04-15 09:36:33.700348 I | etcdserver/api: enabled capabilities for version 3.3
    2020-04-15 09:36:33.700731 I | etcdserver: 346ef4f1d2df920c as single-node; fast-forwarding 9 ticks (election ticks 10)
    2020-04-15 09:36:34.572518 I | raft: 346ef4f1d2df920c is starting a new election at term 3
    2020-04-15 09:36:34.572544 I | raft: 346ef4f1d2df920c became candidate at term 4
    2020-04-15 09:36:34.572555 I | raft: 346ef4f1d2df920c received MsgVoteResp from 346ef4f1d2df920c at term 4
    2020-04-15 09:36:34.572563 I | raft: 346ef4f1d2df920c became leader at term 4
    2020-04-15 09:36:34.572568 I | raft: raft.node: 346ef4f1d2df920c elected leader 346ef4f1d2df920c at term 4
    2020-04-15 09:36:34.573011 I | etcdserver: published {Name:node1 ClientURLs:[http://10.32.60.56:2379]} to cluster 60675d894ec6ef3
    2020-04-15 09:36:34.573565 I | embed: ready to serve client requests
    2020-04-15 09:36:34.574446 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
    2020-04-15 09:37:11.113622 I | etcdserver/api/etcdhttp: /health OK (status code 200)
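
    The cluster version only moves to 3.3 once every member is running the new binary, so a script has to poll for it rather than check once. Below is a minimal polling sketch. wait_for_cluster_version and the POLL_INTERVAL variable are hypothetical names, and the command that fetches the /version JSON is passed in as arguments so the same curl call used in this article can be reused.

```shell
#!/bin/sh
# Hypothetical helper.
#   $1 = expected cluster version (e.g. 3.3.0)
#   $2 = maximum number of attempts
#   remaining arguments = command that prints the /version JSON
wait_for_cluster_version() {
    want=$1; tries=$2; shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -F: match the version literally, dots included.
        if "$@" 2>/dev/null | grep -qF "\"etcdcluster\":\"$want\""; then
            return 0
        fi
        i=$((i + 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}

# Usage:
#   wait_for_cluster_version 3.3.0 24 \
#       curl -sk --cert /etc/etcd/server.crt --key /etc/etcd/server.key \
#       https://10.x.x.x:2379/version || echo "cluster version not updated" >&2
```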
    

    Finally, check the cluster information once more.

    Check the cluster health

    etcdctl \
    --ca-file=/etc/etcd/ca.crt \
    --cert-file=/etc/etcd/server.crt \
    --key-file=/etc/etcd/server.key  \
    --endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
    cluster-health
    

    Check the endpoint status

    ETCDCTL_API=3 etcdctl \
    --cacert=/etc/etcd/ca.crt \
    --cert=/etc/etcd/server.crt \
    --key=/etc/etcd/server.key  \
    --endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
    --write-out=table \
    endpoint status 
    

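    The upgrade is complete.


    Rolling back

    1. Stop all etcd services.
    2. Restore the data directory.
    3. Start the old version of etcd.
    4. Verify the cluster state.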
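
    Steps 1 to 3 can be sketched for a single node as below; step 4 is the same version and health check used earlier. This is an illustration under stated assumptions, not a tested procedure: rollback_node, the three path variables, and the ".upgraded" suffix are hypothetical, the backup is assumed to be the directory copy made before the upgrade, and the etcd container must have actually exited before the data directory is swapped. It has to be run on every member.

```shell
#!/bin/sh
# Paths default to the ones used in this article; override them for a dry run.
PODS_DIR=${PODS_DIR:-/etc/origin/node/pods}
STOPPED_DIR=${STOPPED_DIR:-/etc/origin/node/pods-stopped}
ETCD_DATA=${ETCD_DATA:-/var/lib/etcd}

# Hypothetical helper.
#   $1 = backup copy of the data directory made before the upgrade
#   $2 = image tag that was running before the upgrade (e.g. v3.2.22)
rollback_node() {
    backup_dir=$1
    old_tag=$2
    [ -d "$backup_dir" ] || { echo "no backup at $backup_dir" >&2; return 1; }

    # 1. Stop etcd: move the static pod manifest away if it is still in place.
    mkdir -p "$STOPPED_DIR" || return 1
    if [ -f "$PODS_DIR/etcd.yaml" ]; then
        mv "$PODS_DIR/etcd.yaml" "$STOPPED_DIR/" || return 1
    fi

    # 2. Restore the data directory; set the upgraded one aside, do not delete it.
    if [ -e "$ETCD_DATA" ]; then
        mv "$ETCD_DATA" "$ETCD_DATA.upgraded" || return 1
    fi
    cp -a "$backup_dir" "$ETCD_DATA" || return 1

    # 3. Point the manifest back at the old image tag, then start etcd again.
    m=$STOPPED_DIR/etcd.yaml
    sed "s|\(image: quay.io/coreos/etcd:\)[^ ]*|\1$old_tag|" "$m" > "$m.new" &&
        cat "$m.new" > "$m" && rm -f "$m.new" || return 1
    mv "$m" "$PODS_DIR/"
}

# Usage (the backup directory name is whatever the cp step produced):
#   rollback_node /var/lib/etcd-bak-<timestamp> v3.2.22
```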
