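Introduction:
Kubernetes environment: OpenShift 3.11
A gRPC-go bug affecting etcd v3.2.x corrupted the etcd data files, so etcd could not start. (N-1)/2 of the cluster members failed, which left the cluster in an abnormal state.
With the cluster recovered, the next step is to upgrade etcd. (Note: for the recovery procedure, see my other article.)
Error message: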
Description of problem:
Etcd can't start. "open wal error: wal: file not found" is found in logs
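For context on the failure count mentioned in the introduction: an etcd cluster of N members needs a majority (quorum) of floor(N/2)+1 members to accept writes, so it can lose at most (N-1)/2 members, and once that many are down there is no margin left. The sketch below only does this arithmetic; the function names are illustrative and not part of etcd.

```sh
#!/bin/sh
# Quorum size for an etcd cluster with $1 members: floor(N/2) + 1.
etcd_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}
```

For the three-member cluster in this article, `etcd_quorum 3` gives 2 and `etcd_fault_tolerance 3` gives 1.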
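Requirements for upgrading an etcd cluster
- Upgrade only one minor version at a time (see the official Kubernetes documentation).
For example, we cannot upgrade directly from 2.1.x to 2.3.x. Within a minor version, it is possible to upgrade and downgrade between any patch versions. To migrate across more than one minor version, start the cluster on each intermediate minor version, wait until it is healthy, and then shut it down. For example, to upgrade from 2.1.x to 2.3.y, start etcd at version 2.2.z, wait until it is healthy, stop it, and then start version 2.3.y.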
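The one-minor-version rule above can be checked before touching the cluster. The following is a minimal sketch: `can_upgrade_directly` is a hypothetical helper, it expects versions in MAJOR.MINOR.PATCH form, and it assumes that moving back a minor version is not a direct path either.

```sh
#!/bin/sh
# Succeeds (exit 0) when going from version $1 to version $2 stays within
# the same major version and moves forward by at most one minor version.
# Versions are expected as MAJOR.MINOR.PATCH, with an optional leading "v".
can_upgrade_directly() {
    from=${1#v}
    to=${2#v}
    from_major=${from%%.*}
    to_major=${to%%.*}
    from_rest=${from#*.}
    to_rest=${to#*.}
    from_minor=${from_rest%%.*}
    to_minor=${to_rest%%.*}

    [ "$from_major" -eq "$to_major" ] || return 1
    step=$(( to_minor - from_minor ))
    # step 0: patch-level change only; step 1: next minor version.
    [ "$step" -eq 0 ] || [ "$step" -eq 1 ]
}
```

With this helper, `can_upgrade_directly 3.2.22 3.3.20` succeeds, while `can_upgrade_directly 2.1.0 2.3.0` fails because it skips 2.2.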
Check the etcd version and the cluster version
#curl -k \
--cert /etc/etcd/server.crt \
--key /etc/etcd/server.key \
https://10.x.x.x:2379/version
The output shows etcd server version 3.2.22 and cluster version 3.2:
{"etcdserver":"3.2.22","etcdcluster":"3.2.0"}
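When this check is scripted, the two fields of the /version response can be pulled apart. The helpers below are a sketch with made-up names; the sed expression only handles the flat, single-line JSON shown above, and a real JSON parser such as jq would be the safer choice for anything more complex.

```sh
#!/bin/sh
# Print the value of string field $1 from a flat JSON object read on stdin.
# Sketch for the simple /version payload only.
version_field() {
    sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Reduce MAJOR.MINOR.PATCH (e.g. 3.2.22) to MAJOR.MINOR (3.2).
major_minor() {
    echo "$1" | cut -d. -f1,2
}
```

For example, piping the curl output into `version_field etcdcluster` prints the cluster version, which `major_minor` can then shorten for comparison.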
Check the cluster health
etcdctl \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/server.crt \
--key-file=/etc/etcd/server.key \
--endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
cluster-health
member 18ffd2676eb9c81a is healthy: got healthy result from https://10.x.x.x:2379
member 2a20a3ab7455a879 is healthy: got healthy result from https://10.x.x.x:2379
member 5249f7cdef6c5a61 is healthy: got healthy result from https://10.x.x.x:2379
cluster is healthy
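In a script, the cluster-health output above can be turned into an exit status. This is a sketch that matches the v2-style text shown here; `check_cluster_health` is an illustrative name, and the exact wording of unhealthy output is not taken from this article.

```sh
#!/bin/sh
# Read `etcdctl cluster-health` output on stdin.
# Prints the number of members reported healthy and succeeds only if the
# final verdict line "cluster is healthy" is present.
check_cluster_health() {
    output=$(cat)
    healthy=$(printf '%s\n' "$output" | grep -c '^member .* is healthy' || true)
    echo "healthy members: $healthy"
    printf '%s\n' "$output" | grep -q '^cluster is healthy$'
}
```

The etcdctl command above would be piped into `check_cluster_health`, and the upgrade should only proceed when it returns success.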
Check the endpoint status
#ETCDCTL_API=3 etcdctl \
--cacert=/etc/etcd/ca.crt \
--cert=/etc/etcd/server.crt \
--key=/etc/etcd/server.key \
--endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
--write-out=table \
endpoint status
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.x.x.x:2379 | 2a20a3ab7455a879 | 3.2.22 | 481 MB | false | 4 | 986949 |
| https://10.x.x.x:2379 | 18ffd2676eb9c81a | 3.2.22 | 481 MB | true | 4 | 986949 |
| https://10.x.x.x:2379 | 5249f7cdef6c5a61 | 3.2.22 | 481 MB | false | 4 | 986949 |
+----------------------------+------------------+---------+---------+-----------+-----------+------------+
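Before an upgrade it is worth confirming that every member runs the same version and that exactly one leader exists. The awk sketch below summarizes a table like the one above; the function name is hypothetical and the column positions are assumed to match this output.

```sh
#!/bin/sh
# Summarize `etcdctl endpoint status --write-out=table` output from stdin:
# number of members, number of leaders, number of distinct versions.
summarize_endpoint_status() {
    awk -F'|' '
        # Data rows start with "|" and carry an endpoint URL in column 2.
        /^[|]/ && $2 ~ /https?:/ {
            gsub(/ /, "", $4)   # VERSION column
            gsub(/ /, "", $6)   # IS LEADER column
            versions[$4] = 1
            if ($6 == "true") leaders++
            members++
        }
        END {
            n = 0
            for (v in versions) n++
            printf "members=%d leaders=%d versions=%d\n", members, leaders, n
        }'
}
```

For the table above this prints `members=3 leaders=1 versions=1`; more than one distinct version would mean the members are not all on the same release.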
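Reference: the official etcd upgrade documentation
Back up the data
- 1. Back up the database
- 2. Back up the data directory
Back up the database
Create the backup directory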
mkdir /var/lib/etcdbackup/
Take the database snapshot
ETCDCTL_API=3 /bin/etcdctl \
--endpoints="https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379" \
--cacert=/etc/etcd/ca.crt \
--cert=/etc/etcd/server.crt \
--key=/etc/etcd/server.key \
snapshot save /var/lib/etcdbackup/`hostname`-etcd_`date +%Y%m%d%H%M`.db
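The snapshot file name above combines the host name and a minute-resolution timestamp. A small sketch of the same naming scheme, using `$(...)` instead of backticks; the function names are illustrative.

```sh
#!/bin/sh
# Build the snapshot file path from a directory, a host name and a
# timestamp, mirroring the <host>-etcd_<YYYYmmddHHMM>.db naming used above.
snapshot_path() {
    echo "$1/$2-etcd_$3.db"
}

# Same path for the current host and time.
snapshot_path_now() {
    snapshot_path "$1" "$(hostname)" "$(date +%Y%m%d%H%M)"
}
```

The result of `snapshot_path_now /var/lib/etcdbackup` can be passed to `snapshot save` in place of the inline backtick expression.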
Inspect the backup
#ls -lh /var/lib/etcdbackup/
total 459M
-rw-r--r--. 1 root root 459M Apr 15 14:11 szpbs-okd-prd-master1-etcd_202004151411.db
#ETCDCTL_API=3 /bin/etcdctl \
--endpoints="https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379" \
--cacert=/etc/etcd/ca.crt \
--cert=/etc/etcd/server.crt \
--key=/etc/etcd/server.key \
--write-out=table \
snapshot status /var/lib/etcdbackup/szpbs-okd-prd-master1-etcd_202004151411.db
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| ae62eefd | 52199066 | 22295 | 481 MB |
+----------+----------+------------+------------+
Back up the data directory
Back up the old etcd data directory so that it can be restored directly during a rollback.
#cp -r /var/lib/etcd /var/lib/etcd-bak-`date +%Y%m%d%H%M`
Rather than deleting the directory outright, we move it to /tmp.
#mv /var/lib/etcd /tmp/
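The directory copy can be paired with a quick verification so that a rollback does not depend on an incomplete backup. This is a sketch under the assumption that etcd is not writing to the directory while it is compared (otherwise the comparison may fail); `backup_dir` is a hypothetical helper.

```sh
#!/bin/sh
# Copy directory $1 to $2 preserving ownership, modes and timestamps,
# then compare the two trees. Fails if the destination already exists
# or if the copy differs from the source.
backup_dir() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    [ ! -e "$dest" ] || { echo "already exists: $dest" >&2; return 1; }
    cp -a "$src" "$dest" || return 1
    diff -r "$src" "$dest" >/dev/null
}
```

A call such as `backup_dir /var/lib/etcd "/var/lib/etcd-bak-$(date +%Y%m%d%H%M)"` mirrors the cp command above.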
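Stop the etcd service on every cluster member.
Because this environment is OpenShift 3.11, the etcd container is stopped as follows (to start it again, simply move the file back).
Run on all three nodes: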
mkdir /etc/origin/node/pods-stopped/
mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped/
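The stop and start steps are symmetric moves of the static pod manifest, so they can be wrapped in two small functions. A sketch only: the function names and the environment overrides are assumptions, while the default paths are the ones used in this article.

```sh
#!/bin/sh
# Directory the node watches for static pod manifests, and a holding
# directory for manifests of stopped pods. Both can be overridden.
PODS_DIR=${PODS_DIR:-/etc/origin/node/pods}
STOPPED_DIR=${STOPPED_DIR:-/etc/origin/node/pods-stopped}

# Stop a static pod by moving its manifest out of the watched directory.
stop_static_pod() {
    mkdir -p "$STOPPED_DIR" || return 1
    [ -f "$PODS_DIR/$1" ] || { echo "manifest not found: $PODS_DIR/$1" >&2; return 1; }
    mv "$PODS_DIR/$1" "$STOPPED_DIR/"
}

# Start it again by moving the manifest back.
start_static_pod() {
    [ -f "$STOPPED_DIR/$1" ] || { echo "manifest not found: $STOPPED_DIR/$1" >&2; return 1; }
    mv "$STOPPED_DIR/$1" "$PODS_DIR/"
}
```

On each node, `stop_static_pod etcd.yaml` corresponds to the two commands above, and `start_static_pod etcd.yaml` brings etcd back.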
Change the image version in the etcd pod definition.
# cat /etc/origin/node/pods/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ''
  labels:
    openshift.io/component: etcd
    openshift.io/control-plane: 'true'
  name: master-etcd
  namespace: kube-system
spec:
  containers:
  - args:
    - '#!/bin/sh

      set -o allexport

      source /etc/etcd/etcd.conf

      exec etcd

      '
    command:
    - /bin/sh
    - -c
    image: quay.io/coreos/etcd:v3.3.20 # change the image tag here
    livenessProbe:
      exec:
        command:
        - etcdctl
        - --cert-file
        - /etc/etcd/peer.crt
        - --key-file
        - /etc/etcd/peer.key
        - --ca-file
        - /etc/etcd/ca.crt
        - --endpoints
        - https://10.x.x.x:2379
        - cluster-health
      initialDelaySeconds: 45
    name: etcd
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/etcd/
      name: master-config
      readOnly: true
    - mountPath: /var/lib/etcd/
      name: master-data
    - mountPath: /etc/localtime
      name: host-localtime
    workingDir: /var/lib/etcd
  hostNetwork: true
  priorityClassName: system-node-critical
  restartPolicy: Always
  volumes:
  - hostPath:
      path: /etc/etcd/
    name: master-config
  - hostPath:
      path: /var/lib/etcd
    name: master-data
  - hostPath:
      path: /etc/localtime
    name: host-localtime
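Editing the tag by hand on three nodes is error-prone, so the change can be scripted. The sketch below assumes the manifest references the quay.io/coreos/etcd image as in the listing above; `set_etcd_image_tag` is a hypothetical helper, and it leaves any trailing comment on the line untouched.

```sh
#!/bin/sh
# Replace the tag of the quay.io/coreos/etcd image in manifest $1 with $2.
set_etcd_image_tag() {
    file=$1
    tag=$2
    scratch=$(mktemp) || return 1
    sed "s|\(image: quay.io/coreos/etcd:\)[^ ]*|\1$tag|" "$file" > "$scratch" \
        || { rm -f "$scratch"; return 1; }
    # Overwrite in place so the manifest keeps its ownership and mode.
    cat "$scratch" > "$file" || { rm -f "$scratch"; return 1; }
    rm -f "$scratch"
}
```

While etcd is stopped the manifest sits in the pods-stopped directory, so the call would be `set_etcd_image_tag /etc/origin/node/pods-stopped/etcd.yaml v3.3.20`.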
Start the etcd service on every cluster member with the same configuration.
mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
The version has now been upgraded to 3.3:
#curl -k --cert /etc/etcd/server.crt --key /etc/etcd/server.key https://10.x.x.x:2379/version
{"etcdserver":"3.3.20","etcdcluster":"3.3.0"}
The etcd log shows the cluster version being updated from 3.2 to 3.3:
2020-04-15 09:36:33.699963 I | etcdserver/api: enabled capabilities for version 3.2
2020-04-15 09:36:33.700255 N | etcdserver/membership: updated the cluster version from 3.2 to 3.3
2020-04-15 09:36:33.700348 I | etcdserver/api: enabled capabilities for version 3.3
2020-04-15 09:36:33.700731 I | etcdserver: 346ef4f1d2df920c as single-node; fast-forwarding 9 ticks (election ticks 10)
2020-04-15 09:36:34.572518 I | raft: 346ef4f1d2df920c is starting a new election at term 3
2020-04-15 09:36:34.572544 I | raft: 346ef4f1d2df920c became candidate at term 4
2020-04-15 09:36:34.572555 I | raft: 346ef4f1d2df920c received MsgVoteResp from 346ef4f1d2df920c at term 4
2020-04-15 09:36:34.572563 I | raft: 346ef4f1d2df920c became leader at term 4
2020-04-15 09:36:34.572568 I | raft: raft.node: 346ef4f1d2df920c elected leader 346ef4f1d2df920c at term 4
2020-04-15 09:36:34.573011 I | etcdserver: published {Name:node1 ClientURLs:[http://10.32.60.56:2379]} to cluster 60675d894ec6ef3
2020-04-15 09:36:34.573565 I | embed: ready to serve client requests
2020-04-15 09:36:34.574446 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2020-04-15 09:37:11.113622 I | etcdserver/api/etcdhttp: /health OK (status code 200)
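The cluster version only moves to 3.3 once every member runs 3.3, so a script may need to wait for it after the restart. The following is a polling sketch; `wait_for_cluster_version` is a hypothetical helper that runs whatever command it is given and looks for the `etcdcluster` field in its output.

```sh
#!/bin/sh
# Poll until the cluster version reported by a command starts with the
# expected MAJOR.MINOR, or give up after a number of attempts.
#   $1 expected cluster version prefix, e.g. 3.3
#   $2 maximum number of attempts
#   $3 seconds to sleep between attempts
#   remaining arguments: command that prints the /version JSON
wait_for_cluster_version() {
    expected=$1
    attempts=$2
    delay=$3
    shift 3
    i=0
    current=
    while [ "$i" -lt "$attempts" ]; do
        current=$("$@" 2>/dev/null | sed -n 's/.*"etcdcluster":"\([^"]*\)".*/\1/p')
        case $current in
            "$expected".*) echo "cluster version is $current"; return 0 ;;
        esac
        i=$(( i + 1 ))
        sleep "$delay"
    done
    echo "cluster version still '$current' after $attempts attempts" >&2
    return 1
}
```

A possible call is `wait_for_cluster_version 3.3 30 2 curl -sk --cert /etc/etcd/server.crt --key /etc/etcd/server.key https://10.x.x.x:2379/version`, which retries every 2 seconds for up to 30 attempts.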
Finally, check the cluster state once more.
Check the cluster health
etcdctl \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/server.crt \
--key-file=/etc/etcd/server.key \
--endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
cluster-health
Check the endpoint status
ETCDCTL_API=3 etcdctl \
--cacert=/etc/etcd/ca.crt \
--cert=/etc/etcd/server.crt \
--key=/etc/etcd/server.key \
--endpoints=https://10.x.x.x:2379,https://10.x.x.x:2379,https://10.x.x.x:2379 \
--write-out=table \
endpoint status
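The upgrade is complete.
Rollback procedure
1. Stop all etcd services.
2. Restore the data directory.
3. Start the old version of etcd.
4. Verify the cluster status.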
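Step 2 of the rollback can be sketched as follows, assuming the data directory backup taken before the upgrade is still available and etcd has already been stopped on all members. `restore_data_dir` is a hypothetical helper; it keeps the post-upgrade directory under a new name instead of deleting it. Any data written after the backup was taken is not part of the restored directory.

```sh
#!/bin/sh
# Restore the etcd data directory from a backup copy.
#   $1 live data directory (e.g. /var/lib/etcd)
#   $2 backup directory created before the upgrade
# The current live directory is renamed with a .failed-upgrade suffix
# instead of being deleted, then the backup is copied into place.
restore_data_dir() {
    live=$1
    backup=$2
    [ -d "$backup" ] || { echo "backup not found: $backup" >&2; return 1; }
    if [ -e "$live" ]; then
        [ ! -e "$live.failed-upgrade" ] \
            || { echo "already exists: $live.failed-upgrade" >&2; return 1; }
        mv "$live" "$live.failed-upgrade" || return 1
    fi
    cp -a "$backup" "$live"
}
```

After the restore, the image tag in the pod definition is set back to the old version before the manifest is moved into the pods directory again, and the health checks from earlier in the article are repeated.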