k8s Cluster Troubleshooting


Author: 六弦极品 | Published: 2024-03-21 14:18

I. calico pod fails to start on a node

1. Symptom:



Use kubectl describe to see the startup error:

# kubectl describe pod calico-node-r42fc -n kube-system
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.10.4,10.51.10.5
  Warning  Unhealthy  24s (x196 over 29m)  kubelet  (combined from similar events): Readiness probe failed: 2024-03-22 02:39:47.813 [INFO][7095] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.10.4,10.51.10.5
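
The readiness message names the peers that BIRD could not reach. A minimal sketch for pulling those peer addresses out of the describe output is shown here; the function name unestablished_peers is a hypothetical helper, not part of kubectl or calico.

```shell
# Hypothetical helper: read probe/event text on stdin and print each BGP peer
# named in a "BGP not established with ..." message, one per line, de-duplicated.
unestablished_peers() {
  grep -o 'BGP not established with [0-9.,]*' \
    | sed 's/^BGP not established with //' \
    | tr ',' '\n' \
    | sed '/^$/d' \
    | sort -u
}

# Usage against a live cluster (pod name taken from this article):
# kubectl describe pod calico-node-r42fc -n kube-system | unestablished_peers
```

The resulting list can be compared with the node addresses to see which side of each peering is misconfigured.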

2. Investigation
If you run kubectl exec against calico-node-r42fc from a node other than the one hosting the failing pod, it fails with "no route to host":

[root@rzbl-middleware01 ~]# kubectl exec -it calico-node-r42fc -n kube-system bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
Error from server: error dialing backend: dial tcp 10.51.10.6:10250: connect: no route to host
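
kubectl exec is proxied by the API server to the kubelet on the pod's node (port 10250 by default), so this error suggests that the API server handling the request could not reach 10.51.10.6. A minimal sketch for extracting that address from the error text follows; backend_from_error is a hypothetical helper, and the nc check in the comments assumes an nc build that supports -z.

```shell
# Hypothetical helper: pull the kubelet address (IP:PORT) out of an
# "error dialing backend" message read on stdin.
backend_from_error() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\):.*/\1/p'
}

# Usage (assumes nc with -z support is installed on the host):
# addr=$(kubectl exec calico-node-r42fc -n kube-system -- true 2>&1 | backend_from_error)
# nc -zv "${addr%:*}" "${addr#*:}"
```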

Instead, exec into the pod from the host where the failing pod runs:


[root@rzbl-middleware03 ~]# kubectl  exec -it calico-node-r42fc -n kube-system bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
[root@rzbl-middleware03 /]# 

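Inside the failing pod, open the bird configuration file. The router id is 172.21.0.1, which appears to be the address of a container bridge interface (br-e170164834a4, as the commands below show). It should instead be the address of the ens192 interface, 10.51.10.6: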

## Check the "router id" parameter in /etc/calico/confd/config/bird.cfg inside the failing pod
[root@rzbl-middleware03 /]# cat /etc/calico/confd/config/bird.cfg 
function apply_communities ()
{
}

# Generated by confd
include "bird_aggr.cfg";
include "bird_ipam.cfg";

router id 172.21.0.1;
...


## Find the interface that owns the IP 172.21.0.1
[root@rzbl-middleware03 ~]# ip a |grep 172.21.0.1
    inet 172.21.0.1/16 brd 172.21.255.255 scope global br-e170164834a4

## Check the IP assigned to interface ens192
[root@rzbl-middleware03 ~]# ifconfig ens192
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.51.10.6  netmask 255.255.255.0  broadcast 10.51.10.255
        ether 00:50:56:a4:67:b6  txqueuelen 1000  (Ethernet)
        RX packets 1032206644  bytes 168848840934 (157.2 GiB)
        RX errors 0  dropped 418  overruns 0  frame 0
        TX packets 1186182486  bytes 206549006053 (192.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
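
The comparison above (router id in bird.cfg versus the address of the expected interface) can be scripted. The sketch below is one way to do it; router_id and iface_ipv4 are hypothetical helper names, and the interface name ens192 is the one from this article.

```shell
# Print the "router id" value from a bird.cfg read on stdin.
router_id() {
  awk '$1 == "router" && $2 == "id" { sub(/;$/, "", $3); print $3; exit }'
}

# Print the first IPv4 address found in `ip -4 -o addr show dev IFACE`
# (or plain `ip a`) output read on stdin.
iface_ipv4() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "inet") { split($(i + 1), a, "/"); print a[1]; exit } }'
}

# Usage on the affected node:
# cfg_id=$(kubectl exec calico-node-r42fc -n kube-system -c calico-node -- \
#            cat /etc/calico/confd/config/bird.cfg | router_id)
# host_ip=$(ip -4 -o addr show dev ens192 | iface_ipv4)
# [ "$cfg_id" = "$host_ip" ] || echo "router id $cfg_id does not match ens192 ($host_ip)"
```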

Run the following on the master node:

[root@rzbl-middleware01 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.51.10.5   | node-to-node mesh | up    | 11:01:56 | Established |
| 172.21.0.1   | node-to-node mesh | start | 02:09:41 | Passive     |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.
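
In the table above, the peer 172.21.0.1 is stuck in the Passive state. A small filter such as the following can list every peer whose INFO column is not Established; down_peers is a hypothetical helper that assumes the table layout printed by this version of calicoctl.

```shell
# Hypothetical helper: read `calicoctl node status` output on stdin and print
# "PEER INFO" for every peer row whose INFO column is not "Established".
down_peers() {
  awk -F'|' 'NF >= 6 {
    peer = $2; info = $6
    gsub(/ /, "", peer)          # strip padding (turns the header into PEERADDRESS)
    gsub(/^ +| +$/, "", info)    # trim leading and trailing spaces
    if (peer != "PEERADDRESS" && info != "Established") print peer, info
  }'
}

# Usage:
# calicoctl node status | down_peers
```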

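Taken together, the evidence points to calico on this node detecting the wrong network interface for BGP: it picked the bridge address 172.21.0.1 instead of 10.51.10.6 on ens192.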

calicoctl releases can be downloaded from https://github.com/projectcalico/calicoctl/releases/ and installed as follows:

cd /usr/local/src
wget https://github.com/projectcalico/calicoctl/releases/download/v3.20.6/calicoctl
chmod +x calicoctl
mv calicoctl /usr/sbin/

3. Fix
On the node hosting the failing pod, remove the bridge interface br-e170164834a4 and clear the CNI state:

ifconfig br-e170164834a4 down
ip link delete br-e170164834a4
rm -rf /var/lib/cni
rm -f /etc/cni/net.d/*
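
Before deleting the bridge, it may be worth checking what created it. Docker names the bridge of a user-defined network br- followed by the first 12 characters of the network ID, so if this bridge came from Docker (an assumption; the article does not say), the owning network can be looked up as sketched below. bridge_to_network_prefix is a hypothetical helper.

```shell
# Hypothetical helper: turn a Docker-style bridge name (br-<12 chars>)
# into the network ID prefix; fails for names that do not fit the pattern.
bridge_to_network_prefix() {
  case "$1" in
    br-????????????) printf '%s\n' "${1#br-}" ;;
    *) return 1 ;;
  esac
}

# Usage (assumes the docker CLI is available on the node):
# docker network ls --filter "id=$(bridge_to_network_prefix br-e170164834a4)"
```

If a Docker network still owns the bridge, Docker may recreate the interface later, so removing that network (or pinning calico's interface detection) is the more durable fix.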

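Add an environment variable to the calico-node DaemonSet so that calico detects its IP from the ens* interfaces. The interface value is treated as a regular expression, so ens.* is the stricter way to write this pattern.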

[root@rzbl-middleware01 ~]# kubectl edit daemonsets.apps calico-node -n kube-system
...
spec:
  template:
    spec:
      containers:
      - env:
        - name: IP_AUTODETECTION_METHOD
          value: interface=ens*
...
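
The same change can be applied without opening an editor. The sketch below wraps kubectl set env and kubectl rollout status in a hypothetical function, set_autodetection, and adds a rough local check of which interface names a pattern selects. The check assumes calico applies the pattern as an unanchored regular expression and approximates that with grep -E.

```shell
# Rough local check: does interface NAME ($2) match PATTERN ($1)?
# Assumption: unanchored regex matching, approximated here with grep -E.
iface_selected() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Hypothetical wrapper: set the autodetection pattern on the DaemonSet
# and wait for the calico-node pods to roll out.
set_autodetection() {
  kubectl set env daemonset/calico-node -n kube-system \
    "IP_AUTODETECTION_METHOD=interface=$1"
  kubectl rollout status daemonset/calico-node -n kube-system
}

# Usage:
# iface_selected 'ens.*' ens192 && set_autodetection 'ens.*'
```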

Check that the calico-node pods have restarted and are running:

[root@rzbl-middleware01 ~]# kubectl  get pod -n kube-system      
NAME                                        READY   STATUS             RESTARTS         AGE
calico-node-4p8tr                           1/1     Running            1 (7m26s ago)    8m32s
calico-node-8tv4k                           1/1     Running            1 (9m42s ago)    10m
calico-node-zpqbg                           1/1     Running            0                10m
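
On a larger cluster, scanning the READY column by eye gets tedious. A minimal filter such as the following prints only the pods whose ready count differs from the container count; not_ready_pods is a hypothetical helper, and the k8s-app=calico-node label in the usage line is an assumption based on the standard calico manifests.

```shell
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# names of pods whose READY column (ready/total) is not fully ready.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage:
# kubectl get pod -n kube-system -l k8s-app=calico-node | not_ready_pods
```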

Check the calico node status again with calicoctl; both peers are now Established:

[root@rzbl-middleware01 ~]# calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.51.10.5   | node-to-node mesh | up    | 06:07:29 | Established |
| 10.51.10.6   | node-to-node mesh | up    | 06:05:04 | Established |
+--------------+-------------------+-------+----------+-------------+
