
Analysis of svc access failures from inside a pod

Author: cloudFans | Published 2021-12-29 08:55

    Pods cannot reach a svc from inside.

    Environment:
    3 masters, 2 workers

    Each node has two NICs:

    node eth0: the default route sits on eth0; this is the k8s management network. Node access to a svc, pod access to a svc via the node, and pod replies to the node all go through eth0.

    pod eth1: pod-to-pod traffic goes through eth1's gateway.
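
    This split can be spot-checked per destination with ip route get (the addresses below are the ones from this environment):

    ip route show                # the default route is expected on eth0
    ip route get 172.33.2.17     # a pod IP is expected to resolve via the eth1 gateway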

    Symptoms:

    svc info

    
    [root@(l2)k8s-master-1 ~]# kubectl get svc | grep srvclb-ngnx
    srvclb-ngnx   LoadBalancer   10.111.240.224   <pending>     80:31288/TCP   23h
    
    [root@(l2)k8s-master-1 ~]# ipvsadm -ln | grep -A 2 10.111.240.224
    TCP  10.111.240.224:80 rr
      -> 172.33.1.255:80              Masq    1      0          0         
      -> 172.33.2.17:80               Masq    1      0          0    
    
    The backends are two nginx web pods
    
    [root@(l2)k8s-master-1 ~]#kubectl get pod -A -o wide| grep -E "172.33.1.255|172.33.2.17|172.33.2.4"
    default         loadbalancer-5554b69d95-clgjd                 1/1     Running   0          22h     172.33.1.255   k8s-worker-3  
    default         loadbalancer-5554b69d95-tt99x                 1/1     Running   0          17h     172.33.2.17    k8s-worker-1  
    default         sshd-k8s-master-1                                      1/1     Running   0          20h     172.33.2.4     k8s-master-1
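    ## cross-check: the ipvs real servers above should match the svc's endpoints,
    ## e.g. kubectl get endpoints srvclb-ngnx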
    
    sshd-k8s-master-1 is the pod initiating the test
    
    

    Packet capture on the node hosting the initiating pod

    # initiating side
    ## keep a telnet session open inside the pod
    
    [root@sshd-k8s-master-1 /]# telnet 10.111.240.224 80
    Trying 10.111.240.224...
    
    
    
    
    
    # capture on the node

    ## MAC address reference
    [root@(l2)k8s-master-1 env-test]# ansible all -i inventory/inventory.ini -m shell -a "ip a | grep -i -C 2 -E '00:00:00:fa:f1:34|00:00:00:b2:8f:1b'"
    k8s-master-1 | CHANGED | rc=0 >>
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:00:00:b2:8f:1b brd ff:ff:ff:ff:ff:ff
    4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:00:00:9c:1c:c7 brd ff:ff:ff:ff:ff:ff
    --
           valid_lft forever preferred_lft forever
    10: ipvl_3@eth1: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default 
        link/ether 00:00:00:b2:8f:1b brd ff:ff:ff:ff:ff:ff
        inet 172.33.192.10/32 scope host ipvl_3
           valid_lft forever preferred_lft forever
    
    k8s-worker-1 | CHANGED | rc=0 >>
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:00:00:fa:f1:34 brd ff:ff:ff:ff:ff:ff
    4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:00:00:0e:78:83 brd ff:ff:ff:ff:ff:ff
    --
           valid_lft forever preferred_lft forever
    10: ipvl_3@eth1: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default 
        link/ether 00:00:00:fa:f1:34 brd ff:ff:ff:ff:ff:ff
        inet 172.33.192.15/32 scope host ipvl_3
           valid_lft forever preferred_lft forever
    
    
     tcpdump -i any host 172.33.2.4 or 10.111.240.224 or 172.33.1.255 or 172.33.2.17  -netvv
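    ## flags: -n no name resolution, -e print link-layer (MAC) headers,
    ## -t no timestamps, -vv very verbose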
    
    
    ## outbound: the pod sends to the svc cluster IP, and the packet leaves via the eth1 NIC
    
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20122, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x66d3), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450394791 ecr 0,nop,wscale 7], length 0
    
    
    ## odd: the backend pod's reply already shows up here; the initiator's packet should have reached the backend first
    
     In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0x5944), seq 3684244595, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076700321 ecr 2450394791,nop,wscale 7], length 0
    
    ## after ipvs maps the frontend to the backend
    #### only the MAC changed
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
        172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
    
    ## outbound
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20123, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x62df), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450395803 ecr 0,nop,wscale 7], length 0
    
    #### Why are the two pods talking to each other directly, without the cluster IP in between?
    #### The client connected to 10.111.240.224, so a SYN-ACK arriving straight from 172.33.2.17 matches no socket, and the kernel answers it with the RSTs seen here.
    
     In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0x29bf), seq 3700048671, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076701333 ecr 2450395803,nop,wscale 7], length 0
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
        172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20124, offset 0, flags [DF], proto TCP (6), length 60)
    
    #### repeated attempts
    
        172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x5adf), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450397851 ecr 0,nop,wscale 7], length 0
     In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0xcc70), seq 3732049541, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076703381 ecr 2450397851,nop,wscale 7], length 0
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
        172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.4 tell 172.33.2.17, length 28
    
    # The round trip looks fine, so why send ARP?
    
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.4 is-at 00:00:00:b2:8f:1b, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.17 tell 172.33.2.4, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 10.111.240.224 tell 172.33.2.4, length 28
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.17 is-at 00:00:00:fa:f1:34, length 28
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 10.111.240.224 is-at 00:00:00:fa:f1:34, length 28
    
    

    Cause: kube-proxy did not have masquerade enabled. Without it, packets a pod sends through ipvs are not masqueraded to eth0's IP and MAC; only the MAC is rewritten. Because eth1 cannot forward outward in ipvlan mode, the packet leaves via eth0, i.e. eth0 emits a packet whose source IP and source MAC are both not its own, which is what breaks things.

    On multi-NIC nodes using macvlan, ipvlan, or kube-ovn, where eth0 is the k8s management NIC (which svc traffic depends on), this mode must be enabled. A minimal way to check the effective setting is sketched below.
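
    Assuming kube-proxy is deployed with its ConfigMap, as kubeadm/kubespray clusters do:

    # inspect the live kube-proxy configuration for the masquerade setting
    kubectl -n kube-system get cm kube-proxy -o yaml | grep -i masquerade
    # after enabling full masquerade this should show: masqueradeAll: true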

    For comparison, after enabling masquerade:

    [root@(l2)k8s-master-1 ~]# grep masquerade  -r /etc/kubernetes/
    /etc/kubernetes/kubeadm-config.yaml:  masqueradeAll: True
    
    
    

    After pushing full masquerade via kubespray, the change did not take effect for the old LB, so the test cluster was simply rebuilt.

    svc info

    
    [root@(l2)k8s-master-1 ~]# kubectl get svc | grep srvclb-ngnx
    default       srvclb-ngnx  LoadBalancer  10.105.106.250   172.32.1.6   80:32244/TCP  17m   app=hello,tier=frontend
    
    [root@(l2)k8s-master-1 ~]# ipvsadm -ln | grep -A 2 10.105.106.250
    TCP  10.105.106.250:80 rr
      -> 172.33.2.17:80               Masq    1      0          0         
      -> 172.33.2.18:80               Masq    1      0          0    
    
    The backends are two nginx web pods
    
    [root@(l2)k8s-master-1 ~]# kubectl get pod -A -o wide| grep -E "172.33.2.17|172.33.2.18|172.33.2.7"
    default               loadbalancer-5554b69d95-tp778  1/1     Running    47m    172.33.2.18    k8s-worker-1
    default               loadbalancer-5554b69d95-wsk8k 1/1     Running    47m    172.33.2.17    k8s-worker-3
    default               sshd-k8s-master-1                        1/1     Running    88m    172.33.2.7     k8s-master-1
    
    sshd-k8s-master-1 is the pod initiating the test
    
    sh-4.2# telnet 10.105.106.250 80
    Trying 10.105.106.250...
    Connected to 10.105.106.250.
    Escape character is '^]'.
    ^]
    
    
    
    
    
    ## capture on the node hosting the pod
    
    Packets captured while telnet succeeds
    
    [root@(l2)k8s-master-1 ~]# tcpdump -i any host 10.105.106.250  or 172.33.2.7 or 172.33.2.17 or  172.33.2.18  -netvv
    
    
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 42606, offset 0, flags [DF], proto TCP (6), length 60)
        172.33.2.7.53432 > 10.105.106.250.http: Flags [S], cksum 0x23ba (incorrect -> 0xfed3), seq 292962656, win 65280, options [mss 1360,sackOK,TS val 582082675 ecr 0,nop,wscale 7], length 0
     In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        10.105.106.250.http > 172.33.2.7.53432: Flags [S.], cksum 0xd8ba (correct), seq 4218169578, ack 292962657, win 64704, options [mss 1360,sackOK,TS val 2603772094 ecr 582082675,nop,wscale 7], length 0
    Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 68: (tos 0x10, ttl 64, id 42607, offset 0, flags [DF], proto TCP (6), length 52)
        172.33.2.7.53432 > 10.105.106.250.http: Flags [.], cksum 0x23b2 (incorrect -> 0x01e5), seq 1, ack 1, win 510, options [nop,nop,TS val 582082676 ecr 2603772094], length 0
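    ## note: unlike the failing capture, the SYN-ACK above arrives with the
    ## cluster IP 10.105.106.250 as its source (ipvs un-NATs the reply), so
    ## the handshake completes instead of being reset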
    
    # ARP-related packets below
    
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
      P 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.0.1 tell 172.33.2.18, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 10.105.106.250 tell 172.33.2.7, length 28
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 10.105.106.250 is-at 00:00:00:fa:f1:34, length 28
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
     In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
    Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
    

    The primary problem, pods intermittently failing to reach a svc, shows up in two scenarios:

    1. The svc backends are host-network pods, e.g. curl -k https://kubernetes:443/livez?verbose

    2. The svc backends are pods whose IPs share a subnet with the node's eth1, e.g. a self-built svc

    Scenario 1: with kube-proxy not in full-masquerade mode, running curl -k https://kubernetes:443/livez?verbose inside a pod succeeds only occasionally and usually fails, while the same request from the node always works.

    With three backends behind the kubernetes svc, the success rate from inside a pod is 1/3, while from the node it is 100%.

    Cause: dual gateways plus no masquerading. A pod reaching the node behind the svc goes through eth0's gateway, but the reply is returned straight to the pod. When the node replies to the pod, the gateway's forwarding is erratic; the packet appears to be handed to a random node, so the pod only occasionally receives it.

    In short: node-to-pod traffic is forwarded across gateways, and that forwarding is unstable.

    After enabling full masquerade, curl -k https://kubernetes:443/livez?verbose from inside a pod succeeds 100% of the time. The only remaining issue is a slow first request; with localdns caching, subsequent requests are fast.

    Precondition: with kube-proxy in full-masquerade mode, run the scenario 2 test against a self-built svc.

    Normally, once a pod is created, whether the node can ping the pod is hit-or-miss, but the pod can always ping the node.
    For a while after the pod pings the node, the node can also ping the pod 100% of the time; during that window the gateway knows where the pod is and forwards correctly. A simple loop, shown below, makes this visible.
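
    Sketch (the pod IP is taken from this environment):

    # run on a node: count how often the pod answers within 1 second
    for i in $(seq 1 20); do
        ping -c 1 -W 1 172.33.2.17 >/dev/null 2>&1 && echo "$i ok" || echo "$i fail"
    done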

    Case 1: when the node cannot ping the pod, pod access to the svc is probabilistic, behaving almost exactly like kube-proxy without full masquerade.

    With two backends behind the custom svc, the success rate from inside a pod is 1/2, and from the node it is 0.

    Cause:

    Tracing the conntrack table shows:

    When the pod pings the svc, no new conntrack entry is created, i.e. no connection is being established via the cluster IP.

    When the node pings the svc, a new conntrack entry is created, but since the node cannot ping the pod, access still fails.
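
    The table can be watched directly during these tests; filtering on the cluster IP keeps the output manageable (requires conntrack-tools):

    # list current entries destined to the cluster IP
    conntrack -L -d 10.105.106.250
    # or follow new entries live while repeating the test
    conntrack -E -d 10.105.106.250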

    Case 2: when the node can ping the pod, pod access to the svc succeeds 100% of the time.

    Keep the pods pinging the node:


    At this point there are two ICMP conntrack entries: the svc's two backend pods are kept pinging master1.


    Now, accessing the cluster IP from master1 succeeds 100% of the time.

    Wait for the node-side conntrack entries to expire, then test access from the pod again.

    Test pod access to the custom svc:

    The success rate stays at 1/2, yet no new conntrack entries appear at all; pod-to-svc traffic is apparently not going through kube-proxy. The ipvs counters, checked below, point the same way.
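
    If the per-service packet counters stay flat while the pod keeps hitting the svc, the traffic never reached ipvs:

    # per-virtual-server traffic counters (run on the pod's node)
    ipvsadm -Ln --stats | grep -A 2 10.105.106.250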

    Fix: remove the IP from the eth1 NIC, and the problem disappears completely: access from inside the pod succeeds 100%, and node access creates new conntrack entries.

    Follow-up tests

    If the custom svc has only one backend pod, pod access to the svc always succeeds.

    Continuously watching the ARP table, the MAC for a given IP is refreshed at a visibly fast rate, but there are no conflicts or mix-ups; each IP always maps to the same MAC. One way to watch the churn is shown below.
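
    Sketch (device name as in this environment):

    # re-list the neighbor/ARP entries seen via eth1 once per second
    watch -n 1 'ip neigh show dev eth1'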


    Packet captures show that the ARP refreshes are triggered by requests sent from the local eth1 NIC.


    The ARP broadcasts clearly originate from eth1.
    Yet as the ipvlan master device, eth1 can reach neither the outside world nor the local pods.


    In other words, this NIC has no connectivity at all, yet it still triggers ARP refreshes.

    Fix: remove the IP and routes from this NIC, effectively disabling it.

    While disabling the NIC, all eth1-related ARP entries can be seen being cleared:

    running ip addr flush dev eth1 wipes all of eth1's ARP entries,
    and pod access to a svc with multiple backend pods also returns to answering 100% normally.
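
    As a sketch, the full disable step on each affected node (device name as in this environment):

    ip addr flush dev eth1     # remove all IPs from eth1; its ARP entries go with them
    ip route flush dev eth1    # drop any remaining routes pointing at eth1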

    Reference: ipvs https://blog.dianduidian.com/post/lvs-snat%E5%8E%9F%E7%90%86%E5%88%86%E6%9E%90/
