Docker Swarm (3): overlay and docker_

Author: 小耸 | Published: 2020-03-08 22:14

    This article uses hands-on experiments to explain the overlay and docker_gwbridge networks in Docker Swarm.

    Setting up the test environment

    First, build a Docker Swarm cluster from two physical machines (for the procedure, see 《docker swarm(一): 入门,搭建一个简单的swarm集群》):

    $ docker node ls
    ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
    43k0p9fnwu9dhsyr0n6utfynn *   ubuntu              Ready               Active              Leader              19.03.5
    gorkh8cb5ylb7szzbbrp2sheu     ubuntu-2            Ready               Active                                  19.03.5
    

    Create an overlay network:

    $ docker network create -d overlay --attachable --subnet 10.200.0.0/16 overlay_test
    

    The Docker networks that now exist are:

    $ docker network ls
    NETWORK ID          NAME                DRIVER              SCOPE
    a473a52d686d        bridge              bridge              local
    5e1880193fbf        docker_gwbridge     bridge              local
    62ba25167374        host                host                local
    jjyg85t5ta3k        ingress             overlay             swarm
    d056684646b3        none                null                local
    hxyiridl2b9r        overlay_test        overlay             swarm
    

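    Two of these networks are of interest here:

    • overlay_test: the overlay network, which carries east-west traffic between containers.
    • docker_gwbridge: the network through which containers send and receive north-south packets.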

    Preparing the tools

    Docker isolates each container's network stack in its own network namespace. We first prepare a script that runs a given network command inside a chosen namespace.

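    #!/bin/bash
    # Run a command inside a Docker network namespace.
    # With no arguments, list the namespaces Docker has created.
    NAMESPACE=$1
    if [[ -z $NAMESPACE ]]; then
        ls -1 /var/run/docker/netns/
        exit 0
    fi
    # Treat the argument as a namespace name first ...
    NAMESPACE_FILE=/var/run/docker/netns/${NAMESPACE}
    if [[ ! -f $NAMESPACE_FILE ]]; then
        # ... otherwise as a container name or ID.
        NAMESPACE_FILE=$(docker inspect -f "{{.NetworkSettings.SandboxKey}}" "$NAMESPACE" 2>/dev/null)
    fi
    if [[ ! -f $NAMESPACE_FILE ]]; then
        echo "Cannot open network namespace '$NAMESPACE': No such file or directory"
        exit 1
    fi
    shift
    if [[ $# -lt 1 ]]; then
        echo "No command specified"
        exit 1
    fi
    # Quote "$@" so arguments containing spaces are passed through intact.
    nsenter --net="${NAMESPACE_FILE}" "$@"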
    

    Run without arguments, it lists the available namespaces:

    $ sudo ./docker_netns.sh 
    1-k2rx924tgr
    eab3f856fe9a
    ingress_sbox
    

    It can also run a command inside a specific namespace:

    $ sudo ./docker_netns.sh eab3f856fe9a ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    170: eth0@if171: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default 
        link/ether 02:42:0a:00:00:54 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    172: eth1@if173: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
        link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    

    The second tool is find_links.sh:

    #!/bin/bash
    # Find which Docker network namespace holds the interface with a given ifindex.
    # With no argument, dump the links of every namespace.
    DOCKER_NETNS_SCRIPT=./docker_netns.sh
    IFINDEX=$1
    if [[ -z $IFINDEX ]]; then
        for namespace in $($DOCKER_NETNS_SCRIPT); do
            printf "\e[1;31m%s: \e[0m\n" "$namespace"
            $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link
            printf "\n"
        done
    else
        for namespace in $($DOCKER_NETNS_SCRIPT); do
            # ip -o prints one link per line, starting with "<ifindex>: "
            if $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -Pq "^$IFINDEX: "; then
                printf "\e[1;31m%s: \e[0m\n" "$namespace"
                $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -P "^$IFINDEX: "
                printf "\n"
            fi
        done
    fi
    

    Given an ifindex, this script finds the namespace that contains the interface:

    $ sudo ./find_links.sh 60
    1-hxyiridl2b: 
    60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \    link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    

    Analyzing the network structure

    The experiments below examine the overlay network and the docker_gwbridge network.

    Start a container on each of the two nodes:

    $ docker run -d --name busybox --net overlay_test busybox sleep 36000
    

    Inspect the network interfaces from inside the container:

    $ docker exec busybox ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue 
        link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
        inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
           valid_lft forever preferred_lft forever
    61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue 
        link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
        inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
           valid_lft forever preferred_lft forever
    
    

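    Besides the loopback interface, the container has two interfaces. 10.200.0.2/16 is the IP address of busybox's interface on the overlay_test network, and 172.18.0.3/16 is the IP address of its interface on the docker_gwbridge network.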
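
    A side observation from the output above: the last four bytes of each interface's MAC address spell out its IPv4 address in hex (02:42:0a:c8:00:02 for 10.200.0.2, 02:42:ac:12:00:03 for 172.18.0.3). The sketch below decodes that pattern. The function name mac_to_ip is made up for illustration, and the pattern is only what this output shows, not a documented guarantee.

```shell
# Decode the IPv4 address embedded in the last four bytes of a MAC address.
# Assumption: the MAC follows the 02:42:<ip in hex> pattern seen in the output above.
mac_to_ip() {
    local mac=$1
    local IFS=:
    # Split the MAC on ':' into $1..$6 (intentional word splitting).
    set -- $mac
    # printf %d accepts 0x-prefixed hex constants.
    printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
}

mac_to_ip 02:42:0a:c8:00:02   # 10.200.0.2
mac_to_ip 02:42:ac:12:00:03   # 172.18.0.3
```

    This can be handy later in the walkthrough, when only MAC addresses appear in the bridge forwarding table.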

    This is the container network as we have seen it so far. We know only the addresses, not yet how packets travel between them. (192.168.154.2 is the host's gateway.)

    [Figure: step 0]

    North-south traffic

    From inside the container, trace the route to an external IP:

    $ docker exec busybox traceroute baidu.com
    traceroute to baidu.com (220.181.38.148), 30 hops max, 46 byte packets
     1  bogon (172.18.0.1)  0.003 ms  0.004 ms  0.006 ms
     2  bogon (192.168.154.2)  0.148 ms  0.330 ms  0.175 ms
     ...
    

    The traffic passes through 172.18.0.1 and then reaches the host's gateway.

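    Next, we work out the internal connections. As seen above, from inside the container the interface holding 172.18.0.3 is 61: eth1@if62. This means the interface has ifindex 61 and is one end of a veth pair whose other end has ifindex 62.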
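
    The name@ifN notation can also be read mechanically. The sketch below pulls both indexes out of one line of ip link or ip addr output. own_ifindex and peer_ifindex are illustrative names, and the parsing assumes uncolored output (no -c flag).

```shell
# Extract the interface's own ifindex: everything before the first ':'.
own_ifindex() {
    printf '%s\n' "${1%%:*}"
}

# Extract the peer ifindex from the "@ifN" suffix; prints nothing if there is none.
peer_ifindex() {
    printf '%s\n' "$1" | sed -n 's/^[0-9]*: [^@]*@if\([0-9]*\):.*/\1/p'
}

line='61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue'
own_ifindex "$line"    # 61
peer_ifindex "$line"   # 62
```

    On Linux the peer index is normally also readable from /sys/class/net/eth1/iflink inside the container, which avoids parsing text.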

    Look up which namespace interface 62 lives in:

    $ sudo ./find_links.sh 62
    
    

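    Nothing is printed. find_links.sh only searches the namespaces under /var/run/docker/netns/, so this indicates that interface 62 lives in the host's root namespace. Check on the host: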

    $ ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
        link/ether 00:0c:29:e5:66:45 brd ff:ff:ff:ff:ff:ff
        inet 192.168.154.135/24 brd 192.168.154.255 scope global dynamic noprefixroute ens33
           valid_lft 1502sec preferred_lft 1502sec
        inet6 fe80::f378:1d3:6cde:69bb/64 scope link noprefixroute 
           valid_lft forever preferred_lft forever
    3: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
        link/ether 02:42:50:e9:2d:e1 brd ff:ff:ff:ff:ff:ff
        inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
           valid_lft forever preferred_lft forever
        inet6 fe80::42:50ff:fee9:2de1/64 scope link 
           valid_lft forever preferred_lft forever
    4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
        link/ether 02:42:5d:cd:c3:16 brd ff:ff:ff:ff:ff:ff
        inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
           valid_lft forever preferred_lft forever
    23: veth6ee82c3@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default 
        link/ether 4a:71:4d:f7:0e:4e brd ff:ff:ff:ff:ff:ff link-netnsid 1
        inet6 fe80::4871:4dff:fef7:e4e/64 scope link 
           valid_lft forever preferred_lft forever
    62: veth0204500@if61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default 
        link/ether 9e:d6:10:49:8e:42 brd ff:ff:ff:ff:ff:ff link-netnsid 4
        inet6 fe80::9cd6:10ff:fe49:8e42/64 scope link 
           valid_lft forever preferred_lft forever
    
    

    Interface 62 has docker_gwbridge as its master; in other words, it is attached to the docker_gwbridge bridge.

    North-south traffic is also NATed as it leaves the host:

    $ sudo iptables-save -t nat  | grep -- '-A POSTROUTING'
    -A POSTROUTING -o docker_gwbridge -m addrtype --src-type LOCAL -j MASQUERADE
    -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
    -A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
    

    The path of north-south traffic is now clear, and the topology diagram can be updated to:

    [Figure: step 1]
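
    To see which of the POSTROUTING rules listed earlier applies to the container's outbound traffic, the sketch below checks a source address against each rule's -s subnet. ip_to_int and in_cidr are hypothetical helpers, not part of iptables, and they model only the source match, not the "! -o" interface condition.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    # Split the address on '.' into $1..$4 (intentional word splitting).
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside the CIDR block $2 (e.g. 172.18.0.0/16).
in_cidr() {
    local ip=$1
    local net=${2%/*}
    local bits=${2#*/}
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# The container's docker_gwbridge address against the two source-based rules.
for subnet in 172.17.0.0/16 172.18.0.0/16; do
    if in_cidr 172.18.0.3 "$subnet"; then
        echo "172.18.0.3 matches -s $subnet"
    fi
done
```

    Only the 172.18.0.0/16 rule matches, so packets from the container that leave through any interface other than docker_gwbridge are masqueraded to the host's address.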

    East-west traffic

    East-west traffic is traffic between containers. First, test connectivity between the containers:

    $ docker exec busybox ping 10.200.0.2
    PING 10.200.0.2 (10.200.0.2): 56 data bytes
    64 bytes from 10.200.0.2: seq=0 ttl=64 time=41.177 ms
    64 bytes from 10.200.0.2: seq=1 ttl=64 time=1.181 ms
    64 bytes from 10.200.0.2: seq=2 ttl=64 time=1.110 ms
    

    Next, we trace the path this traffic takes. Look at the container's network configuration again:

    $ docker exec busybox ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue 
        link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
        inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
           valid_lft forever preferred_lft forever
    61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue 
        link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
        inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
           valid_lft forever preferred_lft forever
    

    The interface holding 10.200.0.2 is 59: eth0@if60, that is, ifindex 59, paired with the interface whose ifindex is 60. Find the namespace that contains interface 60:

    $ sudo ./find_links.sh 60
    1-hxyiridl2b: 
    60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \    link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    

    Interface 60 is in the namespace 1-hxyiridl2b. List the addresses in that namespace:

    $ sudo ./docker_netns.sh 1-hxyiridl2b ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
        link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff
        inet 10.200.0.1/16 brd 10.200.255.255 scope global br0
           valid_lft forever preferred_lft forever
    56: vxlan0@if56: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN group default 
        link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    58: veth0@if57: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default 
        link/ether ea:c1:db:d4:b1:83 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default 
        link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    
    

    This namespace contains a VXLAN interface, vxlan0. The Docker overlay network uses this VXLAN tunnel to communicate with containers on other nodes.

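    Although the two containers communicate through a VXLAN tunnel, they are unaware of it; from their point of view they simply sit on the same layer-2 network. The vxlan interface encapsulates each layer-2 frame in the payload of a UDP packet and sends it to the peer, where the peer's vxlan interface decapsulates it.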
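
    The encapsulation also accounts for the MTU of 1450 seen on eth0 and br0, compared with 1500 on eth1: the outer headers use up part of every underlay packet. A minimal sketch of the arithmetic, assuming an IPv4 underlay without IP options; vxlan_inner_mtu is an illustrative name.

```shell
# Compute the largest inner IP packet that fits in one VXLAN-encapsulated
# underlay packet. Assumption: IPv4 underlay, no IP options, no outer VLAN tag.
vxlan_inner_mtu() {
    local underlay_mtu=$1
    local outer_ipv4=20   # outer IPv4 header
    local outer_udp=8     # outer UDP header
    local vxlan_hdr=8     # VXLAN header
    local inner_eth=14    # inner Ethernet header carried inside the tunnel
    echo $(( underlay_mtu - outer_ipv4 - outer_udp - vxlan_hdr - inner_eth ))
}

vxlan_inner_mtu 1500   # 1450
```

    The result matches the mtu 1450 reported for the overlay interfaces in the listings above.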

    Look at the ARP (neighbor) table in namespace 1-hxyiridl2b:

    $ sudo ./docker_netns.sh 1-hxyiridl2b ip neigh
    10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
    10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT
    

    The IP of the container on the remote node, 10.200.0.4, appears in the local ARP table. Looking up this table yields the peer's layer-2 (MAC) address.

    Now see where the VXLAN packets are sent:

    $ sudo ./docker_netns.sh 1-hxyiridl2b bridge fdb
    ...
    02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
    02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
    ...
    

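    This can be read as the VXLAN VTEP table: given a destination MAC address, it yields the outer IP address the VXLAN packet should be sent to, here 192.168.154.136.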
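
    The two lookups chain together: container IP to MAC through the neighbor table, then MAC to remote host through the forwarding table. The sketch below replays that chain over the entries shown above, pasted in as text. vtep_for_ip is a hypothetical helper written for this walkthrough, not something Docker provides.

```shell
# Entries copied from the "ip neigh" and "bridge fdb" output shown above.
neigh='10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT'
fdb='02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent'

# Resolve an overlay IP to the underlay address of the host that owns it.
vtep_for_ip() {
    local ip=$1
    local mac
    # Step 1: neighbor table, IP -> MAC (the field after "lladdr").
    mac=$(printf '%s\n' "$neigh" | awk -v ip="$ip" '$1 == ip { print $5 }')
    [ -n "$mac" ] || return 1
    # Step 2: forwarding table, MAC -> outer destination (the field after "dst").
    printf '%s\n' "$fdb" | awk -v mac="$mac" \
        '$1 == mac { for (i = 2; i < NF; i++) if ($i == "dst") print $(i + 1) }'
}

vtep_for_ip 10.200.0.4   # 192.168.154.136
```

    On a live node the two variables could be filled from docker_netns.sh output instead of pasted text.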

    We can now draw the complete topology for east-west traffic:

    [Figure: step-2]

        Article link: https://www.haomeiwen.com/subject/mnksdhtx.html