Severe packet loss during VM live migration in an OVN environment

Author: LC0127 | Published 2021-10-12 20:53

    Problem description
    During a VM live migration, a continuous ping to the VM loses five or more packets.

    [root@node-4 ~]# ping 172.47.0.21
    PING 172.47.0.21 (172.47.0.21) 56(84) bytes of data.
    ......
    64 bytes from 172.47.0.21: icmp_seq=82 ttl=63 time=0.307 ms
    64 bytes from 172.47.0.21: icmp_seq=83 ttl=63 time=0.340 ms
    64 bytes from 172.47.0.21: icmp_seq=89 ttl=63 time=1.22 ms
    64 bytes from 172.47.0.21: icmp_seq=90 ttl=63 time=0.413 ms
    ......
    --- 172.47.0.21 ping statistics ---
    95 packets transmitted, 90 received, 5% packet loss, time 96206ms
    rtt min/avg/max/mdev = 0.282/0.483/3.467/0.532 ms
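
The loss can be read directly off the icmp_seq numbers in the output above: the replies jump from 83 to 89. A minimal sketch in C of that counting follows; `count_missing_seq` is a hypothetical helper written for this article, not part of ping.

```c
#include <stddef.h>

/* Count echo replies that never arrived, judged by gaps between
 * consecutive icmp_seq values. The array must be in ascending order.
 * count_missing_seq is a hypothetical helper, not part of ping. */
static int
count_missing_seq(const int *seqs, size_t n)
{
    int missing = 0;

    for (size_t i = 1; i < n; i++) {
        int gap = seqs[i] - seqs[i - 1] - 1;
        if (gap > 0) {
            missing += gap;
        }
    }
    return missing;
}
```

Called with the four sequence numbers shown (82, 83, 89, 90) it returns 5, which agrees with the "95 packets transmitted, 90 received" line.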
    

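    Live migration process:

    1. Create the VM's tap device on the destination node.
    2. Once the NIC is up, copy the memory of the VM process.
    3. After the migration completes, the source node deletes the VM's tap device.
    4. Call ovs-vsctl to delete the port record from the ovsdb database on the source node.
    5. Call the neutron client to update the port's binding host to the destination node.

    _post_live_migration
    |- post_live_migration_at_source
      |- unplug_vifs (delete the vif on the source node)
    |- post_live_migration_at_destination
      |- migrate_instance_finish
        |- _update_port_binding_for_instance (call the neutron client to run a port update that changes binding:host_id)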
    

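    Analysis:

    Set the vlog level of the ovn-controller binding module to debug and captured packets. The packet loss falls exactly in the window between the source node releasing the lport and the destination node successfully claiming it.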

    /* Source node */
    2021-10-11T08:00:40.691Z|28327|binding|INFO|Releasing lport 84282c8b-0002-47b1-a5c0-a7947cb795ca from this chassis.
    
    /* Destination node */
    2021-10-11T07:59:35.690Z|31303|binding|INFO|Not claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca, chassis 63c4fb22-b817-4c84-9594-a8c554b8de46 requested-chassis node-3.domain.tld
    2021-10-11T08:00:40.711Z|31304|binding|INFO|Not claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca, chassis 63c4fb22-b817-4c84-9594-a8c554b8de46 requested-chassis node-3.domain.tld
    2021-10-11T08:00:49.621Z|31305|binding|INFO|Claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca for this chassis.
    2021-10-11T08:00:49.621Z|31306|binding|INFO|84282c8b-0002-47b1-a5c0-a7947cb795ca: Claiming fa:16:3e:f8:c5:5d 192.168.222.112
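
The length of the window can be computed from the two timestamps in the logs above ("Releasing lport" on the source, "Claiming lport" on the destination). The following is a minimal sketch in C; `log_time_ms` is a hypothetical helper, it only looks at the time of day, and so it assumes both lines are from the same day and the same time zone.

```c
#include <stdio.h>

/* Parse the time of day from an OVS-style log prefix such as
 * "2021-10-11T08:00:40.691Z|28327|binding|INFO|..." and return it as
 * milliseconds since midnight, or -1 if the prefix does not match.
 * The date is skipped, so only same-day lines can be compared.
 * log_time_ms is a hypothetical helper for this article only. */
static long
log_time_ms(const char *line)
{
    int h, m, s, ms;

    if (sscanf(line, "%*d-%*d-%*dT%d:%d:%d.%dZ", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}
```

Subtracting the value for the release line (08:00:40.691) from the value for the claim line (08:00:49.621) gives 8930 ms, so in this run the lport was bound to neither chassis for roughly 8.9 seconds.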
    
    [Figure: count_pkt_lose.png]

    The part of the ovn-controller code that releases and claims an lport:
    bool
    binding_handle_ovs_interface_changes(struct binding_ctx_in *b_ctx_in,
                                         struct binding_ctx_out *b_ctx_out)
    {
            ...
            const char *iface_id = smap_get(&iface_rec->external_ids, "iface-id");
            const char *old_iface_id = smap_get(b_ctx_out->local_iface_ids,
                                                iface_rec->name);
            const char *cleared_iface_id = NULL;
            if (!ovsrec_interface_is_deleted(iface_rec)) {
                int64_t ofport = iface_rec->n_ofport ? *iface_rec->ofport : 0;
                if (iface_id) {
                    /* Check if iface_id is changed. If so we need to
                     * release the old port binding and associate this
                     * inteface to new port binding. */
                    if (old_iface_id && strcmp(iface_id, old_iface_id)) {
                        cleared_iface_id = old_iface_id;
                    } else if (ofport <= 0) {
                        /* If ofport is <= 0, we need to release the iface if
                         * already claimed. */
                        cleared_iface_id = iface_id;
                    }
                } else if (old_iface_id) {
                    cleared_iface_id = old_iface_id;
                }
            } else {
                cleared_iface_id = iface_id;
            }
    
            if (cleared_iface_id) {
                handled = consider_iface_release(iface_rec, cleared_iface_id,
                                                 b_ctx_in, b_ctx_out);
            }
    

    Debugging the ovn-controller code with gdb showed that ofport was -1. According to the code above, the lport is released whenever ofport <= 0.


    [Figure: debug]
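
To make the branch easier to follow, here is a standalone sketch in C of the release decision from the excerpt above. It is a simplified stand-in, not the OVN function: `iface_id_to_release` and its plain parameters are hypothetical, while the real code reads the same values from OVSDB records.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the branch in
 * binding_handle_ovs_interface_changes(). Returns the iface-id whose
 * lport would be released, or NULL if nothing is released.
 * Hypothetical helper: the real code works on OVSDB records. */
static const char *
iface_id_to_release(bool deleted, const char *iface_id,
                    const char *old_iface_id, long ofport)
{
    if (deleted) {
        return iface_id;              /* Interface row was deleted. */
    }
    if (iface_id) {
        if (old_iface_id && strcmp(iface_id, old_iface_id)) {
            return old_iface_id;      /* iface-id changed. */
        }
        if (ofport <= 0) {
            return iface_id;          /* For example ofport == -1. */
        }
        return NULL;                  /* Still bound, nothing to do. */
    }
    return old_iface_id;              /* iface-id removed; may be NULL. */
}
```

With an unchanged iface-id and ofport set to -1 the function returns the iface-id, which is the case observed in gdb: the Interface row still exists, but the lport is released anyway.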

    The check that decides whether an lport can be claimed:

    static bool
    can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
                             const char *requested_chassis)
    {
        return !requested_chassis || !requested_chassis[0]
               || !strcmp(requested_chassis, chassis_rec->name)
               || !strcmp(requested_chassis, chassis_rec->hostname);
    }
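
The check above can be tried in isolation with a small stand-in type. This is a sketch under stated assumptions: `chassis_stub` and `can_bind_stub` are hypothetical names that mirror only the two fields the real function reads, and the destination hostname used in the usage note is made up, since the logs show only the chassis name.

```c
#include <stdbool.h>
#include <string.h>

/* Minimal stand-in for struct sbrec_chassis, holding only the two
 * fields the check reads. Hypothetical type for illustration. */
struct chassis_stub {
    const char *name;
    const char *hostname;
};

/* Same condition as can_bind_on_this_chassis(): binding is allowed when
 * no chassis is requested, or when the requested value equals this
 * chassis's name or hostname. */
static bool
can_bind_stub(const struct chassis_stub *chassis,
              const char *requested_chassis)
{
    return !requested_chassis || !requested_chassis[0]
           || !strcmp(requested_chassis, chassis->name)
           || !strcmp(requested_chassis, chassis->hostname);
}
```

For a chassis named 63c4fb22-b817-4c84-9594-a8c554b8de46 with an assumed hostname such as dest.example, a requested-chassis of node-3.domain.tld makes the check fail, which corresponds to the "Not claiming lport" lines in the destination log. The check only passes once requested-chassis names the destination, which appears to happen after the port's binding host is updated in step 5 of the migration process.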
    

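    Attempt to reproduce an Interface whose ofport is -1:

    1. Create a tap device.
    2. Attach the tap device to br-int.
    3. Delete the tap device with an iproute2 command.

    The ofport field of the Interface now reads -1: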

    [root@node-1 ~]# ovs-vsctl list Interface | grep --color -C 10 lc-tap
    ...
    error               : "could not open network device lc-tap (No such device)"
    external_ids        : {}
    ...
    name                : lc-tap
    ofport              : -1
    ...
    

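    Determining when the device is deleted
    Added a log line before nova executes the unplug, ran a live migration, and compared the vswitchd and nova logs. vswitchd had already deleted the interface before nova performed the unplug. Assuming the nova log uses local time (UTC+8) while vswitchd logs in UTC, the deletion at 08:06:43Z precedes the unplug at 16:07:01 local time by about 18 seconds.

    nova log: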

    
    2021-10-27 16:07:01.368 28641 INFO nova.virt.libvirt.driver [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Do unplug vif from post_live_migration_at_source
    2021-10-27 16:07:01.369 28641 INFO nova.virt.libvirt.driver [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Do unplug vif from unplug_vifs
    2021-10-27 16:07:01.373 28641 INFO os_vif [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:b3:f9:b4,bridge_name='br-int',has_traffic_filtering=True,id=a88326d4-bca7-444c-8476-abcaddec9f12,network=Network(2c4dfab0-7362-4ad8-9a92-27cec0fe6c05),plugin='ovs',port_profile=VIFPortProfileBase,preserve_on_delete=False,vif_name='tapa88326d4-bc')
    

    vswitchd log:

    2021-10-27T08:06:43.183Z|08976|bridge|INFO|bridge br-int: deleted interface tapa88326d4-bc on port 1493
    2021-10-27T08:06:43.188Z|08977|bridge|WARN|could not open network device tapa88326d4-bc (No such device)
    

    Finally, the compute team confirmed that qemu deletes the tap device on the source node before unplug vif is executed.

    Sequence diagram of the live migration process and the packet loss

    [Figure: live_migrate.png]
