client-go: the watch interface's ResultChan closes automatically
@[toc]
Problem description
When working with client-go, you sometimes need the watch interface. A typical usage looks like this:
namespacesWatch, err := clientSet.CoreV1().Namespaces().Watch(metav1.ListOptions{})
if err != nil {
	klog.Errorf("create watch error, error is %s, program exit!", err.Error())
	panic(err)
}
for {
	e, ok := <-namespacesWatch.ResultChan()
	if !ok || e.Object == nil {
		// The channel has almost certainly been closed; when that happens
		// ok is false and e is the zero-value Event.
	} else {
		// normal event-handling logic
	}
}
The watch's ResultChan closes periodically. I don't know whether the period is configurable; in a GitHub issue someone reported it closing after 5 minutes, while on my cluster it took roughly 40 minutes. Digging into the exact interval would take more work, so here I'll just describe the problem I hit and how to solve it: ResultChan closes on its own at regular intervals. The relevant client-go issue is https://github.com/kubernetes/client-go/issues/623, where a maintainer replied: "No, the server will close watch connections regularly. Re-establishing a watch at the last-received resourceVersion is a normal part of maintaining a watch as a client. There are helpers to do this for you in https://github.com/kubernetes/client-go/tree/master/tools/watch". So there is an official solution: go look at the watch folder under client-go's tools directory.
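To make the maintainer's advice concrete, here is a minimal sketch of what "re-establishing a watch at the last-received resourceVersion" means. This is my own illustration, not code from client-go: `initialList` and `handleEvent` are placeholders, `corev1` is assumed to be `k8s.io/api/core/v1`, and it deliberately ignores error events such as 410 Gone, which the RetryWatcher discussed below handles properly:

```go
lastRV := initialList.ResourceVersion // from a prior List call (placeholder)
for {
	w, err := clientSet.CoreV1().Namespaces().Watch(metav1.ListOptions{
		ResourceVersion: lastRV, // resume where the last watch left off
	})
	if err != nil {
		klog.Errorf("re-establish watch failed: %v", err)
		time.Sleep(time.Second)
		continue
	}
	for e := range w.ResultChan() {
		if ns, ok := e.Object.(*corev1.Namespace); ok {
			lastRV = ns.ResourceVersion // remember the last-received RV
		}
		handleEvent(e) // placeholder for business logic
	}
	// The server closed ResultChan; loop around and resume from lastRV.
}
```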
Why resultChan closes automatically
Let's look at part of watch.go:
// Interface can be implemented by anything that knows how to watch and report changes.
type Interface interface {
// Stops watching. Will close the channel returned by ResultChan(). Releases
// any resources used by the watch.
Stop()
// Returns a chan which will receive all the events. If an error occurs
// or Stop() is called, this channel will be closed, in which case the
// watch should be completely cleaned up. !!! Note: this explicitly says the channel closes when an error occurs or Stop() is called
ResultChan() <-chan Event
}
Next, let's see which errors and which situations end up calling Stop(). I traced the object returned by the Watch method and landed in streamwatcher.go, where the actual watch object lives. Look at the struct's fields: the result channel, a mutex, and a stopped flag recording whether the watch has ended. In NewStreamWatcher, note the go sw.receive(): a goroutine starts receiving data before the object is even returned, so the caller just reads from the channel. Now look at the receive function and my comments on it (translated into English below): whenever decoding hits an error, receive returns, and the deferred sw.Stop() and close(sw.result) clean everything up on the way out. That matches the Interface comment above exactly!
// StreamWatcher turns any stream for which you can write a Decoder interface
// into a watch.Interface.
type StreamWatcher struct {
sync.Mutex
source Decoder
reporter Reporter
result chan Event
stopped bool
}
// NewStreamWatcher creates a StreamWatcher from the given decoder.
func NewStreamWatcher(d Decoder, r Reporter) *StreamWatcher {
sw := &StreamWatcher{
source: d,
reporter: r,
// It's easy for a consumer to add buffering via an extra
// goroutine/channel, but impossible for them to remove it,
// so nonbuffered is better.
result: make(chan Event),
}
go sw.receive() // !!! note: receiving starts as soon as the object is created
return sw
}
// ResultChan implements Interface.
func (sw *StreamWatcher) ResultChan() <-chan Event {
return sw.result
}
// Stop implements Interface.
func (sw *StreamWatcher) Stop() {
// Call Close() exactly once by locking and setting a flag.
sw.Lock()
defer sw.Unlock()
if !sw.stopped {
sw.stopped = true
sw.source.Close()
}
}
// stopping returns true if Stop() was called previously.
func (sw *StreamWatcher) stopping() bool {
sw.Lock()
defer sw.Unlock()
return sw.stopped
}
// receive reads result from the decoder in a loop and sends down the result channel.
func (sw *StreamWatcher) receive() {
defer close(sw.result)
defer sw.Stop() // note: Stop() is called whenever this method returns
defer utilruntime.HandleCrash()
for { // loop forever, receiving events
action, obj, err := sw.source.Decode()
if err != nil { // every decode-error path below ends in a return
// Ignore expected error.
if sw.stopping() {
return
}
switch err {
case io.EOF:
// watch closed normally
case io.ErrUnexpectedEOF:
klog.V(1).Infof("Unexpected EOF during watch stream event decoding: %v", err)
default:
if net.IsProbableEOF(err) {
klog.V(5).Infof("Unable to decode an event from the watch stream: %v", err)
} else {
sw.result <- Event{
Type: Error,
Object: sw.reporter.AsObject(fmt.Errorf("unable to decode an event from the watch stream: %v", err)),
}
}
}
return // any error stops the receiving loop!!!
}
sw.result <- Event{
Type: action,
Object: obj,
} // no error: push the event into result
}
}
One thing I learned from this source code: when an object needs to continuously produce data, create a channel and start the receiving goroutine inside the constructor, before the object is returned. Consumers then simply read from the object's channel; there is no need to manually kick off receive. It is convenient and safe: everything except the Stop() and ResultChan() methods is internal, so external callers cannot interfere with the normal logic.
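Here is a minimal sketch of the same pattern, detached from client-go (all names are my own, purely for illustration): the constructor wires up the channel and starts the producing goroutine before handing the object out, and the only public surface is Events() and Stop(). A real implementation would guard Stop() with a mutex against double-close, as StreamWatcher does:

```go
package main

import "fmt"

type ticker struct {
	events  chan string
	stopped chan struct{}
}

func newTicker() *ticker {
	t := &ticker{
		events:  make(chan string),
		stopped: make(chan struct{}),
	}
	go t.receive() // producing starts before the caller ever sees the object
	return t
}

func (t *ticker) receive() {
	defer close(t.events) // the closed channel tells readers we are done
	for i := 0; i < 5; i++ {
		select {
		case t.events <- fmt.Sprintf("event-%d", i):
		case <-t.stopped:
			return
		}
	}
}

func (t *ticker) Events() <-chan string { return t.events }
func (t *ticker) Stop()                 { close(t.stopped) }

func main() {
	t := newTicker()
	for e := range t.Events() { // the range ends when events is closed
		fmt.Println(e)
	}
}
```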
Solution
First of all, this problem does have a proper solution:
No, the server will close watch connections regularly. Re-establishing a watch at the last-received resourceVersion is a normal part of maintaining a watch as a client. There are helpers to do this for you in https://github.com/kubernetes/client-go/tree/master/tools/watch
That is the maintainer's reply on the issue.
First, my own crude workaround, shown below. As soon as I learned that the watch's ResultChan closes automatically, my first thought was: if it closes, I'll just create it again. So I wrote these two nested for loops: when the channel is detected as closed, break out of the inner loop and create the watch again. Crude, right? It did the job at the time and even shipped to production. (Not very professional of me; I should have guessed there would be an official solution. Note also that because it does not resume from the last-received resourceVersion, any events delivered between the close and the re-watch are lost: the new watch starts from the current state, not from where the old one stopped.)
for {
	klog.Info("start watch")
	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Errorf("get in-cluster config error, error is %s, program exit!", err.Error())
		panic(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Errorf("create clientset error, error is %s, program exit!", err.Error())
		panic(err)
	}
	namespacesWatch, err := clientSet.CoreV1().Namespaces().Watch(metav1.ListOptions{})
	if err != nil {
		klog.Errorf("create watch error, error is %s, program exit!", err.Error())
		panic(err)
	}
loopier:
	for {
		select {
		case e, ok := <-namespacesWatch.ResultChan():
			if !ok {
				// the channel has been closed; break out and re-create the watch
				klog.Warning("!!!!!namespacesWatch chan has been closed!!!!")
				time.Sleep(time.Second * 5)
				break loopier
			}
			if e.Object != nil {
				// business logic
			}
		}
	}
}
Now for the official solution. Under the client-go project there is a tools directory with a watch folder containing retrywatcher.go. I have copied out the file's full source below; we will walk through it in detail afterwards, but take a look first!
// resourceVersionGetter is an interface used to get resource version from events.
// We can't reuse an interface from meta otherwise it would be a cyclic dependency and we need just this one method
type resourceVersionGetter interface {
GetResourceVersion() string
}
// RetryWatcher will make sure that in case the underlying watcher is closed (e.g. due to API timeout or etcd timeout)
// it will get restarted from the last point without the consumer even knowing about it.
// RetryWatcher does that by inspecting events and keeping track of resourceVersion.
// Especially useful when using watch.UntilWithoutRetry where premature termination is causing issues and flakes.
// Please note that this is not resilient to etcd cache not having the resource version anymore - you would need to
// use Informers for that.
type RetryWatcher struct {
lastResourceVersion string
watcherClient cache.Watcher
resultChan chan watch.Event
stopChan chan struct{}
doneChan chan struct{}
minRestartDelay time.Duration
}
// NewRetryWatcher creates a new RetryWatcher.
// It will make sure that watches gets restarted in case of recoverable errors.
// The initialResourceVersion will be given to watch method when first called.
func NewRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher) (*RetryWatcher, error) {
return newRetryWatcher(initialResourceVersion, watcherClient, 1*time.Second)
}
func newRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher, minRestartDelay time.Duration) (*RetryWatcher, error) {
switch initialResourceVersion {
case "", "0":
// TODO: revisit this if we ever get WATCH v2 where it means start "now"
// without doing the synthetic list of objects at the beginning (see #74022)
return nil, fmt.Errorf("initial RV %q is not supported due to issues with underlying WATCH", initialResourceVersion)
default:
break
}
rw := &RetryWatcher{
lastResourceVersion: initialResourceVersion,
watcherClient: watcherClient,
stopChan: make(chan struct{}),
doneChan: make(chan struct{}),
resultChan: make(chan watch.Event, 0),
minRestartDelay: minRestartDelay,
}
go rw.receive()
return rw, nil
}
func (rw *RetryWatcher) send(event watch.Event) bool {
// Writing to an unbuffered channel is blocking operation
// and we need to check if stop wasn't requested while doing so.
select {
case rw.resultChan <- event:
return true
case <-rw.stopChan:
return false
}
}
// doReceive returns true when it is done, false otherwise.
// If it is not done the second return value holds the time to wait before calling it again.
func (rw *RetryWatcher) doReceive() (bool, time.Duration) {
watcher, err := rw.watcherClient.Watch(metav1.ListOptions{
ResourceVersion: rw.lastResourceVersion,
})
// We are very unlikely to hit EOF here since we are just establishing the call,
// but it may happen that the apiserver is just shutting down (e.g. being restarted)
// This is consistent with how it is handled for informers
switch err {
case nil:
break
case io.EOF:
// watch closed normally
return false, 0
case io.ErrUnexpectedEOF:
klog.V(1).Infof("Watch closed with unexpected EOF: %v", err)
return false, 0
default:
msg := "Watch failed: %v"
if net.IsProbableEOF(err) {
klog.V(5).Infof(msg, err)
// Retry
return false, 0
}
klog.Errorf(msg, err)
// Retry
return false, 0
}
if watcher == nil {
klog.Error("Watch returned nil watcher")
// Retry
return false, 0
}
ch := watcher.ResultChan()
defer watcher.Stop()
for {
select {
case <-rw.stopChan:
klog.V(4).Info("Stopping RetryWatcher.")
return true, 0
case event, ok := <-ch:
if !ok {
klog.V(4).Infof("Failed to get event! Re-creating the watcher. Last RV: %s", rw.lastResourceVersion)
return false, 0
}
// We need to inspect the event and get ResourceVersion out of it
switch event.Type {
case watch.Added, watch.Modified, watch.Deleted, watch.Bookmark:
metaObject, ok := event.Object.(resourceVersionGetter)
if !ok {
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(errors.New("retryWatcher: doesn't support resourceVersion")).ErrStatus,
})
// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
resourceVersion := metaObject.GetResourceVersion()
if resourceVersion == "" {
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher: object %#v doesn't support resourceVersion", event.Object)).ErrStatus,
})
// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
// All is fine; send the event and update lastResourceVersion
ok = rw.send(event)
if !ok {
return true, 0
}
rw.lastResourceVersion = resourceVersion
continue
case watch.Error:
// This round trip allows us to handle unstructured status
errObject := apierrors.FromObject(event.Object)
statusErr, ok := errObject.(*apierrors.StatusError)
if !ok {
klog.Error(spew.Sprintf("Received an error which is not *metav1.Status but %#+v", event.Object))
// Retry unknown errors
return false, 0
}
status := statusErr.ErrStatus
statusDelay := time.Duration(0)
if status.Details != nil {
statusDelay = time.Duration(status.Details.RetryAfterSeconds) * time.Second
}
switch status.Code {
case http.StatusGone:
// Never retry RV too old errors
_ = rw.send(event)
return true, 0
case http.StatusGatewayTimeout, http.StatusInternalServerError:
// Retry
return false, statusDelay
default:
// We retry by default. RetryWatcher is meant to proceed unless it is certain
// that it can't. If we are not certain, we proceed with retry and leave it
// up to the user to timeout if needed.
// Log here so we have a record of hitting the unexpected error
// and we can whitelist some error codes if we missed any that are expected.
klog.V(5).Info(spew.Sprintf("Retrying after unexpected error: %#+v", event.Object))
// Retry
return false, statusDelay
}
default:
klog.Errorf("Failed to recognize Event type %q", event.Type)
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher failed to recognize Event type %q", event.Type)).ErrStatus,
})
// We are unable to restart the watch and have to stop the loop or this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
}
}
}
// receive reads the result from a watcher, restarting it if necessary.
func (rw *RetryWatcher) receive() {
defer close(rw.doneChan)
defer close(rw.resultChan)
klog.V(4).Info("Starting RetryWatcher.")
defer klog.V(4).Info("Stopping RetryWatcher.")
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go func() {
select {
case <-rw.stopChan:
cancel()
return
case <-ctx.Done():
return
}
}()
// We use non sliding until so we don't introduce delays on happy path when WATCH call
// timeouts or gets closed and we need to reestablish it while also avoiding hot loops.
wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
done, retryAfter := rw.doReceive()
if done {
cancel()
return
}
time.Sleep(retryAfter)
klog.V(4).Infof("Restarting RetryWatcher at RV=%q", rw.lastResourceVersion)
}, rw.minRestartDelay)
}
// ResultChan implements Interface.
func (rw *RetryWatcher) ResultChan() <-chan watch.Event {
return rw.resultChan
}
// Stop implements Interface.
func (rw *RetryWatcher) Stop() {
close(rw.stopChan)
}
// Done allows the caller to be notified when Retry watcher stops.
func (rw *RetryWatcher) Done() <-chan struct{} {
return rw.doneChan
}
Look at RetryWatcher's fields: compared with StreamWatcher it adds stopChan, doneChan, and minRestartDelay (the delay before restarting, which is configurable):
type RetryWatcher struct {
lastResourceVersion string
watcherClient cache.Watcher
resultChan chan watch.Event
stopChan chan struct{}
doneChan chan struct{}
minRestartDelay time.Duration
}
Now the constructors. NewRetryWatcher is the public entry point, but the real work is in newRetryWatcher, which sets minRestartDelay to 1 second. Read newRetryWatcher carefully: it ends by calling go rw.receive(), so let's look at the receive() method next.
func NewRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher) (*RetryWatcher, error) {
return newRetryWatcher(initialResourceVersion, watcherClient, 1*time.Second)
}
func newRetryWatcher(initialResourceVersion string, watcherClient cache.Watcher, minRestartDelay time.Duration) (*RetryWatcher, error) {
switch initialResourceVersion {
case "", "0":
// TODO: revisit this if we ever get WATCH v2 where it means start "now"
// without doing the synthetic list of objects at the beginning (see #74022)
return nil, fmt.Errorf("initial RV %q is not supported due to issues with underlying WATCH", initialResourceVersion)
default:
break
}
rw := &RetryWatcher{
lastResourceVersion: initialResourceVersion,
watcherClient: watcherClient,
stopChan: make(chan struct{}),
doneChan: make(chan struct{}),
resultChan: make(chan watch.Event, 0),
minRestartDelay: minRestartDelay,
}
go rw.receive()
return rw, nil
}
Look at the line ctx, cancel := context.WithCancel(context.Background()). If this is unfamiliar, read up on Go's context package: WithCancel takes a parent Context and returns a child Context plus a cancel function that cancels it. The context package exists to coordinate the goroutines serving a single request: request-scoped data, cancellation signals, deadlines, and so on. In short, when you need to shut down a chain of related goroutines, you use a context and call its cancel function.

Next comes wait.NonSlidingUntilWithContext; its source and doc comment are below. It keeps calling the given function in a loop, once per period, as long as the context is not done. What does the anonymous function do? It calls doReceive(). doReceive() also uses the Watch method: if Watch returns an error it returns right away; otherwise it enters a for loop whose select checks whether the RetryWatcher has been stopped, or whether data is available on ch. If ch has been closed, it returns false, 0. Whenever doReceive returns, receive() decides what happens next: if the first return value is true it calls cancel() and really exits, without re-creating the watch; only when it is false does control go back to NonSlidingUntilWithContext, which calls the anonymous function again and keeps listening.

So my crude workaround above really was too rough. The highlight here is the clever use of wait.NonSlidingUntilWithContext, which is well worth learning. The other lesson: a retry mechanism must define explicitly which cases should be retried and which must not be!
func (rw *RetryWatcher) receive() {
defer close(rw.doneChan)
defer close(rw.resultChan)
klog.V(4).Info("Starting RetryWatcher.")
defer klog.V(4).Info("Stopping RetryWatcher.")
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go func() {
select {
case <-rw.stopChan:
cancel()
return
case <-ctx.Done():
return
}
}()
// We use non sliding until so we don't introduce delays on happy path when WATCH call
// timeouts or gets closed and we need to reestablish it while also avoiding hot loops.
wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
done, retryAfter := rw.doReceive()
if done {
cancel()
return
}
time.Sleep(retryAfter)
klog.V(4).Infof("Restarting RetryWatcher at RV=%q", rw.lastResourceVersion)
}, rw.minRestartDelay)
}
// NonSlidingUntilWithContext loops until context is done, running f every
// period. (i.e. unless the context is done, f keeps being called in a loop)
//
// NonSlidingUntilWithContext is syntactic sugar on top of JitterUntilWithContext
// with zero jitter factor, with sliding = false (meaning the timer for period
// starts at the same time as the function starts).
func NonSlidingUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration) {
JitterUntilWithContext(ctx, f, period, 0.0, false)
}
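Here is a toy, self-contained example (my own, not from client-go) of how context.WithCancel and wait.NonSlidingUntilWithContext cooperate, mirroring the receive()/doReceive() structure: the function runs once per period until cancel() is called from inside it.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	count := 0
	// Blocks here, calling f once per second, until the context is done.
	wait.NonSlidingUntilWithContext(ctx, func(ctx context.Context) {
		count++
		fmt.Println("tick", count)
		if count == 3 {
			cancel() // same trick receive() uses: done == true -> cancel()
		}
	}, time.Second)
	fmt.Println("loop stopped") // reached right after cancel()
}
```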
The doReceive function:
func (rw *RetryWatcher) doReceive() (bool, time.Duration) {
watcher, err := rw.watcherClient.Watch(metav1.ListOptions{
ResourceVersion: rw.lastResourceVersion,
}) // open the watch
// We are very unlikely to hit EOF here since we are just establishing the call,
// but it may happen that the apiserver is just shutting down (e.g. being restarted)
// This is consistent with how it is handled for informers
switch err {
case nil:
break
case io.EOF:
// watch closed normally
return false, 0
case io.ErrUnexpectedEOF:
klog.V(1).Infof("Watch closed with unexpected EOF: %v", err)
return false, 0
default:
msg := "Watch failed: %v"
if net.IsProbableEOF(err) {
klog.V(5).Infof(msg, err)
// Retry
return false, 0
}
klog.Errorf(msg, err)
// Retry
return false, 0
}
if watcher == nil {
klog.Error("Watch returned nil watcher")
// Retry
return false, 0
}
ch := watcher.ResultChan()
defer watcher.Stop()
// ########### this part is important! this is important! this is important ###########
for {
select {
case <-rw.stopChan: // check whether we have been stopped
klog.V(4).Info("Stopping RetryWatcher.")
return true, 0
case event, ok := <-ch: // pull data from the channel
if !ok { // is the channel still open? if it was closed, return
klog.V(4).Infof("Failed to get event! Re-creating the watcher. Last RV: %s", rw.lastResourceVersion)
return false, 0
}
// We need to inspect the event and get ResourceVersion out of it
switch event.Type { // below is the logic for a successfully received event
case watch.Added, watch.Modified, watch.Deleted, watch.Bookmark:
metaObject, ok := event.Object.(resourceVersionGetter)
if !ok {
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(errors.New("retryWatcher: doesn't support resourceVersion")).ErrStatus,
})
// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
resourceVersion := metaObject.GetResourceVersion()
if resourceVersion == "" {
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher: object %#v doesn't support resourceVersion", event.Object)).ErrStatus,
})
// We have to abort here because this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
// All is fine; send the event and update lastResourceVersion
ok = rw.send(event)
if !ok {
return true, 0
}
rw.lastResourceVersion = resourceVersion
continue
case watch.Error:
// This round trip allows us to handle unstructured status
errObject := apierrors.FromObject(event.Object)
statusErr, ok := errObject.(*apierrors.StatusError)
if !ok {
klog.Error(spew.Sprintf("Received an error which is not *metav1.Status but %#+v", event.Object))
// Retry unknown errors
return false, 0
}
status := statusErr.ErrStatus
statusDelay := time.Duration(0)
if status.Details != nil {
statusDelay = time.Duration(status.Details.RetryAfterSeconds) * time.Second
}
switch status.Code {
case http.StatusGone:
// Never retry RV too old errors
_ = rw.send(event)
return true, 0
case http.StatusGatewayTimeout, http.StatusInternalServerError:
// Retry
return false, statusDelay
default:
// We retry by default. RetryWatcher is meant to proceed unless it is certain
// that it can't. If we are not certain, we proceed with retry and leave it
// up to the user to timeout if needed.
// Log here so we have a record of hitting the unexpected error
// and we can whitelist some error codes if we missed any that are expected.
klog.V(5).Info(spew.Sprintf("Retrying after unexpected error: %#+v", event.Object))
// Retry
return false, statusDelay
}
default:
klog.Errorf("Failed to recognize Event type %q", event.Type)
_ = rw.send(watch.Event{
Type: watch.Error,
Object: &apierrors.NewInternalError(fmt.Errorf("retryWatcher failed to recognize Event type %q", event.Type)).ErrStatus,
})
// We are unable to restart the watch and have to stop the loop or this might cause lastResourceVersion inconsistency by skipping a potential RV with valid data!
return true, 0
}
}
}
}
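To close the loop, here is a hedged usage sketch of RetryWatcher (my own example, using the same context-free client-go API as the code above, with namespaces as the watched resource): list once to obtain a valid starting resourceVersion, since the constructor rejects "" and "0", then let RetryWatcher transparently re-establish the watch whenever the server closes it.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	apiwatch "k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
	"k8s.io/klog"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Fatal(err)
	}

	// List once to obtain a valid starting resourceVersion
	// (newRetryWatcher rejects "" and "0", as we saw above).
	list, err := clientSet.CoreV1().Namespaces().List(metav1.ListOptions{})
	if err != nil {
		klog.Fatal(err)
	}

	rw, err := watchtools.NewRetryWatcher(list.ResourceVersion, &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (apiwatch.Interface, error) {
			return clientSet.CoreV1().Namespaces().Watch(options)
		},
	})
	if err != nil {
		klog.Fatal(err)
	}
	defer rw.Stop()

	// RetryWatcher re-establishes the underlying watch for us; this loop
	// only ends if RetryWatcher hits a non-retriable condition.
	for event := range rw.ResultChan() {
		switch event.Type {
		case apiwatch.Added, apiwatch.Modified, apiwatch.Deleted:
			// business logic
		case apiwatch.Error:
			klog.Warningf("watch error: %#v", event.Object)
		}
	}
}
```

Compared with my crude workaround, nothing is lost between restarts, because RetryWatcher resumes from lastResourceVersion.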