Analysis of a nodelocaldns pod restart incident

Author: NotFoundW | Published 2020-04-13 15:06

Alerts had been firing for several days in a row: the nodelocaldns pods kept restarting.

Listing the pods shows that many of them have restarted several times.

$ kubectl -n kube-system get pod | grep nodelo 
nodelocaldns-22ngs                            1/1     Running   0          25d
nodelocaldns-445j6                            1/1     Running   1          159d
nodelocaldns-4hmzn                            1/1     Running   4          159d
nodelocaldns-4ztfx                            1/1     Running   1          159d
nodelocaldns-57x9t                            1/1     Running   3          159d
...

Look at the logs of one of these pods from before its restart.

The key is the error on the last line: [FATAL] Failed to add back non-existent rule {filter OUTPUT [-p tcp -s 169.254.25.10 --port 53 -j ACCEPT]}

$ kubectl -n kube-system logs --previous nodelocaldns-4hmzn
2020/04/05 22:34:58 2020-04-05T22:34:58.339Z [INFO] Tearing down
2020/04/05 22:34:58 2020-04-05T22:34:58.381Z [INFO] Hit error during teardown - Link not found
2020/04/05 22:34:58 2020-04-05T22:34:58.381Z [INFO] Setting up networking for node cache
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
2020-04-05T22:34:58.414Z [INFO] CoreDNS-1.2.6
2020-04-05T22:34:58.414Z [INFO] linux/amd64, go1.11.5, 
CoreDNS-1.2.6
linux/amd64, go1.11.5, 
 [INFO] plugin/reload: Running configuration MD5 = 1eb7532090c17b1382ac5e04072e5b3e
2020-04-08T08:33:00.424Z [INFO] Tearing down
2020-04-08T08:33:00.507Z [FATAL] Failed to add back non-existent rule {filter OUTPUT [-p tcp -s 169.254.25.10 --port 53 -j ACCEPT]}

Clone the code and read the logic

First check which version is in use
$ kubectl -n kube-system get -o yaml pod nodelocaldns-4hmzn | grep mage:
    image: registry.fcagsdp.connectedservices-harman.cn:9443/k8s-dns-node-cache:1.15.1
    image: registry.fcagsdp.connectedservices-harman.cn:9443/k8s-dns-node-cache:1.15.1
Clone the code from GitHub
git clone --branch 1.15.1 https://github.com/kubernetes/dns.git
Open it in GoLand
Search for the function that produces the error message

It turns out to be in the runChecks() function

func (c *cacheApp) runChecks() {
    for _, rule := range c.iptablesRules {
        exists, err := c.iptables.EnsureRule(utiliptables.Prepend, rule.table, rule.chain, rule.args...)
        if !exists {
            if err != nil {
                cache.teardownNetworking()
                clog.Fatalf("Failed to add back non-existent rule %v", rule)
            }
            clog.Infof("Added back nonexistent rule - %v", rule)
        }
        if err != nil {
            clog.Errorf("Failed to check rule %v - %s", rule, err)
        }
    }
...
In the run() function, runChecks() is called once every 60 seconds
func (c *cacheApp) run() {
    c.params.exitChan = make(chan bool, 1)
    tick := time.NewTicker(c.params.interval * time.Second)
    for {
        select {
        case <-tick.C:
            c.runChecks()
        case <-c.params.exitChan:
            clog.Warningf("Exiting iptables check goroutine")
            return
        }
    }
}

c.params.interval is defined in parseAndValidateFlags(), but its exact value does not matter here

    flag.DurationVar(&c.params.interval, "syncinterval", 60, "interval(in seconds) to check for iptables rules")
Back to the runChecks() function

The error is raised after the call to EnsureRule(). Reading EnsureRule(), the error most likely originates from its call to checkRule().

func (runner *runner) EnsureRule(position RulePosition, table Table, chain Chain, args ...string) (bool, error) {
    fullArgs := makeFullArgs(table, chain, args...)

    runner.mu.Lock()
    defer runner.mu.Unlock()

    exists, err := runner.checkRule(table, chain, args...)
    if err != nil {
        return false, err
    }
    if exists {
        return true, nil
    }
    out, err := runner.run(operation(position), fullArgs)
    if err != nil {
        return false, fmt.Errorf("error appending rule: %v: %s", err, out)
    }
    return false, nil
}
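Next, look at the checkRule() function

This function uses hasCheck to tell whether the iptables command supports the -C option, and chooses which function to call accordingly. I checked that the iptables binary in the pod does support -C, so go straight to checkRuleUsingCheck()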

func (runner *runner) checkRule(table Table, chain Chain, args ...string) (bool, error) {
    if runner.hasCheck {
        return runner.checkRuleUsingCheck(makeFullArgs(table, chain, args...))
    } else {
        return runner.checkRuleWithoutCheck(table, chain, args...)
    }
}
The checkRuleUsingCheck() function

To reach the failing branch in runChecks() shown at the start, the err returned by EnsureRule() must be non-nil.

for _, rule := range c.iptablesRules {
        exists, err := c.iptables.EnsureRule(utiliptables.Prepend, rule.table, rule.chain, rule.args...)
        if !exists {
            if err != nil {
                cache.teardownNetworking()
                clog.Fatalf("Failed to add back non-existent rule %v", rule)
            }
            clog.Infof("Added back nonexistent rule - %v", rule)
        }
        if err != nil {
            clog.Errorf("Failed to check rule %v - %s", rule, err)
        }
    }

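Reading checkRuleUsingCheck(), then: for this error to come out of the check, the iptables -C command must fail with something other than exit code 1, so that execution reaches the last line of the function. Neither of the two earlier return statements produces an error. (Going by the EnsureRule() source above, the only other way for it to return an error is for the insert itself to fail after the check has reported the rule as absent.)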

func (runner *runner) checkRuleUsingCheck(args []string) (bool, error) {
    out, err := runner.run(opCheckRule, args)
    if err == nil {
        return true, nil
    }
    if ee, ok := err.(utilexec.ExitError); ok {
        // iptables uses exit(1) to indicate a failure of the operation,
        // as compared to a malformed commandline, for example.
        if ee.Exited() && ee.ExitStatus() == 1 {
            return false, nil
        }
    }
    return false, fmt.Errorf("error checking rule: %v: %s", err, out)
}
Trying it by hand

The analysis comes down to a single command failing, so it can be run manually inside the container.

$ kubectl -n kube-system exec -it nodelocaldns-4hmzn -- iptables -C PREROUTING -p tcp -d 169.254.25.10 --dport 53 -j NOTRACK
iptables: Bad rule (does a matching rule exist in that chain?).
command terminated with exit code 1

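The command prints "Bad rule" and exits with return code 1, which is how iptables -C reports that it found no matching rule (the command above has no -t option, so it checks the default filter table). In checkRuleUsingCheck() that is the ordinary "rule absent" case and returns no error. Running the command several more times gives the same result, so I could not reproduce by hand whatever error this command hit when the pod restarted.
What is most frustrating, though, is that checkRuleUsingCheck() does return an error on its last line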

    return false, fmt.Errorf("error checking rule: %v: %s", err, out)

But once it gets back to runChecks(), it is only used to test whether err is nil...

            if err != nil {
                cache.teardownNetworking()
                clog.Fatalf("Failed to add back non-existent rule %v", rule)
            }

Shouldn't this error be reported? Instead the log only says that the operation failed.
With that, there is nothing left to investigate...

Article link: https://www.haomeiwen.com/subject/qxggmhtx.html