Persistent alerts for days: nodelocaldns pods keep restarting

Listing the pods shows that many of them have restarted multiple times:
$ kubectl -n kube-system get pod | grep nodelo
nodelocaldns-22ngs 1/1 Running 0 25d
nodelocaldns-445j6 1/1 Running 1 159d
nodelocaldns-4hmzn 1/1 Running 4 159d
nodelocaldns-4ztfx 1/1 Running 1 159d
nodelocaldns-57x9t 1/1 Running 3 159d
...
Pull the logs from before one of the pods last restarted.
The key is the error on the last line: [FATAL] Failed to add back non-existent rule {filter OUTPUT [-p tcp -s 169.254.25.10 --port 53 -j ACCEPT]}
$ kubectl -n kube-system logs --previous nodelocaldns-4hmzn
2020/04/05 22:34:58 2020-04-05T22:34:58.339Z [INFO] Tearing down
2020/04/05 22:34:58 2020-04-05T22:34:58.381Z [INFO] Hit error during teardown - Link not found
2020/04/05 22:34:58 2020-04-05T22:34:58.381Z [INFO] Setting up networking for node cache
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
.:53 on 169.254.25.10
2020-04-05T22:34:58.414Z [INFO] CoreDNS-1.2.6
2020-04-05T22:34:58.414Z [INFO] linux/amd64, go1.11.5,
CoreDNS-1.2.6
linux/amd64, go1.11.5,
[INFO] plugin/reload: Running configuration MD5 = 1eb7532090c17b1382ac5e04072e5b3e
2020-04-08T08:33:00.424Z [INFO] Tearing down
2020-04-08T08:33:00.507Z [FATAL] Failed to add back non-existent rule {filter OUTPUT [-p tcp -s 169.254.25.10 --port 53 -j ACCEPT]}
Clone the code and read the logic
First, check which version is actually running:
$ kubectl -n kube-system get -o yaml pod nodelocaldns-4hmzn | grep mage:
image: registry.fcagsdp.connectedservices-harman.cn:9443/k8s-dns-node-cache:1.15.1
image: registry.fcagsdp.connectedservices-harman.cn:9443/k8s-dns-node-cache:1.15.1
Clone the matching tag from GitHub:
git clone --branch 1.15.1 https://github.com/kubernetes/dns.git
Open it in GoLand and search for the error message.
It turns out to be in the runChecks() function:
func (c *cacheApp) runChecks() {
for _, rule := range c.iptablesRules {
exists, err := c.iptables.EnsureRule(utiliptables.Prepend, rule.table, rule.chain, rule.args...)
if !exists {
if err != nil {
cache.teardownNetworking()
clog.Fatalf("Failed to add back non-existent rule %v", rule)
}
clog.Infof("Added back nonexistent rule - %v", rule)
}
if err != nil {
clog.Errorf("Failed to check rule %v - %s", rule, err)
}
}
...
run() invokes runChecks() every 60 seconds:
func (c *cacheApp) run() {
c.params.exitChan = make(chan bool, 1)
tick := time.NewTicker(c.params.interval * time.Second)
for {
select {
case <-tick.C:
c.runChecks()
case <-c.params.exitChan:
clog.Warningf("Exiting iptables check goroutine")
return
}
}
}
c.params.interval is set in parseAndValidateFlags(), but the exact value isn't important here:
flag.DurationVar(&c.params.interval, "syncinterval", 60, "interval(in seconds) to check for iptables rules")
Back to runChecks().
The error fires right after the call to EnsureRule(). Looking inside EnsureRule(), the returned error comes either from checkRule() or from the subsequent rule insertion; the checkRule() path is the one to follow first.
func (runner *runner) EnsureRule(position RulePosition, table Table, chain Chain, args ...string) (bool, error) {
fullArgs := makeFullArgs(table, chain, args...)
runner.mu.Lock()
defer runner.mu.Unlock()
exists, err := runner.checkRule(table, chain, args...)
if err != nil {
return false, err
}
if exists {
return true, nil
}
out, err := runner.run(operation(position), fullArgs)
if err != nil {
return false, fmt.Errorf("error appending rule: %v: %s", err, out)
}
return false, nil
}
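Note the return convention here: exists reports whether the rule was already present before the call, so a successful insert returns (false, nil). A toy in-memory version of that contract (ensureRule and the map-based store are illustrative, not from the repo):

```go
package main

import "fmt"

// ensureRule mimics EnsureRule's return contract against an in-memory set:
//   (true, nil)  -> rule was already present, nothing done
//   (false, nil) -> rule was missing and has just been inserted
func ensureRule(rules map[string]bool, rule string) (bool, error) {
	if rules[rule] {
		return true, nil
	}
	rules[rule] = true // stand-in for running `iptables -I`
	return false, nil
}

func main() {
	rules := map[string]bool{}
	exists, _ := ensureRule(rules, "-p tcp --dport 53 -j ACCEPT")
	fmt.Println(exists) // false: it was missing and got inserted
	exists, _ = ensureRule(rules, "-p tcp --dport 53 -j ACCEPT")
	fmt.Println(exists) // true: the second call finds it in place
}
```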
Next, the checkRule() function.
It uses hasCheck to record whether the local iptables binary supports the -C option, and dispatches accordingly. The iptables inside the pod does support -C, so go straight to checkRuleUsingCheck():
func (runner *runner) checkRule(table Table, chain Chain, args ...string) (bool, error) {
if runner.hasCheck {
return runner.checkRuleUsingCheck(makeFullArgs(table, chain, args...))
} else {
return runner.checkRuleWithoutCheck(table, chain, args...)
}
}
The checkRuleUsingCheck() function
To line up with the fatal branch back in runChecks(), EnsureRule() has to return exists == false together with a non-nil err:
for _, rule := range c.iptablesRules {
exists, err := c.iptables.EnsureRule(utiliptables.Prepend, rule.table, rule.chain, rule.args...)
if !exists {
if err != nil {
cache.teardownNetworking()
clog.Fatalf("Failed to add back non-existent rule %v", rule)
}
clog.Infof("Added back nonexistent rule - %v", rule)
}
if err != nil {
clog.Errorf("Failed to check rule %v - %s", rule, err)
}
}
So, from the code of checkRuleUsingCheck(), the only way to produce that error is for the iptables -C command to fail with a shell return code other than 1 — that is the only way execution reaches the final return; otherwise one of the two earlier returns is taken.
func (runner *runner) checkRuleUsingCheck(args []string) (bool, error) {
out, err := runner.run(opCheckRule, args)
if err == nil {
return true, nil
}
if ee, ok := err.(utilexec.ExitError); ok {
// iptables uses exit(1) to indicate a failure of the operation,
// as compared to a malformed commandline, for example.
if ee.Exited() && ee.ExitStatus() == 1 {
return false, nil
}
}
return false, fmt.Errorf("error checking rule: %v: %s", err, out)
}
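The key distinction is therefore the process exit code: 1 means "rule not found" and maps to (false, nil), while any other failure (the code comment gives a malformed command line as an example) falls through to the error return. The same type-assertion pattern can be reproduced with plain os/exec, using sh to fake the different exit codes (exitStatus is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os/exec"
)

// exitStatus runs a shell command and pulls the exit code out of the
// *exec.ExitError, the same way checkRuleUsingCheck inspects
// utilexec.ExitError.
func exitStatus(cmd string) int {
	err := exec.Command("sh", "-c", cmd).Run()
	if err == nil {
		return 0
	}
	if ee, ok := err.(*exec.ExitError); ok {
		return ee.ExitCode()
	}
	return -1 // command did not even run
}

func main() {
	fmt.Println(exitStatus("exit 0")) // rule exists   -> (true, nil)
	fmt.Println(exitStatus("exit 1")) // rule missing  -> (false, nil)
	fmt.Println(exitStatus("exit 2")) // anything else -> (false, error) -> the FATAL path
}
```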
Trying it by hand
The analysis boils down to a single command failing, so run it manually inside the container:
$ kubectl -n kube-system exec -it nodelocaldns-4hmzn -- iptables -C PREROUTING -p tcp -d 169.254.25.10 --dport 53 -j NOTRACK
iptables: Bad rule (does a matching rule exist in that chain?).
command terminated with exit code 1
The result: iptables reports no matching rule (even though the rule should already be in place), and the exit code is 1. But exit code 1 is precisely the normal (false, nil) "rule absent" path, not the error path. Repeating the command several more times gives the same result, so whatever error the command hit at the moment the pod restarted can't be reproduced by hand.
The most maddening part is that checkRuleUsingCheck() does return a descriptive error:
return false, fmt.Errorf("error checking rule: %v: %s", err, out)
yet back in the original runChecks() function, that error is only ever used to check whether err is nil...
if err != nil {
cache.teardownNetworking()
clog.Fatalf("Failed to add back non-existent rule %v", rule)
}
Shouldn't that error text be surfaced? All the fatal log line says is that it failed???
Nothing more to go on for now...