watchdog_timer:
expire time: 4s = (watchdog_thresh * 2) / 5 (watchdog_thresh is set via /proc/sys/kernel/watchdog_thresh; the system default is 10, and that value is assumed below)
function: increments hrtimer_interrupts each time watchdog_timer expires
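The period formula above can be checked with a few lines of Python. This is only a sketch: the helper names `read_watchdog_thresh` and `hrtimer_period_seconds` are made up for illustration and are not kernel functions, and the fallback value 10 is simply the default mentioned above.

```python
def read_watchdog_thresh(path="/proc/sys/kernel/watchdog_thresh", default=10):
    """Read watchdog_thresh from procfs; fall back to the default if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def hrtimer_period_seconds(watchdog_thresh=10):
    """Period of watchdog_timer in seconds: (watchdog_thresh * 2) / 5."""
    return watchdog_thresh * 2 / 5


# With the default threshold of 10 the timer fires every 4 seconds.
print(hrtimer_period_seconds(10))
```

On a live system, `hrtimer_period_seconds(read_watchdog_thresh())` gives the period for the currently configured threshold.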
0       4       8       12      16      20
|-------|-------|-------|-------|-------|----
                   /|\
                    |
                    |
                    |
0                   10                  20
|-------------------|-------------------|
                nmi_check
nmi:
expire time: 10s = watchdog_thresh; the counter value of the perf event is also determined by watchdog_thresh
function: nmi check
if (hrtimer_interrupts == hrtimer_interrupts_save) ---> hard_lockup ---> warning or crash
else hrtimer_interrupts_save = hrtimer_interrupts
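The interplay between the 4s timer and the 10s nmi check can be replayed in a small simulation. This is a simplified sketch, not kernel code: time advances in whole seconds, a timer tick that falls inside the interrupts-off window is simply dropped, and when a tick and an nmi land on the same second the tick is handled first. The function name `run_detector` and its parameters are hypothetical.

```python
def run_detector(duration, timer_period=4, nmi_period=10, irqs_off=None):
    """Replay the nmi check on one simulated CPU.

    irqs_off is an optional (start, end) window in seconds during which
    interrupts are disabled, so the watchdog_timer callback cannot run.
    Returns the nmi check times at which a hard lockup is flagged.
    """
    hrtimer_interrupts = 0
    hrtimer_interrupts_save = 0
    lockups = []
    for t in range(1, duration + 1):
        masked = irqs_off is not None and irqs_off[0] <= t < irqs_off[1]
        # watchdog_timer expiry: feed the dog unless interrupts are off.
        if t % timer_period == 0 and not masked:
            hrtimer_interrupts += 1
        # nmi cannot be masked, so the check always runs.
        if t % nmi_period == 0:
            if hrtimer_interrupts == hrtimer_interrupts_save:
                lockups.append(t)
            else:
                hrtimer_interrupts_save = hrtimer_interrupts
    return lockups


print(run_detector(40))                    # healthy CPU
print(run_detector(40, irqs_off=(9, 31)))  # interrupts off from 9s to 31s
```

In the healthy run every nmi check sees a changed counter. In the second run no tick is handled between the checks at 10s and 30s, so the checks at 20s and 30s both flag a hard lockup.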
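From the above, in theory the watchdog timer feeds the dog 2-3 times before each nmi check. The causes of a hard lockup can therefore be summarized in the following two points:
1: Interrupts are disabled and stay disabled for a long time, so the watchdog timer interrupt is not serviced and the dog cannot be fed every 4s; the nmi expiry handler then concludes that a hard lockup has occurred.
2: The nmi timing is based on the hardware CPU frequency. If the frequency is unstable, or Turbo-Mode is enabled and the frequency suddenly jumps, the nmi check can fire early and falsely report a hard lockup.
For point 2, a corresponding patch was merged upstream, and the following article explains it well: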
https://cloud.tencent.com/developer/article/1646007
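The false positive in point 2 can be illustrated with simple arithmetic. The sketch below assumes the nmi counter is programmed for nominal frequency * watchdog_thresh CPU cycles, so its wall-clock interval shrinks when the CPU runs above nominal frequency. The function names and the 2 GHz nominal frequency are illustrative only, and the model ignores timer jitter.

```python
def nmi_interval_seconds(watchdog_thresh=10, nominal_hz=2.0e9, actual_hz=2.0e9):
    """Wall-clock time between nmi checks when the counter counts CPU cycles."""
    cycles = nominal_hz * watchdog_thresh  # assumed counter period in cycles
    return cycles / actual_hz


def can_false_positive(watchdog_thresh=10, nominal_hz=2.0e9, actual_hz=2.0e9):
    """True if two nmi checks can occur with no watchdog_timer tick in between."""
    timer_period = watchdog_thresh * 2 / 5
    interval = nmi_interval_seconds(watchdog_thresh, nominal_hz, actual_hz)
    return interval < timer_period


# Nominal frequency: nmi every 10s, timer every 4s, no false positive.
print(nmi_interval_seconds(), can_false_positive())
# CPU running at 3x nominal: the nmi interval drops below the 4s timer period.
print(nmi_interval_seconds(actual_hz=6.0e9), can_false_positive(actual_hz=6.0e9))
```

Under this model, with the default threshold, a false report becomes possible once the effective frequency exceeds 2.5 times the nominal one, because the nmi interval then falls below the 4s timer period.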