Tracking down a large number of zombie processes created by the agent
Background
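Once per second, the agent collects node data by running the free and docker stats commands. After it had been running for a while, the node hung. Investigation showed that the agent had left behind a large number of zombie processes.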
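# List zombie processes
ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]'
# A zombie has already exited and cannot be killed itself. This kills its parent (column 2, the PPID) so that init adopts and reaps the zombies. Here the parent is the agent, so the agent is terminated.
ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]' | awk '{print $2}' | sort -u | xargs kill -9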
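When many zombies are present, it can help to see which parent is leaking them. The following is a minimal Go sketch, not part of the original agent, that takes the text printed by ps -A -o stat,ppid,pid,cmd and counts zombies per parent PID. The function name zombiesByParent is hypothetical.

```go
package main

import (
	"strconv"
	"strings"
)

// zombiesByParent parses the output of `ps -A -o stat,ppid,pid,cmd`
// and returns a map from parent PID to the number of zombie children.
// (Hypothetical helper for illustration.)
func zombiesByParent(psOutput string) map[int]int {
	counts := make(map[int]int)
	for _, line := range strings.Split(psOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // blank or malformed line
		}
		// The STAT column starts with Z for zombies (for example "Z" or "Z+").
		if !strings.HasPrefix(fields[0], "Z") && !strings.HasPrefix(fields[0], "z") {
			continue
		}
		ppid, err := strconv.Atoi(fields[1])
		if err != nil {
			continue // header line or unexpected content
		}
		counts[ppid]++
	}
	return counts
}
```

A parent PID with a steadily growing count is the process that starts children without waiting for them.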
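Impact
A zombie process has already exited, but because its parent never calls wait, the kernel has to keep the child's process-table entry, and that entry keeps holding the PID. The OS has a finite number of PIDs, so a large number of zombies eventually leaves none available and the OS can no longer create new processes.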
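The effect described above can be observed directly. The sketch below is an illustration under assumptions, not the agent's code: it assumes Linux with /proc mounted and a true binary on PATH, and the names procState and zombieDemo are hypothetical. It starts a short-lived child, delays Wait, and reads the child's state from /proc/<pid>/stat.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// procState returns the one-letter process state from /proc/<pid>/stat
// (Linux only). "Z" means zombie.
func procState(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	s := string(data)
	// The format is "pid (comm) state ..."; comm may contain spaces,
	// so look for the state after the last closing parenthesis.
	i := strings.LastIndex(s, ")")
	if i < 0 {
		return "", fmt.Errorf("unexpected stat format for pid %d", pid)
	}
	fields := strings.Fields(s[i+1:])
	if len(fields) == 0 {
		return "", fmt.Errorf("missing state field for pid %d", pid)
	}
	return fields[0], nil
}

// zombieDemo starts a child that exits immediately, polls until it shows up
// as a zombie, then calls Wait and checks that its /proc entry is gone.
func zombieDemo() (stateBeforeWait string, goneAfterWait bool, err error) {
	cmd := exec.Command("true")
	if err = cmd.Start(); err != nil {
		return
	}
	pid := cmd.Process.Pid

	// Poll for up to about two seconds; Wait has not been called yet,
	// so the exited child cannot be reaped.
	for i := 0; i < 200 && stateBeforeWait != "Z"; i++ {
		time.Sleep(10 * time.Millisecond)
		stateBeforeWait, err = procState(pid)
		if err != nil {
			break
		}
	}

	// Always reap the child, even if reading /proc failed.
	waitErr := cmd.Wait()
	if err == nil {
		err = waitErr
	}

	_, statErr := os.Stat(fmt.Sprintf("/proc/%d", pid))
	goneAfterWait = os.IsNotExist(statErr)
	return
}
```

Until Wait is called the child stays in state Z and keeps its PID. An agent that starts several commands every second without waiting accumulates such entries continuously.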
Fix
// Wait waits for the command to exit and waits for any copying to
// stdin or copying from stdout or stderr to complete.
//
// The command must have been started by Start.
//
// The returned error is nil if the command runs, has no problems
// copying stdin, stdout, and stderr, and exits with a zero exit
// status.
//
// If the command fails to run or doesn't complete successfully, the
// error is of type *ExitError. Other error types may be
// returned for I/O problems.
//
// If any of c.Stdin, c.Stdout or c.Stderr are not an *os.File, Wait also waits
// for the respective I/O loop copying to or from the process to complete.
//
// Wait releases any resources associated with the Cmd.
func (c *Cmd) Wait() error {}
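Call Wait on every command started with Start so that the child is reaped and its resources are released. Run, Output and CombinedOutput call Wait internally, so the leak only occurs when Start is used without a matching Wait.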
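A minimal sketch of the fix, assuming the agent shells out through os/exec. The helper name runAndReap is hypothetical and the agent's real collection code is not shown in the source.

```go
package main

import (
	"bytes"
	"os/exec"
)

// runAndReap starts a command, captures its stdout, and always calls Wait
// after a successful Start so the child never lingers as a zombie.
// (Hypothetical helper for illustration.)
func runAndReap(name string, args ...string) (string, error) {
	var stdout bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &stdout

	if err := cmd.Start(); err != nil {
		// Nothing was started, so there is nothing to reap.
		return "", err
	}

	// Wait blocks until the child exits, reaps it, and releases the
	// resources associated with cmd.
	err := cmd.Wait()
	return stdout.String(), err
}
```

The agent's once-per-second collection could then call something like runAndReap("free") on each tick. Using cmd.Output() would have the same effect, since it runs the command and waits for it.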