Linux Process States and Signals

Author: 酱油王0901 | Source: published 2020-02-27 22:11

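Problem Description

Today, creating a cache partition failed in the test environment. The log showed that ceph-disk zap /dev/sdx had hung and was killed once it timed out. The relevant log entry is shown below: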

time=2020-02-27T10:08:25+08:00 level=warning module=utils/process.go:123 topic=kernel.external.process msg="Process was killed after 2m0.000139012s: /usr/sbin/ceph-disk [ceph-disk zap /dev/sdg]
out:
err: 1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.00448586 s, 935 MB/s
"

Analysis

Listing the related processes shows several leftover sgdisk processes:

[root@sds2 ~]# ps -ef | grep zap
root      4085     1  0 11:10 ?        00:00:00 /usr/sbin/sgdisk --zap-all -- /dev/sdg
root     23181     1  0 10:06 ?        00:00:00 /usr/sbin/sgdisk --zap-all -- /dev/sdg
root     40867     1  0 Feb26 ?        00:00:00 /usr/sbin/sgdisk --zap-all -- /dev/sdg
root     41064     1  0 Feb26 ?        00:00:00 /usr/sbin/sgdisk --zap-all -- /dev/sdi
root     42785     1  0 Feb26 ?        00:00:00 /usr/sbin/sgdisk --zap-all -- /dev/sdg
root     48840 32585  0 16:24 pts/1    00:00:00 grep --color=auto zap

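The kernel stack of one of these processes shows that it is hung in call_rwsem_down_read_failed, reached from sys_sync. For background, see the article 读写信号量与实时进程阻塞挂死问题 (read-write semaphores and blocked or hung real-time processes).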

[root@sds2 ~]# cat /proc/4085/stack
[<ffffffff81331ad8>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81204e8a>] iterate_supers+0xaa/0x120
[<ffffffff81233614>] sys_sync+0x44/0xb0
[<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

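Next, top shows the process in state D, which stands for uninterruptible sleep. Linux has two sleep states. The first is interruptible sleep: a process in this state can be woken by sending it a signal. For example, sending HUP to the nginx master process makes nginx reload its configuration file without restarting. The second is uninterruptible sleep: a process in this state does not respond to any signal until it leaves the state, so it cannot be killed with kill, whether plain kill, kill -15, or kill -9. The signals stay pending and take effect only once the process wakes up.

Why does a process end up in uninterruptible sleep? Usually because it is waiting for I/O: disk I/O, network I/O, or I/O on some other device. The state is normally too brief to notice, but if the I/O gets no response for a long time, the process becomes visible in ps or top in state D. That in turn suggests an I/O problem: the device itself may have failed, or a mounted remote file system may have become unreachable. Here the stack shows sgdisk waiting on a read-write semaphore inside sync, which is also an uninterruptible wait.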

[root@sds2 ~]# top -p 4085
top - 16:27:32 up 16 days, 20:22,  3 users,  load average: 7.24, 7.25, 7.26
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.4 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65758080 total, 37593416 free,  5325808 used, 22838856 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 53136852 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4085 root      20   0   53296   2112   1736 D   0.0  0.0   0:00.08 sgdisk
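
The STAT column of ps uses the same state codes, followed by extra modifier characters. For example, on another node:

(ENV) [root@ceph-2 ~]# ps -axf | grep etcd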
 7123 pts/1    S+     0:00          \_ grep --color=auto etcd
17158 ?        Ssl  462:16 /opt/sds/bin/etcd --config-file /opt/sds/etcd/etcd.conf
17227 ?        Ssl   97:00 /opt/sds/bin/etcd --config-file /opt/sds/etcd/etcd-proxy.conf
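
The state letter that top and ps print can also be read straight from /proc. The following is a minimal sketch, not part of the original troubleshooting session: the helper names parse_state and processes_in_state are made up for illustration, and it assumes the /proc/<pid>/stat layout in which the state letter is the first field after the parenthesised command name.

```python
import os


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before reading the state field.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def processes_in_state(wanted="D", proc_root="/proc"):
    """List PIDs under proc_root whose state letter equals `wanted`."""
    if not os.path.isdir(proc_root):
        return []  # no procfs here (e.g. not Linux)
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                state = parse_state(f.read())
        except (OSError, IndexError):
            continue  # process exited meanwhile, or unexpected content
        if state == wanted:
            found.append(int(entry))
    return sorted(found)


if __name__ == "__main__":
    print("PIDs in uninterruptible sleep:", processes_in_state("D"))
```

On a host like the one above, the PIDs it prints could then be inspected one by one with cat /proc/<pid>/stack, as was done for PID 4085.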

The following excerpt is from the ps man page.

  • This ps works by reading the virtual files in /proc.
  • Processes marked <defunct> are dead processes (so-called "zombies") that remain because their parent has not destroyed
    them properly. These processes will be destroyed by init(8) if the parent process exits.
PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to
       describe the state of a process:

               D    uninterruptible sleep (usually IO)
               R    running or runnable (on run queue)
               S    interruptible sleep (waiting for an event to complete)
               T    stopped by job control signal
               t    stopped by debugger during the tracing
               W    paging (not valid since the 2.6.xx kernel)
               X    dead (should never be seen)
               Z    defunct ("zombie") process, terminated but not reaped by its parent

       For BSD formats and when the stat keyword is used, additional characters may be displayed:

               <    high-priority (not nice to other users)
               N    low-priority (nice to other users)
               L    has pages locked into memory (for real-time and custom IO)
               s    is a session leader
               l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
               +    is in the foreground process group
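
To make the table concrete, here is a small sketch that decodes a STAT string such as Ssl or S+ into the descriptions listed above. The function name describe_stat and the shortened wording are my own; only the codes themselves come from the man page excerpt.

```python
# State letters and modifier characters, taken from the ps man page excerpt.
STATE_CODES = {
    "D": "uninterruptible sleep (usually IO)",
    "R": "running or runnable",
    "S": "interruptible sleep",
    "T": "stopped by job control signal",
    "t": "stopped by debugger during the tracing",
    "W": "paging",
    "X": "dead",
    "Z": "defunct (zombie)",
}

MODIFIERS = {
    "<": "high-priority",
    "N": "low-priority",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat):
    """Split a ps STAT value into its state and modifier descriptions."""
    if not stat:
        raise ValueError("empty STAT value")
    state, flags = stat[0], stat[1:]
    parts = [STATE_CODES.get(state, "unknown state %r" % state)]
    for flag in flags:
        parts.append(MODIFIERS.get(flag, "unknown modifier %r" % flag))
    return parts


if __name__ == "__main__":
    for value in ("D", "S+", "Ssl"):
        print(value, "->", ", ".join(describe_stat(value)))
```

Read this way, the etcd processes shown with Ssl are sleeping interruptibly, lead their session, and are multi-threaded, while the sgdisk process in D is the one stuck in uninterruptible sleep.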

The kill command mentioned earlier can list the available signals with kill -l.

[root@sds2 ~]# kill -l
 1) SIGHUP   2) SIGINT   3) SIGQUIT  4) SIGILL   5) SIGTRAP
 6) SIGABRT  7) SIGBUS   8) SIGFPE   9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
21) SIGTTIN 22) SIGTTOU 23) SIGURG  24) SIGXCPU 25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF 28) SIGWINCH    29) SIGIO   30) SIGPWR
31) SIGSYS  34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Of the signals above, 18 (SIGCONT), 19 (SIGSTOP), and 20 (SIGTSTP) deserve a closer look.

kill -SIGSTOP [pid]
kill -SIGCONT [pid]

On SIGSTOP:

When SIGSTOP is sent to a process, the usual behaviour is to pause that process in its current state. The process will only resume execution if it is sent the SIGCONT signal. SIGSTOP and SIGCONT are used for job control in the Unix shell, among other purposes. SIGSTOP cannot be caught or ignored.

On SIGCONT:

When SIGSTOP or SIGTSTP is sent to a process, the usual behaviour is to pause that process in its current state. The process will only resume execution if it is sent the SIGCONT signal. SIGSTOP and SIGCONT are used for job control in the Unix shell, among other purposes.

Put simply, SIGSTOP pauses a process and cannot be caught or ignored, whereas SIGTSTP can be caught or ignored. SIGCONT tells the process to resume from the point where it was paused.

In short, SIGSTOP tells a process to “hold on” and SIGCONT tells a process to “pick up where you left off”.
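
The stop and continue cycle can be observed from a parent process. The sketch below is only an illustration under a few assumptions: a POSIX system where waitpid supports WUNTRACED and WCONTINUED, and a throwaway child (a Python interpreter that just sleeps) standing in for a real workload. The function name stop_and_continue is hypothetical.

```python
import os
import signal
import subprocess
import sys


def stop_and_continue():
    """Stop a child with SIGSTOP, resume it with SIGCONT, report what waitpid saw.

    Returns (stopped, stop_signal, continued).
    """
    # A disposable child that simply sleeps; it is killed at the end.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        # Pause the child. SIGSTOP cannot be caught or ignored.
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report stopped (not only exited) children.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        stop_signal = os.WSTOPSIG(status) if stopped else None

        # Resume the child from where it was paused.
        os.kill(child.pid, signal.SIGCONT)
        # WCONTINUED makes waitpid report that the child was resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
        return stopped, stop_signal, continued
    finally:
        child.kill()  # clean up the demo child
        child.wait()


if __name__ == "__main__":
    print(stop_and_continue())
```

While the child is stopped, ps would show it with state T, matching the "stopped by job control signal" entry in the state table.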

  • A job running in the foreground can be stopped by typing the suspend character (Ctrl-Z). This sends the "terminal stop" signal (SIGTSTP) to the process group. By default, SIGTSTP causes processes receiving it to stop, and control is returned to the shell. However, a process can register a signal handler for or ignore SIGTSTP. A process can also be paused with the "stop" signal (SIGSTOP), which cannot be caught or ignored.
  • A job running in the foreground can be interrupted by typing the interruption character (Ctrl-C). This sends the "interrupt" signal (SIGINT), which defaults to terminating the process, though it can be overridden.
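
The difference between signals that can be caught or ignored (SIGTSTP, SIGINT) and those that cannot (SIGSTOP, SIGKILL) shows up as soon as a program tries to change their disposition. This is a minimal sketch with a hypothetical helper name, can_change_disposition; it assumes it runs in the main thread, which Python requires for signal.signal.

```python
import signal


def can_change_disposition(sig):
    """Return True if this process is allowed to ignore or handle `sig`."""
    try:
        # Try to ignore the signal; the kernel refuses this for
        # SIGSTOP and SIGKILL, which surfaces as OSError in Python.
        previous = signal.signal(sig, signal.SIG_IGN)
    except OSError:
        return False
    # Restore whatever was installed before, so the probe has no side effect.
    # `previous` is None when the old handler was not installed from Python.
    signal.signal(sig, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGINT, signal.SIGTSTP, signal.SIGSTOP, signal.SIGKILL):
        print(sig.name, can_change_disposition(sig))
```

This is why Ctrl-Z can be overridden by a program while kill -SIGSTOP always pauses it.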

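Another point worth noting is kill -0 <pid>: no signal is sent and only error checking is performed, so it can be used to test whether a process ID or process group ID exists. I came across the same technique earlier in keepalived's startup path.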

Jan  8 12:14:36 ceph-2 Keepalived[9288]: Opening file '/opt/sds/keepalived/sds-keepalived-10.252.90.77-8/keepalived.conf'.
Jan  8 12:14:36 ceph-2 Keepalived[9288]: Remove a zombie pid file /opt/sds/keepalived/sds-keepalived-10.252.90.77-8/keepalived.pid
Jan  8 12:14:36 ceph-2 Keepalived[9288]: Remove a zombie pid file /opt/sds/keepalived/sds-keepalived-10.252.90.77-8/vrrp.pid
Jan  8 12:14:36 ceph-2 Keepalived[9289]: Starting VRRP child process, pid=9290
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: Registering Kernel netlink reflector
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: Registering Kernel netlink command channel
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: Registering gratuitous ARP shared channel
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: Opening file '/opt/sds/keepalived/sds-keepalived-10.252.90.77-8/keepalived.conf'.
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: (sds-keepalived-10.252.90.77-8): Cannot start in MASTER state if not address owner
Jan  8 12:14:36 ceph-2 Keepalived_vrrp[9290]: (sds-keepalived-10.252.90.77-8): Unable to set no_accept mode since iptables chain name unset

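The log shows that keepalived still starts normally after an arbitrary process ID has been written into its pid file: the file is reported as a zombie pid file and removed. The source code shows that the pid files are checked at startup.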

        /* Check if keepalived is already running */
        if (keepalived_running(daemon_mode)) {
            log_message(LOG_INFO, "daemon is already running");
            report_stopped = false;
            goto end;
        }
    }

/* Return parent process daemon state */
bool
keepalived_running(unsigned long mode)
{
    if (process_running(main_pidfile))
        return true;
#ifdef _WITH_VRRP_
    if (__test_bit(DAEMON_VRRP, &mode) && process_running(vrrp_pidfile))
        return true;
#endif
#ifdef _WITH_LVS_
    if (__test_bit(DAEMON_CHECKERS, &mode) && process_running(checkers_pidfile))
        return true;
#endif
#ifdef _WITH_BFD_
    if (__test_bit(DAEMON_BFD, &mode) && process_running(bfd_pidfile))
        return true;
#endif
    return false;
}

static int
process_running(const char *pid_file)
{
    FILE *pidfile = fopen(pid_file, "r");
    pid_t pid = 0;
    int ret;

    /* No pidfile */
    if (!pidfile)
        return 0;

    ret = fscanf(pidfile, "%d", &pid);
    fclose(pidfile);
    if (ret != 1) {
        log_message(LOG_INFO, "Error reading pid file %s", pid_file);
        pid = 0;
        pidfile_rm(pid_file);
    }

    /* What should we return - we don't know if it is running or not. */
    if (!pid)
        return 1;

    /* If no process is attached to pidfile, remove it */
    if (kill(pid, 0)) {
        log_message(LOG_INFO, "Remove a zombie pid file %s", pid_file);
        pidfile_rm(pid_file);
        return 0;
    }

    return 1;
}

The man 2 kill page says:

#include <signal.h>

int kill(pid_t pid, int sig);

If sig is 0, then no signal is sent, but error checking is still performed; this can be used to check for the existence
       of a process ID or process group ID.
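
The same existence check is available from Python through os.kill with signal 0. The sketch below is an illustration rather than a port of the keepalived code: pid_exists and pidfile_is_live are hypothetical helpers, and unlike process_running above, an unreadable or malformed pid file is simply treated as not live.

```python
import os


def pid_exists(pid):
    """Check whether a process with this PID exists, using signal 0."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single PID.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # sends nothing; only error checking is performed
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to someone else
    return True


def pidfile_is_live(path):
    """Return True if `path` holds the PID of an existing process."""
    try:
        with open(path) as f:
            pid = int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return False  # missing, unreadable, empty, or not a number
    return pid > 0 and pid_exists(pid)


if __name__ == "__main__":
    print("this process exists:", pid_exists(os.getpid()))
```

One design difference is worth noting. The sketch treats EPERM as "process exists", whereas a bare check of the return value, as in `if (kill(pid, 0))`, treats every failure the same way. For a daemon running as root this should rarely matter.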

Article link: https://www.haomeiwen.com/subject/dihxhhtx.html