从一个进程死锁分析pipe的缓冲功能
背景介绍
最近碰到一个进程死锁的例子,这个进程简化后主要功能是:
- 调用一个shell子进程
- 读取子进程的stdout和stderr
主程序代码如下,
$ cat main1.go
package main
import (
"fmt"
"io"
"io/ioutil"
"os/exec"
"errors"
)
func ExecCmd(appCmd string) (string, string, error) {
var outStr, errStr string
var content []byte
var stdout, stderr io.ReadCloser
var err error
// prepare stdout/stderr
cmd := exec.Command("bash", appCmd)
stdout, err = cmd.StdoutPipe()
if err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
defer stdout.Close()
stderr, err = cmd.StderrPipe()
if err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
defer stderr.Close()
// launch process
if err := cmd.Start(); err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
// read stdout firstly
content, err = ioutil.ReadAll(stdout)
if err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
outStr = string(content)
// read stderr secondly
content, err = ioutil.ReadAll(stderr)
if err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
errStr = string(content)
// wait process
if err = cmd.Wait(); err != nil {
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
if cmd.ProcessState.Success() {
return outStr, errStr, nil
} else {
err = errors.New("Command Failed")
fmt.Printf("ERROR: %v\n", err)
return outStr, errStr, err
}
}
func main() {
stdout, stderr, err := ExecCmd("test.sh")
if err != nil {
fmt.Printf("ERROR: %v\n", err)
}
fmt.Printf("STDOUT: %s\n", stdout)
fmt.Printf("STDERR: %s\n", stderr)
}
子程序代码如下:
$ cat test.sh
#!/bin/bash
for i in {1..100000}
do
1>&2 echo "Welcome $i times"
done
echo 123
编译运行:
$ go build main1.go && ./main1
然后进程main1挂在这个地方了。
分析
我们看进程父子关系:
$ ps -u <uid> | grep main1
25149 pts/9 00:00:00 main1
$ ps -ef | grep 25149
<uid> 25149 9263 0 19:49 pts/9 00:00:00 ./main1
<uid> 25153 25149 0 19:49 pts/9 00:00:00 bash test.sh
main1是父进程,他有一个shell子进程test.sh
我们看main1的调用栈
$ pstack 25149
Thread 1 (process 25149):
#0 syscall.Syscall () at /usr/local/go/src/syscall/asm_linux_amd64.s:27
#1 0x0000000000462eb5 in syscall.read (fd=3, p=..., n=140066438885536, err=...) at /usr/local/go/src/syscall/zsyscall_linux_amd64.go:783
#2 0x0000000000462579 in syscall.Read (fd=3, p=..., n=842350697520, err=...) at /usr/local/go/src/syscall/syscall_unix.go:162
#3 0x000000000046e72f in os.(*File).read (f=0xc42000c028, b=..., n=512, err=...) at /usr/local/go/src/os/file_unix.go:165
#4 0x000000000046d7d6 in os.(*File).Read (f=0xc42000c028, b=..., n=4345630, err=...) at /usr/local/go/src/os/file.go:101
#5 0x0000000000496910 in bytes.(*Buffer).ReadFrom (b=0xc420039cc0, r=..., n=842350919680, err=...) at /usr/local/go/src/bytes/buffer.go:179
#6 0x0000000000499830 in io/ioutil.readAll (r=..., capacity=512, b=..., err=...) at /usr/local/go/src/io/ioutil/ioutil.go:33
#7 0x000000000049990e in io/ioutil.ReadAll (r=..., ~r1=..., ~r2=...) at /usr/local/go/src/io/ioutil/ioutil.go:42
#8 0x000000000049ddff in main.ExecCmd (appCmd=..., ~r1=..., ~r2=..., ~r3=...) at /home/<uid>/go/src/main/main1.go:41
#9 0x000000000049e428 in main.main () at /home/<uid>/go/src/main/main1.go:72
看到main1停在等待read的状态,我们看main1.go里面在读什么?
40 // read stdout firstly
41 content, err = ioutil.ReadAll(stdout)
42 if err != nil {
43 fmt.Printf("ERROR: %v\n", err)
44 return outStr, errStr, err
45 }
46 outStr = string(content)
可见试图读取stdout的内容,这很正常啊,为什么会读不到呢,stdout是用test.sh写入的啊。
这下提醒了我,main1的子进程test.sh也没有返回呢,也就是说test.sh没有执行完。
我们先看main1从什么地方读取数据:
$ strace -p 25149
Process 25149 attached
read(3,
接着看fd=3是指向什么?
$ ls -l /proc/25149/fd/3
lr-x------. 1 <uid> <gid> 64 Mar 6 19:52 /proc/25149/fd/3 -> pipe:[47984554]
这是一个pipe,我们看pipe的详细信息:
$ lsof 2>/dev/null | grep 47984554
main1 25149 <uid> 3r FIFO 0,10 0t0 47984554 pipe
main1 25149 25150 <uid> 3r FIFO 0,10 0t0 47984554 pipe
main1 25149 25151 <uid> 3r FIFO 0,10 0t0 47984554 pipe
main1 25149 25152 <uid> 3r FIFO 0,10 0t0 47984554 pipe
bash 25153 <uid> 10w FIFO 0,10 0t0 47984554 pipe
如果没有lsof可以使用:
$ (find /proc -type l | xargs ls -l | fgrep 'pipe:[47984554]') 2>/dev/null
前面我们已经知道pid=25153就是test.sh,所以可见pipe 47984554的写入端是test.sh,读取端是main1。
再看test.sh的pstack:
$ pstack 25153
#0 0x00007f0a164e6840 in __write_nocancel () from /lib64/libc.so.6
#1 0x00007f0a16472fb3 in _IO_new_file_write () from /lib64/libc.so.6
#2 0x00007f0a1647441c in __GI__IO_do_write () from /lib64/libc.so.6
#3 0x00007f0a164747f3 in __GI__IO_file_overflow () from /lib64/libc.so.6
#4 0x00007f0a16470c49 in putc () from /lib64/libc.so.6
#5 0x000000000046df36 in echo_builtin ()
#6 0x000000000042f26f in execute_builtin.isra.2 ()
#7 0x0000000000431359 in execute_simple_command ()
#8 0x00000000004326eb in execute_command_internal ()
#9 0x0000000000433b9e in execute_command ()
#10 0x0000000000432f05 in execute_command_internal ()
#11 0x0000000000433b9e in execute_command ()
#12 0x000000000041e285 in reader_loop ()
#13 0x000000000041c8ee in main ()
和strace:
$ strace -p 25153
Process 25153 attached
write(1, "Welcome 3500 times\n", 19
test.sh的fd=1指向
$ ls -l /proc/25153/fd/1
l-wx------. 1 <uid> <gid> 64 Mar 6 19:53 /proc/25153/fd/1 -> pipe:[47984555]
继续查看fd的关联端:
$ lsof 2>/dev/null | grep 47984555
main1 25149 <uid> 5r FIFO 0,10 0t0 47984555 pipe
main1 25149 25150 <uid> 5r FIFO 0,10 0t0 47984555 pipe
main1 25149 25151 <uid> 5r FIFO 0,10 0t0 47984555 pipe
main1 25149 25152 <uid> 5r FIFO 0,10 0t0 47984555 pipe
bash 25153 <uid> 1w FIFO 0,10 0t0 47984555 pipe
bash 25153 <uid> 2w FIFO 0,10 0t0 47984555 pipe
这一下就比较清楚了。
根源
由于test.sh的写操作失败了,导致test.sh挂起,导致test.sh无法完成,从而main1也无法从stdout读取数据,最终造成死锁。
为什么test.sh写操作时会失败呢,因为pipe写满了,而main1没有及时去读取,main1先读取的是stdout,而此时stderr已经满了。
也就是说进程间的pipe是有缓冲的,这个缓冲的大小不是无限制的,当缓冲填满之后就不能再写入,直到有人从中读取出内容。
解决方案
针对这个特定例子,我们可以在main1里面调换stdout和stderr的读取顺序,让stderr先读取,再读取stdout就不会发生死锁,因为stderr写满了,而stdout的内容简单不会满。
但是这个方案,如果stdout和stderr都内容很多,都会写满的情况下就不能解决问题了。
方案2:采用新的exec API
$ cat main2.go
package main
import (
"fmt"
"bytes"
"os/exec"
)
func ExecCmd(appCmd string) (string, string, error) {
var outbuf, errbuf bytes.Buffer
cmd := exec.Command("bash", "test.sh")
cmd.Stdout = &outbuf
cmd.Stderr = &errbuf
err := cmd.Run()
return string(outbuf.Bytes()), string(errbuf.Bytes()), err
}
func main() {
stdout, stderr, err := ExecCmd("test.sh")
if err != nil {
fmt.Printf("ERROR: %v\n", err)
}
fmt.Printf("STDOUT: %s\n", stdout)
fmt.Printf("STDERR: %s\n", stderr)
}
网友评论