Service 'zygote' killed by signa

Author: 啃着地瓜数星星 | Published: 2018-05-23 11:27


I. Problem description

01-07 21:57:03.228 1690 2829 D ActivityManager: cleanUpApplicationRecord -- 5762
01-07 21:57:03.232 1690 1702 W WindowManager: Attempted to remove non-existing token: android.os.Binder@333a888
01-07 21:57:03.233 1690 2829 W ActivityManager: Scheduling restart of crashed service com.android.statementservice/.DirectStatementService in 60923ms
01-07 21:57:03.234 2892 3105 E WtProcessController: Error pid or pid not exist
01-07 21:57:03.234 1690 3683 D ActivityManager: cleanUpApplicationRecord -- 5324
01-07 21:57:03.235 1690 3683 I AutoStartManagerService: MIUILOG- Reject RestartService packageName :com.android.email uid : 10064
01-07 21:57:03.236 2892 3105 E WtProcessController: Error pid or pid not exist
01-07 21:57:03.236 1690 2876 D ActivityManager: cleanUpApplicationRecord -- 5303
01-07 21:57:03.237 1690 2876 I AutoStartManagerService: MIUILOG- Reject RestartService packageName :com.miui.personalassistant uid : 10040
01-07 21:57:03.303 10500 10500 W init : type=1400 audit(0.0:2272): avc: denied { write } for name="zygote64_pid" dev="debugfs" ino=12729 scontext=u:r:init:s0 tcontext=u:object_r:debugfs_ktrace:s0 tclass=file permissive=0
01-07 21:57:03.311 739 739 I cnss-daemon: RTM_NEWNEIGH message received: 28
01-07 21:57:03.311 739 739 E cnss-daemon: Stale or unreachable neighbors, ndm state: 32
01-07 21:57:03.314 553 553 I ServiceManager: service 'media.camera' died
01-07 21:57:03.314 553 553 I ServiceManager: service 'media.player' died
01-07 21:57:03.314 553 553 I ServiceManager: service 'media.resource_manager' died
01-07 21:57:03.317 730 980 E OMXNodeInstance: !!! Observer died. Quickly, do something, ... anything...
01-07 21:57:03.373 553 553 I ServiceManager: service 'media.radio' died
01-07 21:57:03.373 553 553 I ServiceManager: service 'media.sound_trigger_hw' died
01-07 21:57:03.373 553 553 I ServiceManager: service 'media.audio_flinger' died
01-07 21:57:03.373 553 553 I ServiceManager: service 'media.audio_policy' died
01-07 21:57:03.388 553 553 I ServiceManager: service 'fingerprints_service' died

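1690 is the pid of system_server before the restart. The log above shows system_server still working normally at one moment; at the next, services start dying and the system begins to restart, with no error from system_server in between. In this situation one suspects that some other service died and directly or indirectly took system_server down, for example a SurfaceFlinger restart. The logs show, however, that SurfaceFlinger's pid did not change, whereas: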

u:r:zygote:s0                  root      1375  1     1613000 25752 20    0     0     0     fg  poll_sched 0000000000 S zygote
u:r:zygote:s0                  root      10500 1     2175000 92232 20    0     0     0     fg  poll_sched 0000000000 S zygote64

the pid of the zygote service did change. It is easy to infer that zygote64 was restarted and took system_server down with it, and searching the kernel log confirms this:

<13>[ 9907.324247] init: Service 'zygote' (pid 1374) killed by signal 1
<13>[ 9907.324349] init: Service 'zygote' (pid 1374) killing any children in process group

zygote64 was killed by signal 1. Which signal is that? "kill -l" lists the signal numbers:

(screenshot of the "kill -l" output not preserved)

Signal 1 is SIGHUP.

II. SIGHUP

The analysis above shows that zygote64 was killed by SIGHUP. Let us look at how that SIGHUP is generated.
kernel/msm-4.4/kernel/exit.c

/*
 * Check to see if any process groups have become orphaned as
 * a result of our exiting, and if they have any stopped jobs,
 * send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2)
 */
static void
kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
{
    struct pid *pgrp = task_pgrp(tsk);
    struct task_struct *ignored_task = tsk;
 
    if (!parent)
        /* exit: our father is in a different pgrp than
         * we are and we were the only connection outside.
         */
        parent = tsk->real_parent;
    else
        /* reparent: our child is in a different pgrp than
         * we are, and it was the only connection outside.
         */
        ignored_task = NULL;
 
    if (task_pgrp(parent) != pgrp &&
        task_session(parent) == task_session(tsk) &&
        will_become_orphaned_pgrp(pgrp, ignored_task) &&
        has_stopped_jobs(pgrp)) {
        __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp);
        __kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp);
    }
}

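When a series of conditions holds, __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp) is called and sends SIGHUP to every process in pgrp. How should these conditions be read?

Two places in the kernel end up calling this __kill_pgrp_info(...), as the original figure showed (figure not preserved).

Let us first look at the scenarios in which kill_orphaned_pgrp is called: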
kernel/msm-4.4/kernel/exit.c

/*
 * This does two things:
 *
 * A.  Make init inherit all the child processes
 * B.  Check to see if any process groups have become orphaned
 *  as a result of our exiting, and if they have any stopped
 *  jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
 */
static void forget_original_parent(struct task_struct *father,
                    struct list_head *dead)
{
    struct task_struct *p, *t, *reaper;
 
    if (unlikely(!list_empty(&father->ptraced)))
        exit_ptrace(father, dead);
 
    // Find the reaper for the children of the exiting process
    reaper = find_child_reaper(father);
    // No children: return immediately
    if (list_empty(&father->children))
        return;
    // Find the new reaper for the children of the exiting process
    reaper = find_new_reaper(father, reaper);
    list_for_each_entry(p, &father->children, sibling) {
        for_each_thread(p, t) {
            t->real_parent = reaper;
            BUG_ON((!t->ptrace) != (t->parent == father));
            if (likely(!t->ptrace))
                t->parent = t->real_parent;
            if (t->pdeath_signal)
                group_send_sig_info(t->pdeath_signal,
                            SEND_SIG_NOINFO, t);
        }
        /*
         * If this is a threaded reparent there is no need to
         * notify anyone anything has happened.
         */
        if (!same_thread_group(reaper, father))
            reparent_leader(father, p, dead);
    }
    list_splice_tail_init(&father->children, &reaper->children);
}

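Note that the father argument is actually the task that is exiting in do_exit, so this function mainly does two things:

It finds a new parent (real_parent) for every child process of father (the exiting task) and for every thread of each child.
If the new reaper is not in the same thread group as father, it calls reparent_leader(father, p, dead) for every child process p of father (note that the father passed here is still the exiting task, not the newly found reaper).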
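The reparenting step can be observed from user space. The following is a minimal sketch, assuming a Linux or other POSIX system; the function name grandchild_is_reparented is invented for this example. A grandchild polls getppid() until the value changes after its original parent exits, and reports the result through a pipe.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if a grandchild observes a new parent pid
 * after its original parent exits, 0 if it saw no change within about 2 s,
 * and -1 on a setup error.
 */
int grandchild_is_reparented(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t father = fork();
    if (father < 0)
        return -1;

    if (father == 0) {
        pid_t me = getpid();
        pid_t child = fork();
        if (child == 0) {
            /* Grandchild: wait until getppid() stops returning `me`. */
            struct timespec ten_ms = { 0, 10 * 1000 * 1000 };
            char changed = 0;
            for (int i = 0; i < 200 && !changed; i++) {
                if (getppid() != me)
                    changed = 1;
                else
                    nanosleep(&ten_ms, NULL);
            }
            ssize_t n = write(fds[1], &changed, 1);
            (void)n;
            _exit(0);
        }
        _exit(0);   /* father exits at once; the kernel reparents the child */
    }

    close(fds[1]);
    char changed = 0;
    ssize_t n = read(fds[0], &changed, 1);
    close(fds[0]);
    waitpid(father, NULL, 0);
    return n == 1 ? changed : -1;
}
```

The new parent is whatever reaper the kernel picked (init or a sub-reaper), which is why the sketch only checks that the value changed rather than comparing it with 1.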

kernel/msm-4.4/kernel/exit.c

/*
 * Any that need to be release_task'd are put on the @dead list.
 */
static void reparent_leader(struct task_struct *father, struct task_struct *p,
                struct list_head *dead)
{
    if (unlikely(p->exit_state == EXIT_DEAD))
        return;
 
    /* We don't want people slaying init. */
    p->exit_signal = SIGCHLD;
 
    /* If it has exited notify the new parent about this child's death. */
    if (!p->ptrace &&
        p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) {
        if (do_notify_parent(p, p->exit_signal)) {
            p->exit_state = EXIT_DEAD;
            list_add(&p->ptrace_entry, dead);
        }
    }
 
    kill_orphaned_pgrp(p, father);
}

That is the first call site of kill_orphaned_pgrp. Here is the second:
kernel/msm-4.4/kernel/exit.c

/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
{
    bool autoreap;
    struct task_struct *p, *n;
    LIST_HEAD(dead);
 
    write_lock_irq(&tasklist_lock);
    forget_original_parent(tsk, &dead);
 
    if (group_dead)
        kill_orphaned_pgrp(tsk->group_leader, NULL);
 
    if (unlikely(tsk->ptrace)) {
        int sig = thread_group_leader(tsk) &&
                thread_group_empty(tsk) &&
                !ptrace_reparented(tsk) ?
            tsk->exit_signal : SIGCHLD;
        autoreap = do_notify_parent(tsk, sig);
    } else if (thread_group_leader(tsk)) {
        autoreap = thread_group_empty(tsk) &&
            do_notify_parent(tsk, tsk->exit_signal);
    } else {
        autoreap = true;
    }
 
    tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE;
    if (tsk->exit_state == EXIT_DEAD)
        list_add(&tsk->ptrace_entry, &dead);
 
    /* mt-exec, de_thread() is waiting for group leader */
    if (unlikely(tsk->signal->notify_count < 0))
        wake_up_process(tsk->signal->group_exit_task);
    write_unlock_irq(&tasklist_lock);
 
    list_for_each_entry_safe(p, n, &dead, ptrace_entry) {
        list_del_init(&p->ptrace_entry);
        release_task(p);
    }
}

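group_dead is passed in by the caller of exit_notify(...) and indicates that this is the last task of its thread group to exit; tsk->group_leader is the thread-group leader, that is, the task whose pid is the tgid. This is the second call site of kill_orphaned_pgrp.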

Let us compare the two call sites. Suppose the exiting task is A (A is both the group_leader and the last task of its thread group to exit), and B is a child process of A (forked by A). The two calls are then:

kill_orphaned_pgrp(B, A)
kill_orphaned_pgrp(A, NULL)

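So the two call sites actually cover two different scenarios.

Scenario one, kill_orphaned_pgrp(B, A) (the original figure is not preserved):

Based on that figure, the four conditions that must hold before __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp) is called can be read as follows. Here pgrp2 is the figure's name for the group being checked (pgrp in the code), and a "bridge" is a member of that group whose parent lives in a different process group of the same session.

task_pgrp(parent) != pgrp: process B and its parent A are not in the same process group
task_session(parent) == task_session(tsk): A and B are in the same session
will_become_orphaned_pgrp(pgrp, ignored_task) (ignored_task is NULL here): apart from process B, no other process in pgrp2 acts as a bridge between process groups (processes whose parent is init do not count)
has_stopped_jobs(pgrp): some process in pgrp2 is stopped (p->signal->flags & SIGNAL_STOP_STOPPED is true)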

Scenario two, kill_orphaned_pgrp(A, NULL) (the original figure is not preserved):

Based on that figure, the four conditions that must hold before __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp) is called can be read as follows:

task_pgrp(parent) != pgrp: process A and its parent are not in the same process group
task_session(parent) == task_session(tsk): process A and its parent are in the same session
will_become_orphaned_pgrp(pgrp, ignored_task) (ignored_task is A here): apart from process A, no other process in pgrp2 acts as a bridge between process groups (processes whose parent is init do not count)
has_stopped_jobs(pgrp): some process in pgrp2 is stopped (p->signal->flags & SIGNAL_STOP_STOPPED is true)
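The four conditions can also be written down as a small user-space model. This is a simplified sketch based on my reading of the kernel logic, not kernel source: struct toy_task and every toy_* name are invented for illustration, and pid 1 stands in for the global init check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for task_struct; all field names are illustrative. */
struct toy_task {
    int pid, pgid, sid;
    const struct toy_task *real_parent;
    bool exited;    /* already dead, no longer counts as a member */
    bool stopped;   /* models SIGNAL_STOP_STOPPED */
};

/* True if no live member of `pgid` (other than `ignored`) still has a parent
 * outside the group but inside the same session. Parents with pid 1 (init)
 * never count as such a bridge. */
static bool toy_will_become_orphaned(const struct toy_task *tasks, size_t n,
                                     int pgid, const struct toy_task *ignored)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_task *p = &tasks[i];
        if (p->pgid != pgid || p == ignored || p->exited ||
            p->real_parent == NULL || p->real_parent->pid == 1)
            continue;
        if (p->real_parent->pgid != pgid && p->real_parent->sid == p->sid)
            return false;   /* another bridge is still alive */
    }
    return true;
}

/* True if any member of `pgid` is stopped. */
static bool toy_has_stopped_jobs(const struct toy_task *tasks, size_t n,
                                 int pgid)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].pgid == pgid && tasks[i].stopped)
            return true;
    return false;
}

/* Mirrors the condition in kill_orphaned_pgrp(): returns true when the group
 * of `tsk` would be sent SIGHUP followed by SIGCONT. */
bool toy_would_hangup(const struct toy_task *tasks, size_t n,
                      const struct toy_task *tsk,
                      const struct toy_task *parent)
{
    const struct toy_task *ignored = tsk;

    if (parent == NULL)
        parent = tsk->real_parent;   /* exit case (scenario two) */
    else
        ignored = NULL;              /* reparent case (scenario one) */
    if (parent == NULL)
        return false;

    return parent->pgid != tsk->pgid && parent->sid == tsk->sid &&
           toy_will_become_orphaned(tasks, n, tsk->pgid, ignored) &&
           toy_has_stopped_jobs(tasks, n, tsk->pgid);
}
```

Passing parent as NULL models the exit of tsk itself; passing the exiting parent models the check done for one of its children, whose real_parent should already point at the new reaper in the tasks array, as it does in the kernel by the time reparent_leader runs.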

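Putting the two scenarios together: when process A exits, the kernel checks whether the exit orphans the process groups of A's children (scenario one) and whether it orphans the process group A itself belonged to (scenario two).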

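III. Practice

The analysis above gives a clear picture of the two scenarios that lead to __kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp). Now let us verify it with some examples, combine theory with practice, and enjoy killing zygote in a variety of creative ways.

In many of the logs the system died while one-tap cleanup was killing the app com.tencent.tmgp.speedmobile, so let us see what is special about this app (starting on a 32-bit device):

1. 32-bit device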

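(screenshot of the process list not preserved)

Five processes share the same uid. We then use cat /proc/pid/stat and cat /proc/pid/status to read each one's pid, ppid, pgid, sid and so on. Taking 5999 as an example (screenshot not preserved):

In the stat output, the first number is the pid, and the three numbers after S are the ppid, pgid and sid.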
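Reading those fields by eye is error-prone when the process name contains spaces or parentheses. Here is a small parsing sketch; struct stat_ids and parse_stat_line are names made up for this example, and the field order (pid, comm, state, ppid, pgrp, session) follows the description above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the leading fields of /proc/<pid>/stat. */
struct stat_ids {
    int pid, ppid, pgid, sid;
    char state;
};

/*
 * Parse the leading fields of one /proc/<pid>/stat line.
 * The comm field is wrapped in parentheses and may itself contain spaces
 * or ')' characters, so the remaining fields are read after the LAST ')'.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_stat_line(const char *line, struct stat_ids *out)
{
    const char *close_paren = strrchr(line, ')');

    if (close_paren == NULL)
        return -1;
    if (sscanf(line, "%d", &out->pid) != 1)
        return -1;
    if (sscanf(close_paren + 1, " %c %d %d %d",
               &out->state, &out->ppid, &out->pgid, &out->sid) != 4)
        return -1;
    return 0;
}
```

A caller would typically fopen("/proc/self/stat", "r"), read one line with fgets, and pass it to parse_stat_line.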

The two commands give the following relationships among the five processes:

comm	pid	ppid	tgid	pgid
com.tencent.tmgp.speedmobile	5999	310	5999	5999
xg_service_v2	6076	310	6076	310
libxguardian.so	6199	1	6199	310
debuggerd	6274	5999	6274	6274
debuggerd	6276	6274	6276	310

By default, the pgid of zygote's children and grandchildren (that is, all Java processes) equals the pid of the zygote process; processes 5999 and 6274 must have set their own pgid.
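Both halves of that statement, inheriting the parent's pgid on fork and leaving the group with setpgid, can be checked with a few lines of C. This is a minimal sketch with an invented function name; for determinism the parent moves the child here, whereas a process can do the same to itself with setpgid(0, 0).

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo. Returns a bit mask:
 *   bit 0 set: a freshly forked child inherited the caller's pgid
 *   bit 1 set: after setpgid(child, child) the child leads its own group
 * Returns -1 if fork() fails.
 */
int pgid_inherit_then_reset(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        for (;;)
            pause();   /* stay alive until the parent kills us */
    }

    int result = 0;
    if (getpgid(child) == getpgrp())
        result |= 1;

    /* A parent may move its child as long as the child has not exec'd. */
    if (setpgid(child, child) == 0 && getpgid(child) == child)
        result |= 2;

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return result;
}
```

A return value of 3 means both observations held: the child started in the caller's group and ended up as the leader of a group whose id equals its own pid, which is the pattern seen for 5999 and 6274 in the table.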

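(The original figure showing the relationships among the five processes is not preserved; the labels A to E below come from that figure.)
From these relationships and the conclusions above, we can predict:

pgrp 310 has two bridges to other process groups: process C and process E
Sending a stop signal (kill -19) to some process in pgrp 310 and then killing process B reproduces scenario one
Sending a stop signal to some process in pgrp 310 (other than C) and then killing process C reproduces scenario two
Killing process B or C, sending a stop signal to some process in pgrp 310 (other than C and E), and then killing process E reproduces scenario two

Checking the three cases by hand confirms every prediction.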

2. 64-bit device

(screenshot of the process list not preserved)

Relationships among the seven processes:

comm	pid	ppid	tgid	pgid
zygote	709	1	709	709
zygote64	708	1	708	708
com.tencent.tmgp.speedmobile	5013	709	5013	5013
xg_service_v2	5101	709	5101	708
libxguardian.so	5201	1	5201	708
debuggerd	5232	5013	5232	5232
debuggerd	5234	5232	5234	708

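By default, the pgid of every Java process (except the zygote process itself) equals the pid of zygote64, and the zygote process sits in a process group of its own. (The original figure showing the relationships among the seven processes is not preserved; the labels A to E below come from that figure.)

From these relationships and the conclusions above, we can predict:

The zygote64 process group has several bridges to other process groups: process C, process E, and the children of the zygote process
Killing all children of zygote (except A), sending a stop signal (kill -19) to some process in the zygote64 process group, and then killing process B reproduces scenario one
Killing all children of zygote (except A), sending a stop signal to some process in the zygote64 process group (other than C), and then killing process C reproduces scenario two
Killing all children of zygote, killing process B or C, sending a stop signal to some process in the zygote64 process group (other than C and E), and then killing process E reproduces scenario two. This is theoretical: in practice, because of how com.tencent.tmgp.speedmobile is set up, killing process D takes process E down as well
Killing all children of zygote (except D, or any other single process X), killing process B or C, sending a stop signal to some process in the zygote64 process group (other than the processes that will be killed), and then killing process D or X reproduces scenario two

Checking these cases by hand confirms every prediction.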

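In summary, the only real difference between 32-bit and 64-bit devices is the extra zygote process group: once all children of zygote are killed, the process relationships on a 64-bit system are equivalent to those on a 32-bit system.

On a 64-bit device the zygote process has very few children, because most Java processes are children of zygote64, so it is easy to reach a state in which all of zygote's children have exited.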

IV. Solution

For why the SIGHUP mechanism exists, see the blog post http://blog.csdn.net/zhangfangew/article/details/27070491
In addition, an issue has been filed with Google for this problem, https://issuetracker.google.com/issues/71965619, together with a change, https://android-review.googlesource.com/c/platform/frameworks/base/+/588576

V. Supplementary notes

list_for_each(pos, head), list_for_each_entry(pos, head, member)
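These two macros iterate over the kernel's intrusive doubly linked lists, as in the list_for_each_entry(p, &father->children, sibling) loop quoted earlier. The following is a simplified user-space re-creation for illustration, not the kernel's list.h: struct toy_child and toy_sum_pids are invented names, and the entry macro relies on the GNU __typeof__ extension available in GCC and Clang.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the containing structure from a pointer to its list member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk the raw list nodes; pos is a struct list_head pointer. */
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* Walk the containing structures; pos is a pointer to the entry type. */
#define list_for_each_entry(pos, head, member)                          \
    for (pos = container_of((head)->next, __typeof__(*pos), member);    \
         &pos->member != (head);                                        \
         pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Toy analogue of a task linked into its parent's children list. */
struct toy_child {
    int pid;
    struct list_head sibling;
};

/* Build a three-element list and sum the pids with either macro. */
int toy_sum_pids(int use_entry_macro)
{
    struct list_head children = { &children, &children };
    struct toy_child kids[3] = { { .pid = 10 }, { .pid = 20 }, { .pid = 30 } };
    int sum = 0;

    for (int i = 0; i < 3; i++)
        list_add_tail(&kids[i].sibling, &children);

    if (use_entry_macro) {
        struct toy_child *p;
        list_for_each_entry(p, &children, sibling)
            sum += p->pid;
    } else {
        struct list_head *pos;
        list_for_each(pos, &children)
            sum += container_of(pos, struct toy_child, sibling)->pid;
    }
    return sum;
}
```

The difference is only in what the cursor points at: list_for_each hands back the embedded list_head and leaves the container_of step to the caller, while list_for_each_entry performs that step itself.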
