Android系统层Watchdog机制源码分析

作者: Anderson大码渣 | 来源:发表于2017-12-21 17:08 被阅读205次

    一:为什么需要看门狗?

    Watchdog,初次见到这个词语是在大学的单片机书上, 谈到了看门狗定时器. 在很早以前那个单片机刚发展的时候, 单片机容易受到外界工作影响, 导致自己的程序跑飞, 因此有了看门狗的保护机制, 即:需要每多少时间内都去喂狗, 如果不喂狗, 看门狗将触发重启. 大体原理是, 在系统运行以后启动了看门狗的计数器,看门狗就开始自动计数,如果到了一定的时间还不去清看门狗,那么看门狗计数器就会溢出从而引起看门狗中断,造成系统复位。

    而手机, 其实是一个超强超强的单片机, 其运行速度比单片机快N倍, 存储空间比单片机大N倍, 里面运行了若干个线程, 各种软硬件协同工作, 不怕一万,就怕万一, 万一我们的系统死锁了, 万一我们的手机也受到很大的干扰程序跑飞了. 都可能发生jj思密达的事情, 因此, 我们也需要看门狗机制.

    二:Android系统层看门狗

    看门狗有硬件看门狗和软件看门狗之分, 硬件就是单片机那种的定时器电路, 软件, 则是我们自己实现一个类似机制的看门狗.Android系统为了保证系统的稳定性,也设计了这么一个看门狗,其为了保证各种系统服务能够正常工作,要监控很多的服务,并且在核心服务异常时要进行重启,还要保存现场。

    接下来我们就看看Android系统的Watchdog是怎么设计的。

    注:本文以Android6.0代码讲解

    Android系统的Watchdog源码路径在此:
    frameworks/base/services/core/java/com/android/server/Watchdog.java

    Watchdog的初始化位于SystemServer.
    /frameworks/base/services/java/com/android/server/SystemServer.java

    在SystemServer中会对Watchdog进行初始化。

    492            Slog.i(TAG, "Init Watchdog");
    493            final Watchdog watchdog = Watchdog.getInstance();
    494            watchdog.init(context, mActivityManagerService);
    

    此时Watchdog会走如下初始化方法,先是构造方法,再是init方法:

    216    private Watchdog() {
    217        super("watchdog");
    218        // Initialize handler checkers for each common thread we want to check.  Note
    219        // that we are not currently checking the background thread, since it can
    220        // potentially hold longer running operations with no guarantees about the timeliness
    221        // of operations there.
    222
    223        // The shared foreground thread is the main checker.  It is where we
    224        // will also dispatch monitor checks and do other work.
    225        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
    226                "foreground thread", DEFAULT_TIMEOUT);
    227        mHandlerCheckers.add(mMonitorChecker);
    228        // Add checker for main thread.  We only do a quick check since there
    229        // can be UI running on the thread.
    230        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
    231                "main thread", DEFAULT_TIMEOUT));
    232        // Add checker for shared UI thread.
    233        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
    234                "ui thread", DEFAULT_TIMEOUT));
    235        // And also check IO thread.
    236        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
    237                "i/o thread", DEFAULT_TIMEOUT));
    238        // And the display thread.
    239        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
    240                "display thread", DEFAULT_TIMEOUT));
    241
    242        // Initialize monitor for Binder threads.
    243        addMonitor(new BinderThreadMonitor());
    244    }
    
    246    public void init(Context context, ActivityManagerService activity) {
    247        mResolver = context.getContentResolver();
    248        mActivity = activity;
    249        // 注册重启广播
    250        context.registerReceiver(new RebootRequestReceiver(),
    251                new IntentFilter(Intent.ACTION_REBOOT),
    252                android.Manifest.permission.REBOOT, null);
    253    }
    

    但是我们看了源码会知道,Watchdog这个类继承于Thread,所以还会需要一个启动的地方,就是下面这行代码,这是在ActivityManagerService的SystemReady接口中干的。

    Watchdog.getInstance().start();

    TAG: HandlerChecker

    上面的代码中有个比较重要的类,HandlerChecker,这是Watchdog用来检测主线程,io线程,显示线程,UI线程的机制,代码也不长,直接贴出来吧。其原理就是通过各个Handler的looper的MessageQueue来判断该线程是否卡住了。当然,该线程是运行在SystemServer进程中的线程。

    public final class HandlerChecker implements Runnable {
    88        private final Handler mHandler;
    89        private final String mName;
    90        private final long mWaitMax;
    91        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    92        private boolean mCompleted;
    93        private Monitor mCurrentMonitor;
    94        private long mStartTime;
    95
    96        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
    97            mHandler = handler;
    98            mName = name;
    99            mWaitMax = waitMaxMillis;
    100            mCompleted = true;
    101        }
    102
    103        public void addMonitor(Monitor monitor) {
    104            mMonitors.add(monitor);
    105        }
    106        // 记录当前的开始时间
    107        public void scheduleCheckLocked() {
    108            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
    109                // If the target looper has recently been polling, then
    110                // there is no reason to enqueue our checker on it since that
    111                // is as good as it not being deadlocked.  This avoid having
    112                // to do a context switch to check the thread.  Note that we
    113                // only do this if mCheckReboot is false and we have no
    114                // monitors, since those would need to be executed at this point.
    115                mCompleted = true;
    116                return;
    117            }
    118
    119            if (!mCompleted) {
    120                // we already have a check in flight, so no need
    121                return;
    122            }
    123
    124            mCompleted = false;
    125            mCurrentMonitor = null;
    126            mStartTime = SystemClock.uptimeMillis();
    127            mHandler.postAtFrontOfQueue(this);
    128        }
    129
    130        public boolean isOverdueLocked() {
    131            return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
    132        }
    133        // 获取完成时间标识
    134        public int getCompletionStateLocked() {
    135            if (mCompleted) {
    136                return COMPLETED;
    137            } else {
    138                long latency = SystemClock.uptimeMillis() - mStartTime;
    139                if (latency < mWaitMax/2) {
    140                    return WAITING;
    141                } else if (latency < mWaitMax) {
    142                    return WAITED_HALF;
    143                }
    144            }
    145            return OVERDUE;
    146        }
    147
    148        public Thread getThread() {
    149            return mHandler.getLooper().getThread();
    150        }
    151
    152        public String getName() {
    153            return mName;
    154        }
    155
    156        public String describeBlockedStateLocked() {
    157            if (mCurrentMonitor == null) {
    158                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
    159            } else {
    160                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
    161                        + " on " + mName + " (" + getThread().getName() + ")";
    162            }
    163        }
    164
    165        @Override
    166        public void run() {
    167            final int size = mMonitors.size();
    168            for (int i = 0 ; i < size ; i++) {
    169                synchronized (Watchdog.this) {
    170                    mCurrentMonitor = mMonitors.get(i);
    171                }
    172                mCurrentMonitor.monitor();
    173            }
    174
    175            synchronized (Watchdog.this) {
    176                mCompleted = true;
    177                mCurrentMonitor = null;
    178            }
    179        }
    180    }
    

    通过上面的代码,我们可以看到一个核心的方法是

    mHandler.getLooper().getQueue().isPolling()

    这个方法的实现在MessageQueue中,我将代码贴出来,我们可以看到上面的注释写到:返回当前的looper线程是否在polling工作来做,这个是个很好的用于检测loop是否存活的方法。我们从HandlerChecker源码可以看到,如果looper这个返回true,将会直接返回。

    139    /**
    140     * Returns whether this looper's thread is currently polling for more work to do.
    141     * This is a good signal that the loop is still alive rather than being stuck
    142     * handling a callback.  Note that this method is intrinsically racy, since the
    143     * state of the loop can change before you get the result back.
    144     *
    145     * <p>This method is safe to call from any thread.
    146     *
    147     * @return True if the looper is currently polling for events.
    148     * @hide
    149     */
    150    public boolean isPolling() {
    151        synchronized (this) {
    152            return isPollingLocked();
    153        }
    154    }
    155
    

    若没有返回true,表明looper当前正在工作,会post一下自己,同时将mComplete置为false,标明已经发出一个消息正在等待处理。如果当前的looper没有阻塞,那很快,将会调用到自己的run方法。

    自己的run方法干了什么呢。干的是TAG: HandlerChecker源码里面的166行,里面对自己的Monitors遍历并进行monitor。(注:此处的monitor下面会讲到),若有monitor发生了阻塞,那么mComplete会一直是false。

    那么在系统检测调用这个获取完成状态时,就会进入else里面,进行了时间的计算,并返回相应的时间状态码。

    133        // 获取完成时间标识
    134        public int getCompletionStateLocked() {
    135            if (mCompleted) {
    136                return COMPLETED;
    137            } else {
    138                long latency = SystemClock.uptimeMillis() - mStartTime;
    139                if (latency < mWaitMax/2) {
    140                    return WAITING;
    141                } else if (latency < mWaitMax) {
    142                    return WAITED_HALF;
    143                }
    144            }
    145            return OVERDUE;
    146        }
    

    好了,到这我们已经知道是怎么判断线程是否卡住的了

    1. MessageQueue.isPolling
    2. Monitor.monitor

    TAG:Monitor

    204    public interface Monitor {
    205        void monitor();
    206    }
    

    Monitor是一个接口,实现这个接口的类有好几个。比如:如下我搜出来的结果


    看,有这么多的类实现了该接口,而且我们都不用去猜,就可以知道,他们一定会注册到这个Watchdog中。注册到哪的呢,下面代码可以看到。

    225        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
    226                "foreground thread", DEFAULT_TIMEOUT);
    227        mHandlerCheckers.add(mMonitorChecker);
    
    275    public void addMonitor(Monitor monitor) {
    276        synchronized (this) {
    277            if (isAlive()) {
    278                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
    279            }
    280            mMonitorChecker.addMonitor(monitor);
    281        }
    282    }
    

    所以各个实现这个接口的类,只需要调一下,上述接口就行了。我们看一下ActivityManagerService类的调法。路径在此,点击可以进入。
    /frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

    2381        Watchdog.getInstance().addMonitor(this);
    
    19655    /** In this method we try to acquire our lock to make sure that we have not deadlocked */
    19656    public void monitor() {
    19657        synchronized (this) { }
    19658    }
    

    可以看到,我们的AMS实现了该接口,并在2381行,将自己注册进Watchdog. 同时其monitor方法只是同步一下自己,确保自己没有死锁。
    干的事情虽然不多,但这足够了。足够让外部通过这个方法得到AMS是否死了。

    好了,现在我们知道是如何判断其他服务是否死锁了,那么看Watchdog的run方法是怎么完成这一套机制的吧。

    TAG: Watchdog.run

    run方法就是死循环,不断的去遍历所有HandlerChecker,并调其监控方法,等待三十秒,评估状态。具体见下面的注释:

    341    @Override
    342    public void run() {
    343        boolean waitedHalf = false;
    344        while (true) {
    345            final ArrayList<HandlerChecker> blockedCheckers;
    346            final String subject;
    347            final boolean allowRestart;
    348            int debuggerWasConnected = 0;
    349            synchronized (this) {
    350                long timeout = CHECK_INTERVAL;
    351                // Make sure we (re)spin the checkers that have become idle within
    352                // this wait-and-check interval
                       // 在这里,我们遍历所有HandlerChecker,并调其监控方法,记录开始时间
    353                for (int i=0; i<mHandlerCheckers.size(); i++) {
    354                    HandlerChecker hc = mHandlerCheckers.get(i);
    355                    hc.scheduleCheckLocked();
    356                }
    357
    358                if (debuggerWasConnected > 0) {
    359                    debuggerWasConnected--;
    360                }
    361
    362                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
    363                // wait while asleep. If the device is asleep then the thing that we are waiting
    364                // to timeout on is asleep as well and won't have a chance to run, causing a false
    365                // positive on when to kill things.
    366                long start = SystemClock.uptimeMillis();
                       // 等待30秒,使用uptimeMills是为了不把手机睡眠时间算进入,手机睡眠时系统服务同样睡眠
    367                while (timeout > 0) {
    368                    if (Debug.isDebuggerConnected()) {
    369                        debuggerWasConnected = 2;
    370                    }
    371                    try {
    372                        wait(timeout);
    373                    } catch (InterruptedException e) {
    374                        Log.wtf(TAG, e);
    375                    }
    376                    if (Debug.isDebuggerConnected()) {
    377                        debuggerWasConnected = 2;
    378                    }
    379                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
    380                }
    381                // 评估Checker的状态,里面会遍历所有的HandlerChecker,并获取最大的返回值。
    382                final int waitState = evaluateCheckerCompletionLocked();
                       // 最大的返回值有四种情况,分别是:COMPLETED对应消息已处理完毕线程无阻塞
    383                if (waitState == COMPLETED) {
    384                    // The monitors have returned; reset
    385                    waitedHalf = false;
    386                    continue;
                       // WAITING对应消息处理花费0~29秒,继续运行
    387                } else if (waitState == WAITING) {
    388                    // still waiting but within their configured intervals; back off and recheck
    389                    continue;
                       // WAITED_HALF对应消息处理花费30~59秒,线程可能已经被阻塞,需要保存当前AMS堆栈状态
    390                } else if (waitState == WAITED_HALF) {
    391                    if (!waitedHalf) {
    392                        // We've waited half the deadlock-detection interval.  Pull a stack
    393                        // trace and wait another half.
    394                        ArrayList<Integer> pids = new ArrayList<Integer>();
    395                        pids.add(Process.myPid());
    396                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
    397                                NATIVE_STACKS_OF_INTEREST);
    398                        waitedHalf = true;
    399                    }
    400                    continue;
    401                }
    402                //OVERDUE对应消息处理已经花费超过60, 能够走到这里,说明已经发生了超时60秒了。那么下面接下来全是应对超时的情况
    403                // something is overdue!
    404                blockedCheckers = getBlockedCheckersLocked();
    405                subject = describeCheckersLocked(blockedCheckers);
    406                allowRestart = mAllowRestart;
    407            }
    408
    409            // If we got here, that means that the system is most likely hung.
    410            // First collect stack traces from all threads of the system process.
    411            // Then kill this process so that the system will restart.
    412            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
    413
                   .......各种记录的保存
    468
    469            // Only kill the process if the debugger is not attached.
    470            if (Debug.isDebuggerConnected()) {
    471                debuggerWasConnected = 2;
    472            }
    473            if (debuggerWasConnected >= 2) {
    474                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
    475            } else if (debuggerWasConnected > 0) {
    476                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
    477            } else if (!allowRestart) {
    478                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
    479            } else {
    480                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    481                for (int i=0; i<blockedCheckers.size(); i++) {
    482                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
    483                    StackTraceElement[] stackTrace
    484                            = blockedCheckers.get(i).getThread().getStackTrace();
    485                    for (StackTraceElement element: stackTrace) {
    486                        Slog.w(TAG, "    at " + element);
    487                    }
    488                }
    489                Slog.w(TAG, "*** GOODBYE!");
    490                Process.killProcess(Process.myPid());
    491                System.exit(10);
    492            }
    493
    494            waitedHalf = false;
    495        }
    496    }
    

    上述可以看到, 如果走到412行处。便是重启系统前的准备了。
    会进行以下事情:

    1. 写Eventlog
    2. 以追加的方式,输出system_server和3个native进程的栈信息
    3. 输出kernel栈信息
    4. dump所有阻塞线程
    5. 输出dropbox信息
    6. 判断有没有debuger,没有的话,重启系统了,并输出log: *** WATCHDOG KILLING SYSTEM PROCESS:

    三:总结:

    以上便是Android系统层Watchdog的原理了。设计的比较好。若由我来设计,我还真想不到使用Monitor那个锁机制来判断。

    接下来总结以下:

    1. Watchdog是一个线程,用来监听系统各项服务是否正常运行,没有发生死锁
    2. HandlerChecker用来检查Handler以及monitor
    3. monitor通过锁来判断是否死锁
    4. 超时30秒会输出log,超时60秒会重启(debug情况除外)

    本文作者:Anderson/Jerey_Jobs

    博客地址 : http://jerey.cn/
    简书地址 : Anderson大码渣
    github地址 : https://github.com/Jerey-Jobs

    相关文章

      网友评论

      本文标题:Android系统层Watchdog机制源码分析

      本文链接:https://www.haomeiwen.com/subject/desmwxtx.html