美文网首页
WatchDog运行原理

WatchDog运行原理

作者: 小庄bb | 来源:发表于2018-03-19 11:17 被阅读63次

    概述

    Android系统中,有硬件WatchDog用于定时检测关键硬件是否征程工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件.WatchDog功能是主要分析系统核心服务和重要线程是否处于Blocked状态.

    • 监视reboot广播;
    • 监视mMonnitors关键系统服务是否死锁.

    启动流程

    startOtherServices

    [-> SystemServer.java]

    private void startOtherServices() {
        ...
        //创建watchdog【见小节2.2】
        final Watchdogwatchdog = Watchdog.getInstance();
        //注册reboot广播【见小节2.3】
        watchdog.init(context, mActivityManagerService);
        ...
        mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY);
        ...
        mActivityManagerService.systemReady(new Runnable() {
          @Override
          public void run() {
              mSystemServiceManager.startBootPhase(
                      SystemService.PHASE_ACTIVITY_MANAGER_READY);
              ...
              // watchdog启动【见小节3.1】
              Watchdog.getInstance().start();
              mSystemServiceManager.startBootPhase(
                      SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
            }
        }
    }
    

    getInstance

    [-> Watchdog.java]

    public static WatchdoggetInstance() {
        if (sWatchdog == null) {
            //单例模式,创建实例对象【见小节2.2.1 】
            sWatchdog = new Watchdog();
        }
        return sWatchdog;
    }
    

    创建Watchdog

    [->Watchdog.java]

    public class Watchdog extends Thread {
        ...
     
        private Watchdog() {
            super("watchdog");
            //【见小节2.2.2 】
            mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                    "foreground thread", DEFAULT_TIMEOUT);
            mHandlerCheckers.add(mMonitorChecker);
            mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                    "main thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                    "ui thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                    "i/o thread", DEFAULT_TIMEOUT));
            mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                    "display thread", DEFAULT_TIMEOUT));
            //【见小节2.2.3】
            addMonitor(new BinderThreadMonitor());
        }
    }
    

    Watchdog继承于Thread,创建的线程名为"watchdog".mHandlerCheckers是记录着所有的HandlerChecker对象的列表.

    WatchDog监控的线程有:

    线程名 对应handler 含义
    main thread new Handler(MainLooper) 当前主线程
    android.fg FgThread.getHandler 前台线程
    android.ui UiThread.getHandler UI线程
    android.io IoThread.getHandler IO线程
    android.display DisplayThread.getHandler Display线程

    DEFALT_TIMEOUT默认为60s,调试时才为10s方便找出潜在的ANR问题.

    HandlerChecker

    [-> Watchdog.java]

    public final class HandlerChecker implements Runnable {
        ...
        HandlerChecker(Handlerhandler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }
    }
    

    mMonitors记录所有Watchdog目前正在监控的服务.

    监控Binder线程

    在创建Watchdog时,通过addMonitor(new BinderThreadMonitor())来监控BInder线程,这里拆分成两步:

    • addMonitor
    • new BinderThreadMonitor
    addMonitor
    public class Watchdog extends Thread {
        public void addMonitor(Monitormonitor) {
            synchronized (this) {
                if (isAlive()) {
                    throw new RuntimeException("Monitors can't be added once the Watchdog is running");
                }
                //此处mMonitorChecker数据类型为HandlerChecker
                mMonitorChecker.addMonitor(monitor);
            }
        }
     
        public final class HandlerChecker implements Runnable {
            private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
     
            public void addMonitor(Monitormonitor) {
                //将上面的BinderThreadMonitor添加到mMonitors队列
                mMonitors.add(monitor);
            }
            ...
        }
    }
    

    将monitor添加到HandlerChecker的成员变量mMonitors列表中.

    BinderThreadMonitor
    private static final class BinderThreadMonitorimplements Watchdog.Monitor {
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }
    

    blockUntilThreadAvaiable最终调用的IPCThreadState,等待有空闲的binder线程

    void IPCThreadState::blockUntilThreadAvailable()
    {
        pthread_mutex_lock(&mProcess->mThreadCountLock);
        while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
            //等待正在执行的binder线程小于进程最大binder线程上限(16个)
            pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
        }
        pthread_mutex_unlock(&mProcess->mThreadCountLock);
    }
    

    可见addMonitor(new BinderThreadMOnitor())是将Binder线程添加到android.fg线程handler(mMonitorChecher)来检查是否工作正常.

    Monitor

    public class Watchdog extends Thread {
        public interface Monitor {
            void monitor();
        }
    }
    

    能够被Watchdog监控的服务系统都实现了Watchdog.Monitor接口,实现该接口类:

    • AcitvityManagerService
    • PowerManagerService
    • WindowMangerService
    • InputManagerService
    • NetworkManagermentService
    • MountService
    • NativeDaemonConnector
    • BinderThreadMonitor
    • MediaProjectionMangerService
    • MediaRouterService
    • MediaSessionService

    init

    [-> Watchdog.java]

    public void init(Contextcontext, ActivityManagerServiceactivity) {
        mResolver = context.getContentResolver();
        mActivity = activity;
        //注册reboot广播接收者【见小节2.3.1】
        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }
    

    RebootRequestReceiver

    [-> Watchdog.java]

    final class RebootRequestReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context c, Intentintent) {
            if (intent.getIntExtra("nowait", 0) != 0) {
                //【见小节2.3.2】
                rebootSystem("Received ACTION_REBOOT broadcast");
                return;
            }
            Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
        }
    }
    

    rebootSystem

    [->Watchdog.java]

    void rebootSystem(String reason) {
        Slog.i(TAG, "Rebooting system because: " + reason);
        IPowerManagerpms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
        try {
            //通过PowerManager执行reboot操作
            pms.reboot(false, reason, false);
        } catch (RemoteExceptionex) {
        }
    }
    

    最终通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述.

    小节

    获取watchdog实例对象,并注册reboot广播

    • mHandlerCheckers记录所有的HandlerChecker对象的列表,包括foreground,main,ui,i/o,display线程的handler;
    • mMonitors记录所有Watchdog目前正在监控Monitor,此处为BinderThreadMonitor;
    • 注册reboot广播,最终是通过PowerMangerService来完成;
    • 系统每间隔一分钟执行一侧monitor操作,当系统hang时间超过1分钟则执行run()方法;
      下面来看看当系统hang超过1分钟时,进入watchdog.run()的过程:

    Watch.dog

    run

    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                //timeout=30s
                long timeout = CHECK_INTERVAL;
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerCheckerhc = mHandlerCheckers.get(i);
                    //【见小节3.2】
                    hc.scheduleCheckLocked();
                }
     
                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }
     
                long start = SystemClock.uptimeMillis();
                //等待30s
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout);
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                //获取等待状态【见小节3.3】
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        //第一次进入等待时间过半的状态
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        //则输出栈信息【见小节3.4】
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                NATIVE_STACKS_OF_INTEREST);
                        waitedHalf = true;
                    }
                    continue;
                }
                //获取被阻塞的checkers
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }
     
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
     
            ArrayList<Integer> pids = new ArrayList<Integer>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            //waitedHalf=true,则追加输出栈信息【见小节3.4】
            final Filestack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
            //系统已被阻塞1分钟,也不在乎多等待2s来确保stack trace信息输出
            SystemClock.sleep(2000);
     
            if (RECORD_KERNEL_THREADS) {
                //输出kernel栈信息【见小节3.5】
                dumpKernelStackTraces();
            }
     
            //触发kernel来dump所有阻塞线程【见小节3.6】
            doSysRq('l');
            //输出dropbox信息【见小节3.7】
            ThreaddropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                //等待dropbox线程工作2s
                dropboxThread.join(2000);
            } catch (InterruptedExceptionignored) {}
     
            IActivityControllercontroller;
            synchronized (this) {
                controller = mController;
            }
            if (controller != null) {
                //将阻塞状态报告给activity controller,
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    //返回值为1表示继续等待,-1表示杀死系统
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        waitedHalf = false; //继续等待
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }
     
            //当debugger没有attach时,才杀死进程
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                //遍历输出阻塞线程的栈信息
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElementelement: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                //杀死进程system_server【见小节3.8】
                Process.killProcess(Process.myPid());
                System.exit(10);
            }
     
            waitedHalf = false;
        }
    }
    

    scheduleCheckLocked

    public final class HandlerChecker implements Runnable {
        ...
        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
                mCompleted = true;
                return;
            }
     
            if (!mCompleted) {
                return; //有一个check正在处理中,则无需重复发送
            }
     
            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            //发送消息,插入消息队列最开头【见3.2.1】
            mHandler.postAtFrontOfQueue(this);
        }
    }
    
    

    postAtFrountOfQueue(this),该方法输入参数为Runnable对象,根据消息机制HandlerCHecker中的run方法.

    HandlerCheckerRun.run

    public final class HandlerChecker implements Runnable {
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                //回调具体服务的monitor方法
                mCurrentMonitor.monitor();
            }
     
            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }
    }
    

    回调的方法,例如BinderTreadMonitor.monitor

    evaluaterCHeckerCompletionLOcked

    private int evaluateCheckerCompletionLocked() {
        int state = COMPLETED;
        for (int i=0; i<mHandlerCheckers.size(); i++) {
            HandlerCheckerhc = mHandlerCheckers.get(i);
            //【见小节3.3.1】
            state = Math.max(state, hc.getCompletionStateLocked());
        }
        return state;
    }
    

    获取mHandlerCheckers列表中等待状态值最大的state

    getCompletionStateLocked

    public int getCompletionStateLocked() {
        if (mCompleted) {
            return COMPLETED;
        } else {
            long latency = SystemClock.uptimeMillis() - mStartTime;
            if (latency < mWaitMax/2) {
                return WAITING;
            } else if (latency < mWaitMax) {
                return WAITED_HALF;
            }
        }
        return OVERDUE;
    }
    
    • COMPLETED = 0 : 等待完成 ;
    • WAITING = 1 : 等待时间小于DEFAULT_TIMEOUT的一半,即30s;
    • WAITED_HALF = 2 : 等待时间处于30s-60s之间;
    • OVERDUE = 3 : 等待时间大于或者等于60s.

    AMS.dumpStackTraces

    public static FiledumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
            ProcessCpuTrackerprocessCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
        //默认为 data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }
     
        FiletracesFile = new File(tracesPath);
        try {
            //当clearTraces,则删除已存在的traces文件
            if (clearTraces && tracesFile.exists()) tracesFile.delete();
            //创建traces文件
            tracesFile.createNewFile();
            // -rw-rw-rw-
            FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
        } catch (IOException e) {
            return null;
        }
        //输出trace内容
        dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
        return tracesFile;
    }
    

    关于trace内容,这里就不细说,直接说说结论:
    1调用Process.sendSignal()向目标进程发送信号SIGNAL_QUIT;
    2分别调用backtrace.dump_backtrace(),输出/system/bin/mediaserver, /system/bin/sdcard/, /sysatem/bin/surfacelinger这三个进程的backtrace;
    3统计CPU使用率;
    4调用Process.sendSignal()向其他进程发送信号SIGNAL_QUIT.

    dumpKernelStackTrace

    private FiledumpKernelStackTraces() {
        // 路径为data/anr/traces.txt
        String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
        if (tracesPath == null || tracesPath.length() == 0) {
            return null;
        }
     
        native_dumpKernelStacks(tracesPath);
        return new File(tracesPath);
    }
    

    native_dumpKernelStacks调用到android_server_Watchdog.dumpKernelStacks

    doSysRq

    private void doSysRq(char c) {
        try {
            FileWritersysrq_trigger = new FileWriter("/proc/sysrq-trigger");
            sysrq_trigger.write(c);
            sysrq_trigger.close();
        } catch (IOException e) {
            Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
        }
    }
    

    通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log.

    KillProcess

    当杀死system_server进程,从而导致zygote进程自杀,进入触发init执行重启Zygote进程,这便出现了手机framwork重启现象.

    小节

    watchDog在check过程中出现了阻塞一分钟的情况,则会输出:

    1. AMS.dumpStackTraces
    • kill -3
    • backtrace.dump_backtrace()
    1. dumpKernelStackTrace,输出kernel栈信息
    2. dropBox,输出文件到/data/system/dropbox

    文章来自:
    http://www.qingpingshan.com/rjbc/az/147237.html

    相关文章

      网友评论

          本文标题:WatchDog运行原理

          本文链接:https://www.haomeiwen.com/subject/msjuqftx.html