美文网首页Android技术知识Android开发Android开发
应用与系统稳定性第二篇---ANR的监测与信息采集

应用与系统稳定性第二篇---ANR的监测与信息采集

作者: LooperJing | 来源:发表于2018-11-01 17:42 被阅读21次

    第一篇文章-应用与系统稳定性第一篇---ANR问题分析的一般套路主要梳理了几个ANR案例,讲了分析ANR问题的一般思路。本文基于Android O,深入理解ANR发生时系统的处理逻辑,加深对ANR问题的实战处理。实际处理ANR问题中,对于很多的信息还是比较陌生。比如ANR的trace是怎么输出来的?有时候为什么ANR的trace是无效的?一个进程发生了ANR的过程中又发生了ANR,系统是怎么处理的?什么是后台ANR?等等小问题如果你都能回答,那么可以不用看了。

    一、ANR的trace是如何生成的

    1、appNotResponding方法解读

    发生ANR的时候,一般先搜索ANR in获取最直观的信息,如下:

    06-16 16:16:28.590  1853  2073 E ActivityManager: ANR in com.android.camera (com.android.camera/.Camera)
    06-16 16:16:28.590  1853  2073 E ActivityManager: PID: 27661
    06-16 16:16:28.590  1853  2073 E ActivityManager: Reason: Input dispatching timed out (com.android.camera/com.android.camera.Camera, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 24.  Wait queue head age: 5511.1ms.)
    06-16 16:16:28.590  1853  2073 E ActivityManager: Load: 16.25 / 29.48 / 38.33
    06-16 16:16:28.590  1853  2073 E ActivityManager: CPU usage from 0ms to 8058ms later:
    06-16 16:16:28.590  1853  2073 E ActivityManager:   58% 291/mediaserver: 51% user + 6.7% kernel / faults: 2457 minor 4 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   27% 317/mm-qcamera-daemon: 21% user + 5.8% kernel / faults: 15965 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 288/debuggerd: 0% user + 0.3% kernel / faults: 21615 minor 87 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   17% 27661/com.android.camera: 10% user + 6.8% kernel / faults: 2412 minor 34 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   16% 1853/system_server: 10% user + 6.4% kernel / faults: 1754 minor 87 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   10% 539/sensors.qcom: 7.8% user + 2.6% kernel / faults: 16 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   4.4% 277/surfaceflinger: 1.8% user + 2.6% kernel / faults: 14 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   4% 203/mmcqd/0: 0% user + 4% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   2.6% 3510/com.android.phone: 1.9% user + 0.6% kernel / faults: 1148 minor 8 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   2.1% 2902/com.android.systemui: 1.6% user + 0.4% kernel / faults: 1272 minor 32 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   1.6% 3110/com.miui.whetstone: 1.6% user + 0% kernel / faults: 2614 minor 22 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 99/kswapd0: 0% user + 0.8% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   1.4% 217/jbd2/mmcblk0p25: 0% user + 1.4% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   1.4% 223/logd: 0.7% user + 0.7% kernel / faults: 4 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.9% 12808/kworker/0:1: 0% user + 0.9% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 35/kworker/u:2: 0% user + 0.8% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 3222/com.miui.sysbase: 0% user + 0% kernel / faults: 1314 minor 12 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.8% 3446/com.android.nfc: 0.4% user + 0.3% kernel / faults: 1223 minor 9 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.7% 10866/kworker/u:1: 0% user + 0.7% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.6% 642/mdss_fb0: 0% user + 0.6% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.6% 29336/kworker/u:7: 0% user + 0.6% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 6/kworker/u:0: 0% user + 0.4% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.4% 22924/kworker/u:6: 0% user + 0.4% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.3% 4421/mpdecision: 0% user + 0.3% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.2% 276/servicemanager: 0.1% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.2% 289/rild: 0.2% user + 0% kernel / faults: 20 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 4161/mcd: 0% user + 0% kernel / faults: 9 minor 1 major
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 3/ksoftirqd/0: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 5/kworker/0:0H: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 7/kworker/u:0H: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 215/flush-179:0: 0% user + 0% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 321/displayfeature: 0.1% user + 0% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 368/irq/33-cpubw_hw: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 403/qmuxd: 0% user + 0.1% kernel / faults: 60 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0% 3491/com.xiaomi.finddevice: 0% user + 0% kernel / faults: 706 minor
    06-16 16:16:28.590  1853  2073 E ActivityManager:   0.1% 29330/ksoftirqd/1: 0% user + 0.1% kernel
    06-16 16:16:28.590  1853  2073 E ActivityManager: 96% TOTAL: 56% user + 29% kernel + 6.3% iowait + 4.1% softirq
    

    应用在发生ANR的时候,不论是Provider,input还是BroadcastReceiver和Service处理超时等,系统都会调用appNotResponding方法,进行trace的收集,主要记录当时的案发现场,各个相关线程的状态一目了然。下面从appNotResponding入手一起看看源码是怎么写的。

    这个方法的参数有5个,如下:

    • app 当前发生ANR的进程
    • activity 发生ANR的界面
    • parent 发生ANR的界面的上一级界面
    • aboveSystem 这个参数系统中用的全部都是False,还没有看到有true的地方
    • annotation 发生ANR的原因
    776    final void appNotResponding(ProcessRecord app, ActivityRecord activity,
    777            ActivityRecord parent, boolean aboveSystem, final String annotation) {
                // 填充firstPids和lastPids数组,从最近运行进程中挑选
                // firstPids: 用于保存ANR进程及其父进程,system_server进程和persistent的进程
                // lastPids: 用于保存除firstPids外的其他进程
    778        ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
    779        SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
    780       
    781        if (mService.mController != null) {
    782            try {
    783                // 0 == continue, -1 = kill process immediately
    784                int res = mService.mController.appEarlyNotResponding(
    785                        app.processName, app.pid, annotation);
    786                if (res < 0 && app.pid != MY_PID) {
    787                    app.kill("anr", true);
    788                }
    789            } catch (RemoteException e) {
    790                mService.mController = null;
    791                Watchdog.getInstance().setActivityController(null);
    792            }
    793        }
    794
    795        long anrTime = SystemClock.uptimeMillis();
                  //更新Cpu的信息
    796        if (ActivityManagerService.MONITOR_CPU_USAGE) {
    797            mService.updateCpuStatsNow();
    798        }
    799
    800        // Unless configured otherwise, swallow ANRs in background processes & kill the process.
        //showBackground意味在后台的时候是否让ANR有感知,如果showBackground为fasle就不会弹出ANR的dialog了
    801        boolean showBackground = Settings.Secure.getInt(mContext.getContentResolver(),
    802                Settings.Secure.ANR_SHOW_BACKGROUND, 0) != 0;
    803
    804        boolean isSilentANR;
    805
    806        synchronized (mService) {
             //这些场景的ANR直接做忽略处理(关机,app已经ANR了,APP处在crash流程中,被AMS或者其他人杀死的)
    807            // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
    808            if (mService.mShuttingDown) {
    809                Slog.i(TAG, "During shutdown skipping ANR: " + app + " " + annotation);
    810                return;
    811            } else if (app.notResponding) {
    812                Slog.i(TAG, "Skipping duplicate ANR: " + app + " " + annotation);
    813                return;
    814            } else if (app.crashing) {
    815                Slog.i(TAG, "Crashing app skipping ANR: " + app + " " + annotation);
    816                return;
    817            } else if (app.killedByAm) {
    818                Slog.i(TAG, "App already killed by AM skipping ANR: " + app + " " + annotation);
    819                return;
    820            } else if (app.killed) {
    821                Slog.i(TAG, "Skipping died app ANR: " + app + " " + annotation);
    822                return;
    823            }
    824
    825            // In case we come through here for the same app before completing
    826            // this one, mark as anring now so we will bail out.
    827            app.notResponding = true;
    828
    829            // Log the ANR to the event log.常常搜索的am_anr关键字,记录ANR信息到eventlog中
    830            EventLog.writeEvent(EventLogTags.AM_ANR, app.userId, app.pid,
    831                    app.processName, app.info.flags, annotation);
    832
    833            // Dump thread traces as quickly as we can, starting with "interesting" processes.
             //ANR进程放到firstPids集合中
    834            firstPids.add(app.pid);
    835
    836            // Don't dump other PIDs if it's a background ANR
             //如果是后台的ANR,就不dump了
    837            isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app);
    838            if (!isSilentANR) {
    839                int parentPid = app.pid;
    840                if (parent != null && parent.app != null && parent.app.pid > 0) {
    841                    parentPid = parent.app.pid;
    842                }
               //把ANR的父进程也记录firstPids中
    843                if (parentPid != app.pid) firstPids.add(parentPid);
    844          //MY_PID是system_server的pid,如果发生的进程不是system,需要把system进程加到firstPids中
    845                if (MY_PID != app.pid && MY_PID != parentPid) firstPids.add(MY_PID);
    846         //最近运行的进程在mLruProcesses中,遍历这个List
    847                for (int i = mService.mLruProcesses.size() - 1; i >= 0; i--) {
    848                    ProcessRecord r = mService.mLruProcesses.get(i);
    849                    if (r != null && r.thread != null) {
    850                        int pid = r.pid;
    851                        if (pid > 0 && pid != app.pid && pid != parentPid && pid != MY_PID) {
                      //如果persistent进程,这类是“不死进程”,
                      //这类进程在启动的时候会针对目标用户进程的IApplicationThread接口,注册一个binder讣告监听器,
                      //一旦日后用户进程意外挂掉,AMS就能在第一时间感知到,
                                  //并采取相应的措施。如果AMS发现意外挂掉的应用是persistent的,它会尝试重新启动这个应用
    852                            if (r.persistent) {
    853                                firstPids.add(pid);
    854                                if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
    855                            } else if (r.treatLikeActivity) {
    856                                firstPids.add(pid);
    857                                if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
    858                            } else {
                                       //其他的都加到lastPids中
    859                                lastPids.put(pid, Boolean.TRUE);
    860                                if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
    861                            }
    862                        }
    863                    }
    864                }
    865            }
    866        }
    867
    868        // Log the ANR to the main log.输出到main log里面
    869        StringBuilder info = new StringBuilder();
    870        info.setLength(0);
    871        info.append("ANR in ").append(app.processName);
    872        if (activity != null && activity.shortComponentName != null) {
    873            info.append(" (").append(activity.shortComponentName).append(")");
    874        }
    875        info.append("\n");
    876        info.append("PID: ").append(app.pid).append("\n");
    877        if (annotation != null) {
    878            info.append("Reason: ").append(annotation).append("\n");
    879        }
    880        if (parent != null && parent != activity) {
    881            info.append("Parent: ").append(parent.shortComponentName).append("\n");
    882        }
    883  
    884        ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
    885
    886        // don't dump native PIDs for background ANRs unless it is the process of interest
    887        String[] nativeProcs = null;
    888        if (isSilentANR) {
            //NATIVE_STACKS_OF_INTEREST是Watchog中定义的几个关键的进程
    889            for (int i = 0; i < NATIVE_STACKS_OF_INTEREST.length; i++) {
    890                if (NATIVE_STACKS_OF_INTEREST[i].equals(app.processName)) {
    891                    nativeProcs = new String[] { app.processName };
    892                    break;
    893                }
    894            }
    895        } else {
    896            nativeProcs = NATIVE_STACKS_OF_INTEREST;
    897        }
    898  //获取nativeProcs这些进程的pid
    899        int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
    900        ArrayList<Integer> nativePids = null;
    901
    902        if (pids != null) {
    903            nativePids = new ArrayList<Integer>(pids.length);
    904            for (int i : pids) {
    905                nativePids.add(i);
    906            }
    907        }
    908
    909        // For background ANRs, don't pass the ProcessCpuTracker to
    910        // avoid spending 1/2 second collecting stats to rank lastPids.
                  //
    911        File tracesFile = mService.dumpStackTraces(true, firstPids,
    912                                                   (isSilentANR) ? null : processCpuTracker,
    913                                                   (isSilentANR) ? null : lastPids,
    914                                                   nativePids);
    915
    916        String cpuInfo = null;
    917        if (ActivityManagerService.MONITOR_CPU_USAGE) {
               //更新Cpu的信息
    918            mService.updateCpuStatsNow();
    919            synchronized (mService.mProcessCpuTracker) {
               // 记录ANR之前的cpu使用情况(CPU usage from xxxms to 0ms ago)
    920                cpuInfo = mService.mProcessCpuTracker.printCurrentState(anrTime);
    921            }
                // 记录从anr时间开始的Cpu负载
    922            info.append(processCpuTracker.printCurrentLoad());
    923            info.append(cpuInfo);
    924        }
    925          // 记录从anr时间开始的cpu使用情况(CPU usage from xxms to xxms later)
    926        info.append(processCpuTracker.printCurrentState(anrTime));
    927
    928        Slog.e(TAG, info.toString());
    929        if (tracesFile == null) {
    930            // 如果trace为空,则发送singal 3到发送ANR的进程,相当于执行adb shell kill -3 pid命令
    931            Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
    932        }
    933      //写到DropBox文件夹里面(data/system/dropbox)
    934        mService.addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
    935                cpuInfo, tracesFile, null);
    936
    937        if (mService.mController != null) {
    938            try {
    939                // 0 == show dialog, 1 = keep waiting, -1 = kill process immediately
    940                int res = mService.mController.appNotResponding(
    941                        app.processName, app.pid, info.toString());
    942                if (res != 0) {
    943                    if (res < 0 && app.pid != MY_PID) {
    944                        app.kill("anr", true);
    945                    } else {
    946                        synchronized (mService) {
    947                            mService.mServices.scheduleServiceTimeoutLocked(app);
    948                        }
    949                    }
    950                    return;
    951                }
    952            } catch (RemoteException e) {
    953                mService.mController = null;
    954                Watchdog.getInstance().setActivityController(null);
    955            }
    956        }
    957
    958        synchronized (mService) {
    959            mService.mBatteryStatsService.noteProcessAnr(app.processName, app.uid);
    960        //如果是后台ANR,直接kill
    961            if (isSilentANR) {
    962                app.kill("bg anr", true);
    963                return;
    964            }
    965
    966            // Set the app's notResponding state, and look up the errorReportReceiver
    967             吧   (app,
    968                    activity != null ? activity.shortComponentName : null,
    969                    annotation != null ? "ANR " + annotation : "ANR",
    970                    info.toString());
    971
    972            // 弹出无响应对话框
    973            Message msg = Message.obtain();
    974            HashMap<String, Object> map = new HashMap<String, Object>();
    975            msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
    976            msg.obj = map;
    977            msg.arg1 = aboveSystem ? 1 : 0;
    978            map.put("app", app);
    979            if (activity != null) {
    980                map.put("activity", activity);
    981            }
    982    给android.ui线程发生一个消息,弹出对话框
    983            mService.mUiHandler.sendMessage(msg);
    984        }
    985    }
    986
    

    上面的代码撸了一遍,总结一下appNotResponding的业务逻辑:

    • 1、输出ANR Reason信息到Events log中,关键字是am_anr,一般都是用这个信息来确定Anr的发生时间
    • 2、如果是后台的anr,即isSilentANR值为true,那么不显示ANR对话框,直接kill掉ANR所在进程
    • 3、输出ANR 信息到main log中,关键字是ANR in,含有Cpu负载信息,一般都是用这个看ANR发生时候的其他进程的资源占用情况
    • 4、调用dumpStackTraces去生成trace文件,这个trace文件有哪些进程呢?
      a.firstPids队列:第一个是ANR进程,第二个是system_server,剩余是所有persistent进程;
      b.Native队列:是指/system/bin/目录的mediaserver,sdcard 以及surfaceflinger进程;
      c.lastPids队列: 是指mLruProcesses中的不属于firstPids的所有进程。
    • 5、将traces文件和 CPU使用情况信息保存到dropbox,即data/system/dropbox目录
    • 6、弹窗告知用户

    除了上面的业务总结,我们发现有几个小知识点

    • 1、关机过程虽然比较耗时,但是不会出现ANR,不必多说,你说人家都关机了,还有提示发生ANR的必要了吗?
    • 2、ANR有两种,前台ANR和后台ANR,我在调试中发现,后台ANR非常容易发生,系统直接kill掉这个应用进程,但是用户没有感知。
    • 3、am_anr关键字比ANR in要先打印,所以更接近ANR的实际发生时间
    2、dumpStackTraces(1/2)方法解读

    继续看appNotResponding中的核心逻辑dumpStackTraces

    http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5492
    5483    /**
    5484     * If a stack trace dump file is configured, dump process stack traces.
    5485     * @param clearTraces causes the dump file to be erased prior to the new
    5486     *    traces being written, if true; when false, the new traces will be
    5487     *    appended to any existing file content.
    5488     * @param firstPids of dalvik VM processes to dump stack traces for first
    5489     * @param lastPids of dalvik VM processes to dump stack traces for last
    5490     * @param nativePids optional list of native pids to dump stack crawls
    5491     */
    5492    public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
    5493            ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids,
    5494            ArrayList<Integer> nativePids) {
    5495            ArrayList<Integer> extraPids = null;
    5496            ......
    5524        boolean useTombstonedForJavaTraces = false;
    5525        File tracesFile;
    5526        //如果dalvik.vm.stack-trace-dir没有配置,trace就写到全局文件data/anr/traces.txt中
    5527        final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");
    5528        if (tracesDirProp.isEmpty()) {
    5529            // When dalvik.vm.stack-trace-dir is not set, we are using the "old" trace
    5530            // dumping scheme. All traces are written to a global trace file (usually
    5531            // "/data/anr/traces.txt") so the code below must take care to unlink and recreate
    5532            // the file if requested.
    5533            //
    5534            // This mode of operation will be removed in the near future.
    5535
    5536
    5537            String globalTracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    5538            if (globalTracesPath.isEmpty()) {
                // 没有配置dalvik.vm.stack-trace-file,则返回null
    5539                Slog.w(TAG, "dumpStackTraces: no trace path configured");
    5540                return null;
    5541            }
    5542
    5543            tracesFile = new File(globalTracesPath);
    5544            try {
                //删除旧的,创建新的
    5545                if (clearTraces && tracesFile.exists()) {
    5546                    tracesFile.delete();
    5547                }
    5548
    5549                tracesFile.createNewFile();
    5550                FileUtils.setPermissions(globalTracesPath, 0666, -1, -1); // -rw-rw-rw-
    5551            } catch (IOException e) {
    5552                Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesFile, e);
    5553                return null;
    5554            }
    5555        } else {
    5556            File tracesDir = new File(tracesDirProp);
    5557            // When dalvik.vm.stack-trace-dir is set, we use the "new" trace dumping scheme.
    5558            // Each set of ANR traces is written to a separate file and dumpstate will process
    5559            // all such files and add them to a captured bug report if they're recent enough.
    5560            maybePruneOldTraces(tracesDir);
    5561
    5562            // NOTE: We should consider creating the file in native code atomically once we've
    5563            // gotten rid of the old scheme of dumping and lot of the code that deals with paths
    5564            // can be removed.
            // 创建trace文件,格式为anr_yyyy-MM-dd-HH-mm-ss-SSS
    5565            tracesFile = createAnrDumpFile(tracesDir);
    5566            if (tracesFile == null) {
    5567                return null;
    5568            }
    5569
    5570            useTombstonedForJavaTraces = true;
    5571        }
    5572
    5573        dumpStackTraces(tracesFile.getAbsolutePath(), firstPids, nativePids, extraPids,
    5574                useTombstonedForJavaTraces);
    5575        return tracesFile;
    5576    }
    

    这段代码并没有真正的执行dump操作,只是确定了trace文件路径,如果”dalvik.vm.stack-trace-dir”这个属性没有配置,就使用旧的dump策略,trace信息写入到全局trace文件/data/anr/traces.txt中,删除旧的traces文件,创建新的traces文件;如果”dalvik.vm.stack-trace-dir”属性被配置了,就创建格式为anr_yyyy-MM-dd-HH-mm-ss-SSS的文件。

    dalvik.vm.stack-trace-file特性
    dalvik.vm.stack-trace-file特性允许你指定要将线程堆栈追踪写入的文件名,如果不存在,将创建,新的信息将追加到文件尾,文件名通过-Xstacktracefile参数写入虚拟机。例如:
    adb shell setprop dalvik.vm.stack-trace-file /tmp/stack-traces.txt
    如果这个特性没有被定义,虚拟机会在收到这个信号时将堆栈追踪信息写入android log。

    http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5581
    5581    private static synchronized File createAnrDumpFile(File tracesDir) {
    5582        if (sAnrFileDateFormat == null) {
    5583            sAnrFileDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss-SSS");
    5584        }
    5585
    5586        final String formattedDate = sAnrFileDateFormat.format(new Date());
    5587        final File anrFile = new File(tracesDir, "anr_" + formattedDate);
    5588
    5589        try {
    5590            if (anrFile.createNewFile()) {
    5591                FileUtils.setPermissions(anrFile.getAbsolutePath(), 0600, -1, -1); // -rw-------
    5592                return anrFile;
    5593            } else {
    5594                Slog.w(TAG, "Unable to create ANR dump file: createNewFile failed");
    5595            }
    5596        } catch (IOException ioe) {
    5597            Slog.w(TAG, "Exception creating ANR dump file:", ioe);
    5598        }
    5599
    5600        return null;
    5601    }
    5602
    
    3、dumpStackTraces(2/2)方法解读
    http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5704
    5704    private static void dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids,
    5705            ArrayList<Integer> nativePids, ArrayList<Integer> extraPids,
    5706            boolean useTombstonedForJavaTraces) {
    5707
    5708        // We don't need any sort of inotify based monitoring when we're dumping traces via
    5709        // tombstoned. Data is piped to an "intercept" FD installed in tombstoned so we're in full
    5710        // control of all writes to the file in question.
    5711        final DumpStackFileObserver observer;
    5712        if (useTombstonedForJavaTraces) {
    5713            observer = null;
    5714        } else {
    5715            // Use a FileObserver to detect when traces finish writing.
    5716            // The order of traces is considered important to maintain for legibility.
    5717            observer = new DumpStackFileObserver(tracesFile);
    5718        }
    5719
    5720        // 所有的dump操作需要在20秒之内完成
    5721        long remainingTime = 20 * 1000;
    5722        try {
    5723            if (observer != null) {
    5724                observer.startWatching();
    5725            }
    5726
    5727            // 第一步收集firstPids中进程的trace,超时会自动放弃
    5728            if (firstPids != null) {
    5729                int num = firstPids.size();
    5730                for (int i = 0; i < num; i++) {
    5731                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for pid "
    5732                            + firstPids.get(i));
    5733                    final long timeTaken;
    5734                    if (useTombstonedForJavaTraces) {
    5735                        timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime);
    5736                    } else {
    5737                        timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime);
    5738                    }
    5739
    5740                    remainingTime -= timeTaken;
    5741                    if (remainingTime <= 0) {
    5742                        Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + firstPids.get(i) +
    5743                            "); deadline exceeded.");
    5744                        return;
    5745                    }
    5746
    5747                    if (DEBUG_ANR) {
    5748                        Slog.d(TAG, "Done with pid " + firstPids.get(i) + " in " + timeTaken + "ms");
    5749                    }
    5750                }
    5751            }
    5752
    5753              // 第二步收集nativePids中进程的trace,超时会自动放弃
    5754            if (nativePids != null) {
    5755                for (int pid : nativePids) {
    5756                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for native pid " + pid);
    5757                    final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime);
    5758
    5759                    final long start = SystemClock.elapsedRealtime();
    5760                    Debug.dumpNativeBacktraceToFileTimeout(
    5761                            pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
    5762                    final long timeTaken = SystemClock.elapsedRealtime() - start;
    5763
    5764                    remainingTime -= timeTaken;
    5765                    if (remainingTime <= 0) {
    5766                        Slog.e(TAG, "Aborting stack trace dump (current native pid=" + pid +
    5767                            "); deadline exceeded.");
    5768                        return;
    5769                    }
    5770
    5771                    if (DEBUG_ANR) {
    5772                        Slog.d(TAG, "Done with native pid " + pid + " in " + timeTaken + "ms");
    5773                    }
    5774                }
    5775            }
    5776
    5777            //第三步收集extraPids中进程的trace,超时会自动放弃,这个代码和第一步一样的
    5778            if (extraPids != null) {
    5779                for (int pid : extraPids) {
    5780                    if (DEBUG_ANR) Slog.d(TAG, "Collecting stacks for extra pid " + pid);
    5781
    5782                    final long timeTaken;
    5783                    if (useTombstonedForJavaTraces) {
    5784                        timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
    5785                    } else {
    5786                        timeTaken = observer.dumpWithTimeout(pid, remainingTime);
    5787                    }
    5788
    5789                    remainingTime -= timeTaken;
    5790                    if (remainingTime <= 0) {
    5791                        Slog.e(TAG, "Aborting stack trace dump (current extra pid=" + pid +
    5792                                "); deadline exceeded.");
    5793                        return;
    5794                    }
    5795
    5796                    if (DEBUG_ANR) {
    5797                        Slog.d(TAG, "Done with extra pid " + pid + " in " + timeTaken + "ms");
    5798                    }
    5799                }
    5800            }
    5801        } finally {
    5802            if (observer != null) {
    5803                observer.stopWatching();
    5804            }
    5805        }
    5806    }
    
    

    firstPids和extraPids中的进程dump trace有两种方式,看useTombstonedForJavaTraces的取值;

    • 如果useTombstonedForJavaTraces为真,采用dumpJavaTracesTombstoned获取trace
    • 如果useTombstonedForJavaTraces为假,采用DumpStackFileObserver#dumpWithTimeout获取trace

    对于nativePids中的进程,只能通过 Debug.dumpNativeBacktraceToFileTimeout来获取trace。

    除了上面的总结还需要思考几个小问题

    • 问题1: 利用am_anr来确定ANR的发生时间真的准确吗?
      在上一篇文章就已经说了,不准确,可能由于系统资源比较紧张造成一定程度的滞后,十多秒都可能,此时需要结合eventlog进一步分析。

    问题2:ANR中经常出现没有trace,或者无效trace是怎么回事?

    • 从appNotResponding方法中,我们看到有不少return的地方,在dumpStackTraces方法,也会有return出现,比如这么多进程的trace需要在20秒之内完成,在系统资源紧张的时候,能不能完成,需要打一个问号了
    • 有一些ANR问题需要binder对端的trace,通过上面的分析看,大概率是没有对端的trace的,除非binder call到system_server或者比较重要的几个native进程里面
    • 另外发生ANR的时候,线程并没有暂停掉,有可能等到我们去抓这个进程的trace的时候,这个进程的主线程都恢复正常了,比如经常性出现下面的nativePollOnce状态的堆栈。
    "main" prio=5 tid=1 Native
     | group="main" sCount=1 dsCount=0 obj=0x7685ba50 self=0x7fb0840a00
     | sysTid=1460 nice=-10 cgrp=default sched=0/0 handle=0x7fb4824aa0
     | state=S schedstat=( 213445473008 99617457476 724519 ) utm=14518 stm=6826 core=0 HZ=100
     | stack=0x7fe82c6000-0x7fe82c8000 stackSize=8MB
     | held mutexes=
     kernel: __switch_to+0x70/0x7c
     kernel: SyS_epoll_wait+0x2d4/0x394
     kernel: SyS_epoll_pwait+0xc4/0x150
     kernel: el0_svc_naked+0x24/0x28
     native: #00 pc 000000000001bf2c  /system/lib64/libc.so (syscall+28)
     native: #01 pc 00000000000e813c  /system/lib64/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+160)
     native: #02 pc 000000000054b6f8  /system/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE+328)
     native: #03 pc 000000000054b56c  /system/lib64/libart.so (_ZN3art12JniMethodEndEjPNS_6ThreadE+28)
     native: #04 pc 00000000008ccc50  /system/framework/arm64/boot-framework.oat (Java_android_os_MessageQueue_nativePollOnce__JI+156)
     at android.os.MessageQueue.nativePollOnce(Native method)
     at android.os.MessageQueue.next(MessageQueue.java:323)
     at android.os.Looper.loop(Looper.java:142)
     at com.android.server.SystemServer.run(SystemServer.java:387)
     at com.android.server.SystemServer.main(SystemServer.java:245)
     at java.lang.reflect.Method.invoke!(Native method)
     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:904)
     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:794)
    
    • 问题3: ANR中的dumpStackTraces是耗时操作,会不会引起系统的ANR?
      我觉得是有可能的,而是在系统稳定性很差的情况下,大量应用都ANR了,都需要dumpStackTraces,那么都有引起系统watchdog的可能。AOSP在P上有多个changelist对此进行优化,有兴趣可以自己查看。
    3.1、DumpStackFileObserver方式获取trace
    http://androidxref.com/8.1.0_r33/xref/frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java#5631
    5625    /**
    5626     * Legacy code, do not use. Existing users will be deleted.
    5627     *
    5628     * @deprecated
    5629     */
    5630    @Deprecated
    5631    public static class DumpStackFileObserver extends FileObserver {
    5632        // Keep in sync with frameworks/native/cmds/dumpstate/utils.cpp
    5633        private static final int TRACE_DUMP_TIMEOUT_MS = 10000; // 10 seconds
    5634
    5635        private final String mTracesPath;
    5636        private boolean mClosed;
    5637
    5638        public DumpStackFileObserver(String tracesPath) {
    5639            super(tracesPath, FileObserver.CLOSE_WRITE);
    5640            mTracesPath = tracesPath;
    5641        }
    5642
    5643        @Override
    5644        public synchronized void onEvent(int event, String path) {
    5645            mClosed = true;
    5646            notify();
    5647        }
    5648
    5649        public long dumpWithTimeout(int pid, long timeout) {
               //发生singal开输出trace,相当于adb shell kill -3
    5650            sendSignal(pid, SIGNAL_QUIT);
    5651            final long start = SystemClock.elapsedRealtime();
    5652            // timeout为20s,TRACE_DUMP_TIMEOUT_MS为10s,取小则为10s
    5653            final long waitTime = Math.min(timeout, TRACE_DUMP_TIMEOUT_MS);
    5654            synchronized (this) {
    5655                try {
    5656                    wait(waitTime); // Wait for traces file to be closed.
    5657                } catch (InterruptedException e) {
    5658                    Slog.wtf(TAG, e);
    5659                }
    5660            }
    5661
    5662            // This avoids a corner case of passing a negative time to the native
    5663            // trace in case we've already hit the overall timeout.
    5664            final long timeWaited = SystemClock.elapsedRealtime() - start;
    5665            if (timeWaited >= timeout) {
    5666                return timeWaited;
    5667            }
    5668
    .......
    5683            return (end - start);
    5684        }
    5685    }
    

    这个方法将会被废弃了,原理上就是用adb shell kill -3抓trace

    3.2、dumpJavaTracesTombstoned方式获取trace
    5687    /**
    5688     * Dump java traces for process {@code pid} to the specified file. If java trace dumping
    5689     * fails, a native backtrace is attempted. Note that the timeout {@code timeoutMs} only applies
    5690     * to the java section of the trace, a further {@code NATIVE_DUMP_TIMEOUT_MS} might be spent
    5691     * attempting to obtain native traces in the case of a failure. Returns the total time spent
    5692     * capturing traces.
    5693     */
    5694    private static long dumpJavaTracesTombstoned(int pid, String fileName, long timeoutMs) {
    5695        final long timeStart = SystemClock.elapsedRealtime();
    5696        if (!Debug.dumpJavaBacktraceToFileTimeout(pid, fileName, (int) (timeoutMs / 1000))) {
    5697            Debug.dumpNativeBacktraceToFileTimeout(pid, fileName,
    5698                    (NATIVE_DUMP_TIMEOUT_MS / 1000));
    5699        }
    5700
    5701        return SystemClock.elapsedRealtime() - timeStart;
    5702    }
    
    

    实质上还是由dumpJavaBacktraceToFileTimeout来获取trace。dumpNativeBacktraceToFileTimeout是一个native方法,下面来看一看

    1131static jboolean android_os_Debug_dumpNativeBacktraceToFileTimeout(JNIEnv* env, jobject clazz,
    1132        jint pid, jstring fileName, jint timeoutSecs) {
    1133    const bool ret = dumpTraces(env, pid, fileName, timeoutSecs, kDebuggerdNativeBacktrace);
    1134    return ret ? JNI_TRUE : JNI_FALSE;
    1135}
    
    1107static bool dumpTraces(JNIEnv* env, jint pid, jstring fileName, jint timeoutSecs,
    1108                       DebuggerdDumpType dumpType) {
    1109    const ScopedUtfChars fileNameChars(env, fileName);
    1110    if (fileNameChars.c_str() == nullptr) {
    1111        return false;
    1112    }
    1113 //打开 /data/anr/anr_yyyy-MM-dd-HH-mm-ss-SSS文件
    1114    android::base::unique_fd fd(open(fileNameChars.c_str(),
    1115                                     O_CREAT | O_WRONLY | O_NOFOLLOW | O_CLOEXEC | O_APPEND,
    1116                                     0666));
    1117    if (fd < 0) {
    1118        fprintf(stderr, "Can't open %s: %s\n", fileNameChars.c_str(), strerror(errno));
    1119        return false;
    1120    }
    1121
    1122    return (dump_backtrace_to_file_timeout(pid, dumpType, timeoutSecs, fd) == 0);
    1123}
    
    int dump_backtrace_to_file_timeout(pid_t tid, DebuggerdDumpType dump_type, int timeout_secs,   int fd) {
     android::base::unique_fd copy(dup(fd));
     if (copy == -1) {
       return -1;
     }
     int timeout_ms = timeout_secs > 0 ? timeout_secs * 1000 : 0;
     return debuggerd_trigger_dump(tid, dump_type, timeout_ms, std::move(copy)) ? 0 : -1;
    }
    

    可以看到实质是调用debuggerd_trigger_dump,可以看debuggerd_client.cpp,可向debuggerd发送命令,利用debuggerd抓trace。

    总结:在dumpStackTraces方法中,主要根据useTombstonedForJavaTraces这个属性来设置不同的抓取trace的方式,java的进程实质上是使用adb shell kill -3来抓trace,对于native的进程实质上是使用debuggerd(debuggerd -b [pid])抓trace。而useTombstonedForJavaTraces的值是false还是true取决与"dalvik.vm.stack-trace-dir"这个属性在开机的时候是否设置了。

    相关文章

      网友评论

      本文标题:应用与系统稳定性第二篇---ANR的监测与信息采集

      本文链接:https://www.haomeiwen.com/subject/zuhkzftx.html