美文网首页
APP-ANR研究

APP-ANR研究

作者: 南歌ccc | 来源:发表于2020-07-23 12:32 被阅读0次

    场景

    • 开发期间或者本地测试机器发生的ANR信息

      • trace文件
      • bugreport文件
    • 线上应用获取用户使用过程中发生的ANR信息

      • 集成框架:Matrix、BlockCanary、ANR-WatchDog、xCrash等配合服务器收集等等
      • FileObserver和WatchDog机制去实现,受限于Android系统的SELinux机制,5.0以后基本已经使低权限应用无法监听到trace文件(系统应用>签名应用>第三方应用)
        对比了一些比较热门的实现方案: 优劣点比较看表格(表格出自腾讯云+社区)
        性能监控框架.jpg

    https://github.com/Tencent/Matrix
    https://github.com/markzhai/AndroidPerformanceMonitor
    https://github.com/SalomonBrys/ANR-WatchDog

    • DropBox
      • 简介:
        为了记录历史的logcat日志,Android引入了Dropbox,将历史日志持久化到磁盘中(/data/system/dropbox)。 logcat的缓冲区大小毕竟是有限的,所以需要循环利用,这样历史的日志信息就会被冲掉。在一些自动化测试的场景下,譬如Monkey需要长时间的运行, 就需要把历史的日志全都保存下来。
        代码中使用DropboxManager来记录需要相对持久化存储的错误日志信息,理论上也是一种可行方式。 没用过,待验证。

    traces文件

    • 简介
      当发生ANR时,系统会将相应的堆栈信息记录在traces文件里,且只会保留最后一次发生ANR时的traces信息。
      Android 5.0 以下,记录在 /data/anr/traces.txt 文件中。
      Android 5.0 以上,记录在 /data/anr/anr_* 等文件。

    • 获取方式

      • 通过adb命令 adb pull /data/anr (需要root权限)
      • 如果无法获取root权限,使用adb bugreport获取的bugreport压缩包里也带有anr trace文件和tombstones文件(需要打开开发者选项)
    • 日志解读

      • 如果trace日志文件有效:
        可以获取很多有用的信息,比如直接的ANR的callstack会直接打印出来,如果打印出来的callstack并不能定位问题所在,可以继续排查发生ANR的进程、线层,时间范围等线索,去查看当时的Log,验证问题。
      • trace日志文件无效,被新的ANR覆盖等:
        那么便无法通过trace文件获取有效信息,因为dump出来的trace文件已经丢失ANR现场,只能尝试去解读log分析,或者等待复现获取有效trace文件。

    例子:

    ----- pid 16271 at 2020-07-22 14:39:25 -----   // 发生ANR的进程pid和时间,与log文件里的需要对应起来,否则便不是目标treace堆栈,可能被覆盖了
    Cmd line: com.example.myexoplayervideodemo     // 发生ANR的进程名
    Build fingerprint: 'Xiaomi/cepheus/cepheus:10/QKQ1.190825.002/V11.0.9.0.QFACNXM:user/release-keys'   // 接下来一堆都是系统版本,内存等状态信息
    ABI: 'arm64'
    Build type: optimized
    Zygote loaded classes=8182 post zygote classes=562
    Dumping registered class loaders
    #0 dalvik.system.PathClassLoader: [], parent #1
    #1 java.lang.BootClassLoader: [], no parent
    #2 dalvik.system.PathClassLoader: [/system/framework/tcmclient.jar], parent #0
    #3 dalvik.system.PathClassLoader: [], parent #0
    #4 dalvik.system.PathClassLoader: [/data/app/com.example.myexoplayervideodemo-qBeUzwMwbXCPAFlijKmhvA==/base.apk:/data/app/com.example.myexoplayervideodemo-qBeUzwMwbXCPAFlijKmhvA==/base.apk!classes2.dex], parent #1
    #5 dalvik.system.PathClassLoader: [/system/app/MiuiContentCatcher/MiuiContentCatcher.apk], parent #1
    #6 dalvik.system.PathClassLoader: [/system/app/CatcherPatch/CatcherPatch.apk], parent #1
    ...
    
    suspend all histogram:  Sum: 50.298ms 99% C.I. 1.311us-1228.373us Avg: 44.868us Max: 14539us
    DALVIK THREADS (14):
    "Signal Catcher" daemon prio=5 tid=7 Runnable
      | group="system" sCount=0 dsCount=0 flags=0 obj=0x13900c38 self=0x7434399400
      | sysTid=16308 nice=0 cgrp=default sched=0/0 handle=0x7440d1dd50
      | state=R schedstat=( 5896410 661457 17 ) utm=0 stm=0 core=6 HZ=100
      | stack=0x7440c27000-0x7440c29000 stackSize=991KB
      | held mutexes= "mutator lock"(shared held)
      native: #00 pc 000000000041098c  /apex/com.android.runtime/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+140)
      native: #01 pc 00000000004f80e4  /apex/com.android.runtime/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+512)
      native: #02 pc 0000000000512a20  /apex/com.android.runtime/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+828)
      native: #03 pc 000000000050b844  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+456)
      native: #04 pc 000000000050ad28  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool)+1964)
      native: #05 pc 000000000050a408  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+844)
      native: #06 pc 00000000004c581c  /apex/com.android.runtime/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+200)
      native: #07 pc 00000000004d9c54  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1352)
      native: #08 pc 00000000004d8d00  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::Run(void*)+252)
      native: #09 pc 00000000000d6cb0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36)
      native: #10 pc 0000000000074eac  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
      (no managed stack frames)
    
    "main" prio=5 tid=1 Native   // // 线程名,优先级,线程号,线程状态
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x75b073c8 self=0x744c869c00  // 线程组名,被挂起次数,线程对象的地址等
      | sysTid=16271 nice=-10 cgrp=default sched=0/0 handle=0x74d2a44ed0     // sysTid是Linux下的内核线程id
      | state=S schedstat=( 3322210770 125918324 2742 ) utm=89 stm=243 core=4 HZ=100
      | stack=0x7ffbb28000-0x7ffbb2a000 stackSize=8192KB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/16271/stack)    // 下面都是调用栈信息,头部几行是native日志,可以通过tombstone一文介绍的方法进行解读定位代码行数
      native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
      native: #01 pc 000000000014b1f4  /apex/com.android.runtime/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
      native: #02 pc 000000000058f958  /apex/com.android.runtime/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE.llvm.1098154004112041116+440)
      native: #03 pc 000000000058f75c  /apex/com.android.runtime/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
      at libcore.io.Linux.fdatasync(Native method)
      at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
      at libcore.io.BlockGuardOs.fdatasync(BlockGuardOs.java:167)
      at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
      at android.system.Os.fdatasync(Os.java:168)
      at java.io.RandomAccessFile.maybeSync(RandomAccessFile.java:306)
      at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:547)
      at java.io.RandomAccessFile.write(RandomAccessFile.java:559)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.writeTxtToFile(AnrTestActivity.java:99)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$100(AnrTestActivity.java:24)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$3.onNext(AnrTestActivity.java:66)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$3.onNext(AnrTestActivity.java:50)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmitScalar(ObservableFlatMap.java:234)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:146)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
      at io.reactivex.internal.operators.observable.ObservableBuffer$BufferExactObserver.onNext(ObservableBuffer.java:113)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmit(ObservableFlatMap.java:262)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$InnerObserver.onNext(ObservableFlatMap.java:554)
      at io.reactivex.observers.SerializedObserver.onNext(SerializedObserver.java:111)
      at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver$InnerObserver.onNext(ObservableConcatMap.java:249)
      at io.reactivex.internal.operators.observable.ObservableScalarXMap$ScalarDisposable.run(ObservableScalarXMap.java:246)
      at io.reactivex.internal.operators.observable.ObservableJust.subscribeActual(ObservableJust.java:35)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver.drain(ObservableConcatMap.java:223)
      at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver.onSubscribe(ObservableConcatMap.java:103)
      at io.reactivex.internal.operators.observable.ObservableFromIterable.subscribeActual(ObservableFromIterable.java:54)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableConcatMap.subscribeActual(ObservableConcatMap.java:52)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:162)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmitScalar(ObservableFlatMap.java:234)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:146)
      at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
      at io.reactivex.internal.operators.observable.ObservableFromArray$FromArrayDisposable.run(ObservableFromArray.java:105)
      at io.reactivex.internal.operators.observable.ObservableFromArray.subscribeActual(ObservableFromArray.java:35)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableBuffer.subscribeActual(ObservableBuffer.java:44)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
      at io.reactivex.Observable.subscribe(Observable.java:10685)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.createIoAnr(AnrTestActivity.java:50)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$000(AnrTestActivity.java:24)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$1.onClick(AnrTestActivity.java:34)
      at android.view.View.performClick(View.java:7163)
      at android.view.View.performClickInternal(View.java:7140)
      at android.view.View.access$3500(View.java:813)
      at android.view.View$PerformClick.run(View.java:27428)
      at android.os.Handler.handleCallback(Handler.java:883)
      at android.os.Handler.dispatchMessage(Handler.java:100)
      at android.os.Looper.loop(Looper.java:224)
      at android.app.ActivityThread.main(ActivityThread.java:7520)
      at java.lang.reflect.Method.invoke(Native method)
      at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:539)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:950)
    
    

    bugreport

    • 简介
      bugreport是android系统自带的一套日志收集机制,bugreport信息量非常之大,几乎涵盖整个系统各个层面内容,对于分析BUG是一大利器。

    • 获取方式
      adb bugreport
      会得到一个压缩文件,里面有很多文件,最重要的一个文件便是bugreport.txt,里面包含了巨量日志,当分析问题没有线索的时候,也可以考虑去看下这个文件,比较全面。

    • 日志解读

      • anr log:
        发生ANR时,导出并打开bugreport.txt文件,搜索关键字"VM TRACES AT LAST ANR",一般就能定位到如下的位置,看看进程名和时间跟你的进程对上不,对上那就OK了,顺着往下翻就是这次ANR的trace信息了。
    ------ VM TRACES AT LAST ANR (/data/anr/anr_2020-07-22-14-39-25-303: 2020-07-22 14:39:36) ------
    
    ----- pid 16271 at 2020-07-22 14:39:25 -----
    Cmd line: com.example.myexoplayervideodemo
    Build fingerprint: 'Xiaomi/cepheus/cepheus:10/QKQ1.190825.002/V11.0.9.0.QFACNXM:user/release-keys'
    ABI: 'arm64'
    Build type: optimized
    Zygote loaded classes=8182 post zygote classes=562
    Dumping registered class loaders
    #0 dalvik.system.PathClassLoader: [], parent #1
    #1 java.lang.BootClassLoader: [], no parent
    #2 dalvik.system.PathClassLoader: [/system/framework/tcmclient.jar], parent #0
    
    ...
    
    "main" prio=5 tid=1 Native
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x75b073c8 self=0x744c869c00
      | sysTid=16271 nice=-10 cgrp=default sched=0/0 handle=0x74d2a44ed0
      | state=S schedstat=( 3322210770 125918324 2742 ) utm=89 stm=243 core=4 HZ=100
      | stack=0x7ffbb28000-0x7ffbb2a000 stackSize=8192KB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/16271/stack)
      native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
      native: #01 pc 000000000014b1f4  /apex/com.android.runtime/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
      native: #02 pc 000000000058f958  /apex/com.android.runtime/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE.llvm.1098154004112041116+440)
      native: #03 pc 000000000058f75c  /apex/com.android.runtime/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
      at libcore.io.Linux.fdatasync(Native method)
      at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
      at libcore.io.BlockGuardOs.fdatasync(BlockGuardOs.java:167)
      at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
      at android.system.Os.fdatasync(Os.java:168)
      at java.io.RandomAccessFile.maybeSync(RandomAccessFile.java:306)
      at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:547)
      at java.io.RandomAccessFile.write(RandomAccessFile.java:559)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.writeTxtToFile(AnrTestActivity.java:99)
      at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$100(AnrTestActivity.java:24)
    

    像这个ANR,提示很清晰,也很容易看出来,是IO问题,那么这时我们可以去看一下发生ANR这个时间段前后的CPU使用情况以及IO.
    接下来去看下 【14.39.25】 前后时间内的CPU使用情况: 如果没有cpu日志,则需要去看ANR时间段附近的log,过滤筛选查找线索。
    (由于这个是Demo模拟发生的IO异常,callstack直接就可以定位到IO异常的代码行数了,不过我们这边做只是用来说明下这种问题的排查方法)

    • cpu log:
      AMS里打印cpu相关的log的代码:
    EventLog.writeEvent(EventLoTags.CPU,
        ((user+system+iowait+irq+softIrq) * 100) / total,
        (user * 100) / total,
        (system * 100) / total,
        (iowait * 100) / total,
        (irq * 100) / total,
        (softIrq * 100) /total);
    

    5.4% TOTAL: 2.6% user + 2.1% kernel + 0.1% iowait + 0.3% irq + 0.1% softirq

    07-22 14:39:36.217  1000  1607  1770 I system_server: libdebuggerd_client: done dumping process 29355
    07-22 14:39:36.223  1000  1607  1770 I ActivityManager: Done dumping
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: ANR in com.example.myexoplayervideodemo (com.example.myexoplayervideodemo/.anrdemo.AnrTestActivity)
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: PID: 16271
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Reason: Input dispatching timed out (com.example.myexoplayervideodemo/com.example.myexoplayervideodemo.anrdemo.AnrTestActivity, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 19.  Wait queue head age: 9042.8ms.)
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Parent: com.example.myexoplayervideodemo/.anrdemo.AnrTestActivity
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Load: 0.0 / 0.0 / 0.0
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager: CPU usage from 0ms to 11562ms later (2020-07-22 14:39:24.661 to 2020-07-22 14:39:36.223):
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.6% 1198/media.codec: 1.1% user + 0.5% kernel / faults: 82688 minor 74 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   28% 16271/com.example.myexoplayervideodemo: 2.4% user + 25% kernel / faults: 3490 minor 22 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   9.1% 1607/system_server: 4.1% user + 5% kernel / faults: 7695 minor 45 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   7.7% 873/surfaceflinger: 0.4% user + 7.3% kernel / faults: 480 minor
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   7.6% 29355/com.netease.cloudmusic: 4.3% user + 3.3% kernel / faults: 3761 minor 30 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   5.8% 26832/com.xiaomi.smarthome: 3.3% user + 2.5% kernel / faults: 2441 minor 66 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   5.2% 637/jbd2/sda31-8: 0% user + 5.2% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   3% 147/kswapd0: 0% user + 3% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0.1% 10301/kworker/u16:2: 0% user + 0.1% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.9% 9564/kworker/u16:4: 0% user + 2.9% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0% 769/android.hardware.camera.provider@2.4-service_64: 0% user + 0% kernel / faults: 53 minor 18 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.8% 9364/kworker/u16:6: 0% user + 2.8% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0.1% 1138/media.extractor: 0% user + 0% kernel / faults: 10157 minor 11 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.4% 11519/kworker/u16:10: 0% user + 2.4% kernel
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.4% 12915/com.xunmeng.pinduoduo:titan: 1.8% user + 0.6% kernel / faults: 905 minor
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.9% 2770/com.android.phone: 1.1% user + 0.7% kernel / faults: 3830 minor 35 major
    07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.9% 13728/kworker/u16:0: 0% user + 1.9% kernel
    ...
    
    

    log打印了ANR的基本信息,我们可以分析CPU使用率推测ANR发生的时候设备在做什么工作;如果CPU使用率很高,接近100%,可能是在进行大规模的计算更可能是陷入死循环;如果CUP使用率很低,说明主线程被阻塞了,并且当IOwait很高,可能是主线程在等待I/O操作的完成.
    此处CPU使用率不高,这里我们可以排除CPU使用率高的情况,怀疑是主线程被阻塞,IO异常就是造成主线程阻塞的原因之一,也和我们之前的ANR callstack呼应。

    关注ANR的reason字段对分析ANR问题也有很大帮助,此处:
    Reason: Input dispatching timed out (com.example.myexoplayervideodemo/com.example.myexoplayervideodemo.anrdemo.AnrTestActivity, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 19. Wait queue head age: 9042.8ms.)
    输入事件未响应导致ANR,也印证了主线程阻塞,结合callstack或者cpu相关的log,就能大致知道发生ANR的原因了。

    对于ANR只是分析Log很难知道问题所在,我们还需要通过Trace文件分析stack调用情况,两者结合去分析。
    当我们在bugreport.txt里没找到想要的callstack信息,就去trace文件里找看看,反之亦然。

    • ANR的问题千奇百怪,具体问题具体分析,经验总结看《ANR&Watchdog分析总结》一文。
    • bugreport还有一些其他可视化的阅读工具:
      • Battery Historian的
        由于bugreport.txt涵盖了太多的数据,解读起来十分困难,因此google针对android 5.0(api 21)以上的系统开发了一个叫做battery historian的分析工具,这个工具就是用来解析这个txt文本文件,然后使用web图形的形式展现出来,这样出来的效果更加人性化,更加可读.
        使用方法:
        https://github.com/google/battery-historian

      • ChkBugReport
        Sony开源的ChkBugReport
        https://github.com/sonyxperiadev/ChkBugReport

    PS: 顺便介绍下Watchdog

    Watchdog

    • 简介
      Android中的watchdog功能,用于监视系统的运行,确保一些重要的系统服务要保持正常运行,当出现故障时(死锁),通常会让Android系统重启。由于这种机制的存在,保证了Android系统正常稳定运行。
      Watchdog可以理解为System Not Responding。(有点类似ANR,只是属于系统层面)

    • 获取方式
      同ANR,也是去获取trace文件等
      bugreport、tombstone、logcat、trace等

    • 解读
      Watchdog相关的问题需要大量日志进行验证才能去定位问题,对日志要求比较高。

    例子:

    • 发生Watchdog的时候会有如下日志:
    05-23 07:34:45.038  1000  1655  1856 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on display thread (android.display), Blocked in handler on ActivityManager (ActivityManager)
    05-23 07:34:45.040  1000  1655  1856 W Watchdog: android.fg annotated stack trace:
    05-23 07:34:45.040  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:16218)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at android.app.ContextImpl.sendOrderedBroadcastAsUser(ContextImpl.java:1317)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at android.app.ContextImpl.sendOrderedBroadcastAsUser(ContextImpl.java:1279)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService.sendDataActivityBroadcast(ConnectivityService.java:2372)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService.access$600(ConnectivityService.java:245)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService$4.interfaceClassDataActivityChanged(ConnectivityService.java:1714)
    05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.lambda$notifyInterfaceClassActivity$5(NetworkManagementService.java:458)
    05-23 07:34:45.044  1000  1655  1856 W Watchdog:     at com.android.server.-$$Lambda$NetworkManagementService$D43p3Tqq7B3qaMs9AGb_3j0KZd0.sendCallback(Unknown Source:6)
    05-23 07:34:45.044  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.invokeForAllObservers(NetworkManagementService.java:358)
    05-23 07:34:45.044  1000  1655  1856 W Watchdog:     - locked <0x037929aa> (a android.os.RemoteCallbackList)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.notifyInterfaceClassActivity(NetworkManagementService.java:458)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.access$1200(NetworkManagementService.java:122)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService$NetdUnsolicitedEventListener.lambda$onInterfaceClassActivityChanged$0$NetworkManagementService$NetdUnsolicitedEventListener(NetworkManagementService.java:705)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.-$$Lambda$NetworkManagementService$NetdUnsolicitedEventListener$0xWa9DGxTnoGVHppsM-nng2PygE.run(Unknown Source:10)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
    05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
    05-23 07:34:45.046  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
    05-23 07:34:45.048  1000  1655  1856 W Watchdog: main annotated stack trace:
    05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.isAppBad(ActivityManagerService.java:6181)
    05-23 07:34:45.048  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
    05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.access$2000(ActivityManagerService.java:426)
    05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$LocalService.isAppBad(ActivityManagerService.java:19344)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.JobSchedulerService.isComponentUsable(JobSchedulerService.java:2353)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.JobSchedulerService.areComponentsInPlaceLocked(JobSchedulerService.java:2400)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.StateController.wouldBeReadyWithConstraintLocked(StateController.java:130)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.TimeController.checkExpiredDelaysAndResetAlarm(TimeController.java:325)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     - locked <0x0a8d85de> (a java.lang.Object)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.TimeController$2.onAlarm(TimeController.java:451)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.app.AlarmManager$ListenerWrapper.run(AlarmManager.java:252)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:585)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:381)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
    05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:919)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog: android.display annotated stack trace:
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$LocalService.updateCpuStats(ActivityManagerService.java:18954)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.wm.-$$Lambda$LYW1ECaEajjYgarzgKdTZ4O1fi0.run(Unknown Source:2)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
    05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog: ActivityManager annotated stack trace:
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3877)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:1944)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:107)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
    05-23 07:34:45.053  1000  1655  1856 W Watchdog: *** GOODBYE!
    

    从上面的log可以看到,由于在等待锁(0x00604b97)导致了Watchdog,咱们继续追踪这个锁(0x00604b97)是怎么回事:
    在trace文件里找到下面这个log:

    "main" prio=5 tid=1 Blocked
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x73039fc8 self=0x79b89e8c00
      | sysTid=1655 nice=-2 cgrp=default sched=0/0 handle=0x79b9f4ded0
      | state=S schedstat=( 339526112412 184124053201 1109981 ) utm=21067 stm=12885 core=4 HZ=100
      | stack=0x7ff8d71000-0x7ff8d73000 stackSize=8192KB
      | held mutexes=
      at com.android.server.am.ActivityManagerService.isAppBad(ActivityManagerService.java:6181)
      - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService) held by thread 138
      at com.android.server.am.ActivityManagerService.access$2000(ActivityManagerService.java:426)
      at com.android.server.am.ActivityManagerService$LocalService.isAppBad(ActivityManagerService.java:19344)
      at com.android.server.job.JobSchedulerService.isComponentUsable(JobSchedulerService.java:2353)
      at com.android.server.job.JobSchedulerService.areComponentsInPlaceLocked(JobSchedulerService.java:2400)
      at com.android.server.job.controllers.StateController.wouldBeReadyWithConstraintLocked(StateController.java:130)
      at com.android.server.job.controllers.TimeController.checkExpiredDelaysAndResetAlarm(TimeController.java:325)
      - locked <0x0a8d85de> (a java.lang.Object)
      at com.android.server.job.controllers.TimeController$2.onAlarm(TimeController.java:451)
      at android.app.AlarmManager$ListenerWrapper.run(AlarmManager.java:252)
      at android.os.Handler.handleCallback(Handler.java:883)
      at android.os.Handler.dispatchMessage(Handler.java:100)
      at android.os.Looper.loop(Looper.java:241)
      at com.android.server.SystemServer.run(SystemServer.java:585)
      at com.android.server.SystemServer.main(SystemServer.java:381)
      at java.lang.reflect.Method.invoke(Native method)
      at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
      at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:919)
    

    可以看到锁 0x00604b97 是因为thread 138造成的,接着看下thread 138是什么情况:

    "Binder:1655_1D" prio=5 tid=138 Blocked
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x19d9bb88 self=0x785c6f3800
      | sysTid=12593 nice=-2 cgrp=default sched=0/0 handle=0x7867541d50
      | state=S schedstat=( 62957020280 28452871663 279796 ) utm=4350 stm=1945 core=5 HZ=100
      | stack=0x786744b000-0x786744d000 stackSize=991KB
      | held mutexes=
      at com.android.server.wm.ActivityTaskManagerService$LocalService.getTopProcessState(ActivityTaskManagerService.java:6595)
      - waiting to lock <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock) held by thread 19
      at com.android.server.am.OomAdjuster.computeOomAdjLocked(OomAdjuster.java:893)
      at com.android.server.am.OomAdjuster.updateOomAdjLocked(OomAdjuster.java:365)
      at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:17818)
      at com.android.server.am.ActiveServices.realStartServiceLocked(ActiveServices.java:2703)
      at com.android.server.am.ActiveServices.bringUpServiceLocked(ActiveServices.java:2590)
      at com.android.server.am.ActiveServices.bindServiceLocked(ActiveServices.java:1805)
      at com.android.server.am.ActivityManagerService.bindIsolatedService(ActivityManagerService.java:14717)
      - locked <0x00604b97> (a com.android.server.am.ActivityManagerService)
      at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:2377)
      at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3101)
      at android.os.Binder.execTransactInternal(Binder.java:1036)
      at android.os.Binder.execTransact(Binder.java:1003)
    

    而锁0x0a36c171又是因为(held by thread 19),咱们接着找下thread 19:

    "android.anim" prio=5 tid=19 Native
      | group="main" sCount=1 dsCount=0 flags=1 obj=0x19d80998 self=0x78ca476400
      | sysTid=1851 nice=-10 cgrp=default sched=0/0 handle=0x78c601bd50
      | state=S schedstat=( 131446952824 47039277674 269369 ) utm=9781 stm=3362 core=1 HZ=100
      | stack=0x78c5f19000-0x78c5f1b000 stackSize=1039KB
      | held mutexes=
      kernel: (couldn't read /proc/self/task/1851/stack)
      native: #00 pc 00000000000cd7a4  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4)
      native: #01 pc 000000000008b39c  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+132)
      native: #02 pc 000000000005829c  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+256)
      native: #03 pc 000000000005915c  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
      native: #04 pc 0000000000058ef8  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+180)
      native: #05 pc 000000000004df38  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
      native: #06 pc 00000000000a3374  /system/lib64/libgui.so (android::BpSurfaceComposer::setTransactionState(android::Vector<android::ComposerState> const&, android::Vector<android::DisplayState> const&, unsigned int, android::sp<android::IBinder> const&, android::InputWindowCommands const&, long, android::client_cache_t const&, std::__1::vector<android::ListenerCallbacks, std::__1::allocator<android::ListenerCallbacks>> const&)+520)
      native: #07 pc 00000000000bf550  /system/lib64/libgui.so (android::SurfaceComposerClient::Transaction::apply(bool)+896)
      at android.view.SurfaceControl.nativeApplyTransaction(Native method)
      at android.view.SurfaceControl.access$400(SurfaceControl.java:77)
      at android.view.SurfaceControl$Transaction.apply(SurfaceControl.java:2273)
      at android.view.SurfaceControl$Transaction.apply(SurfaceControl.java:2255)
      at android.view.SurfaceControl.closeTransaction(SurfaceControl.java:1038)
      - locked <0x0f99c07c> (a java.lang.Class<android.view.SurfaceControl>)
      at com.android.server.wm.WindowManagerService.closeSurfaceTransaction(WindowManagerService.java:929)
      - locked <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock)
      at com.android.server.wm.WindowAnimator.animate(WindowAnimator.java:215)
      - locked <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock)
      at com.android.server.wm.WindowAnimator.lambda$new$1$WindowAnimator(WindowAnimator.java:99)
      at com.android.server.wm.-$$Lambda$WindowAnimator$ddXU8gK8rmDqri0OZVMNa3Y4GHk.doFrame(lambda:-1)
      at android.view.Choreographer$CallbackRecord.run(Choreographer.java:965)
      at android.view.Choreographer.doCallbacks(Choreographer.java:791)
      at android.view.Choreographer.doFrame(Choreographer.java:722)
      at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:952)
      at android.os.Handler.handleCallback(Handler.java:883)
      at android.os.Handler.dispatchMessage(Handler.java:100)
      at android.os.Looper.loop(Looper.java:241)
      at android.os.HandlerThread.run(HandlerThread.java:67)
      at com.android.server.ServiceThread.run(ServiceThread.java:44)
    

    提示的错误已经相对明显: android::BpSurfaceComposer::setTransactionState
    至此,我们容易追踪的watchdog信息便止步于此,不过我们已经可以知道是SurfaceFlinger相关的错误,在去结合代码和其他Log,也可以继续尝试解读native相关的错误Log进一步排查(native相关的callstack解读在Tombstone文章有介绍到),最后才能定位root cause.

    以上便是一个简单的Watchdog问题分析过程。

    相关文章

      网友评论

          本文标题:APP-ANR研究

          本文链接:https://www.haomeiwen.com/subject/hnnclktx.html