美文网首页Troubleshooting
FastJNI死锁问题攻关总结

FastJNI死锁问题攻关总结

作者: dumphex | 来源:发表于2017-07-06 16:32 被阅读489次

    问题现象

    • 复现步骤
      • Android 7.0平台 + W16.49.4/W16.50.4版本
      • monkey测试时低概率出现定屏和Camera2 ANR问题
    • 多次复现后,特征如下
      • 复现概率低
      • adb能响应,除了system_server/Camera2其它进程均正常
      • system_server/Camera2不响应kill -3信号,无法dump java和native堆栈
      • system_server/Camera2的多数线程处于futex_wait状态

    分析定位

    初步分析

    定屏后,system_server进程的log很少: 只有少数几个线程打印log信息。

    • GC线程异常log

      分析多份log,发现GC线程输出异常信息

      A001-02 11:19:31.708   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:22:49.907   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:26:08.106   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:29:26.304   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:32:44.502   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:36:02.701   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:39:20.899   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:42:39.098   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:45:57.296   731   741 E art     : Unexpected time out during suspend all.
      A001-02 11:49:15.494   731   741 E art     : Unexpected time out during suspend all.
      
    • Debuggerd native堆栈

      "system_server" sysTid=731
       #00 pc 000173e4  /system/lib/libc.so (syscall+28)
       #01 pc 000b69bd  /system/lib/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+92)
       #02 pc 0027aca3  /system/lib/libart.so (_ZN3art3JNI12NewStringUTFEP7_JNIEnvPKc+214)
       #03 pc 00091ee5  /system/lib/libandroid_runtime.so
       #04 pc 73104e99  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
      
      "Signal Catcher" sysTid=736
       #00 pc 000173e4  /system/lib/libc.so (syscall+28)
       #01 pc 000b69bd  /system/lib/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+92)
       #02 pc 000f9d6d  /system/lib/libart.so (_ZN3art11ClassLinker14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+232)
       #03 pc 0031f9eb  /system/lib/libart.so (_ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+114)
       #04 pc 0032431b  /system/lib/libart.so (_ZN3art13SignalCatcher13HandleSigQuitEv+1394)
      #05 pc 00323489  /system/lib/libart.so (_ZN3art13SignalCatcher3RunEPv+336)
       #06 pc 00046fd3  /system/lib/libc.so (_ZL15__pthread_startPv+22)
       #07 pc 00019ded  /system/lib/libc.so (__start_thread+6)
      
      "HeapTaskDaemon" sysTid=741
       #00 pc 000173e4  /system/lib/libc.so (syscall+28)
       #01 pc 0033bcc5  /system/lib/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+496)
       #02 pc 0033c187  /system/lib/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+374)
       #03 pc 0016eb45  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector11ScopedPauseC2EPS2_+32)
       #04 pc 001737dd  /system/lib/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+160)
       #05 pc 0016e555  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector3RunENS0_7GcCauseEb+248)
       #06 pc 00191a6d  /system/lib/libart.so   (_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEb+2364)
       #07 pc 0019734d  /system/lib/libart.so (_ZN3art2gc4Heap12ConcurrentGCEPNS_6ThreadEb+68)
       #08 pc 0019be83  /system/lib/libart.so (_ZN3art2gc4Heap16ConcurrentGCTask3RunEPNS_6ThreadE+18)
       #09 pc 001b4243  /system/lib/libart.so (_ZN3art2gc13TaskProcessor11RunAllTasksEPNS_6ThreadE+30)
       #10 pc 72ce401f  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
       ... ...
      

    开会讨论

    • signal catcher为什么不能dump堆栈?

      • signal catcher线程已被GC线程suspend,除非被GC resume,否则无法响应kill -3信号。
    • GC线程suspend机制梳理

      • GC线程先向各线程请求suspend,各线程会在检测到相应标志后主动suspend,GC线程等待其它线程suspend后才去做GC操作。
    • w16.49.4回退提交排查

      • 后期宣告失败,因为早期版本也存在此问题。

    添加debug log

    • 尝试分析debuggerd堆栈,排查未suspended线程。但system_server线程个数接近90,排查完后剩余的线程仍然很多。
    • 复现问题时,打印不能被suspend的线程及相关信息

    问题复现和分析

    ART debug log

    A001-01 09:30:03.427   722   733 E art     : Unexpected time out during suspend all.
    A001-01 09:30:03.428   722   733 I art     : ThreadId = 18, tid = 761, name = android.display, state = Runnable
    
    A001-01 09:49:14.140   722   733 E art     : Unexpected time out during suspend all.
    A001-01 09:49:14.141   722   733 I art     : ThreadId = 18, tid = 761, name = android.display, state = Runnable
    
    A001-01 10:08:24.853   722   733 E art     : Unexpected time out during suspend all.
    A001-01 10:08:24.853   722   733 I art     : ThreadId = 18, tid = 761, name = android.display, state = Runnable
    

    debuggerd堆栈分析

    • system_server main线程
      • debuggerd native堆栈

        "system_server" sysTid=722
         #00 pc 000173e4  /system/lib/libc.so (syscall+28)
         #01 pc 000b69bd  /system/lib/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+92)
         #02 pc 0027aca3  /system/lib/libart.so (_ZN3art3JNI12NewStringUTFEP7_JNIEnvPKc+214)
         #03 pc 00091ee5  /system/lib/libandroid_runtime.so
         #04 pc 732e1e99  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
        
      • 查看对应的文件名和行号等信息

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroid_runtime.so 00091ee5

        android_content_AssetManager_getArrayStringResource
        frameworks/base/core/jni/android_util_AssetManager.cpp:1977
        
      • 相应的函数code

        1937static jobjectArray android_content_AssetManager_getArrayStringResource(JNIEnv* env, jobject clazz,
        1938                                                                        jint arrayResId)
        1939{
        1940    AssetManager* am = assetManagerForJavaObject(env, clazz);
        1941    if (am == NULL) {
        1942        return NULL;
        1943    }
        1944    const ResTable& res(am->getResources());
        1945
        1946    const ResTable::bag_entry* startOfBag;
        1947    const ssize_t N = res.lockBag(arrayResId, &startOfBag);
        1948    if (N < 0) {
        1949        return NULL;
        1950    }
        1951
        1952    jobjectArray array = env->NewObjectArray(N, g_stringClass, NULL);
        1953    if (env->ExceptionCheck()) {
        1954        res.unlockBag(startOfBag);
        1955        return NULL;
        1956    }
        1957
        1958    Res_value value;
        1959    const ResTable::bag_entry* bag = startOfBag;
        1960    size_t strLen = 0;
        1961    for (size_t i=0; ((ssize_t)i)<N; i++, bag++) {
        1962        value = bag->map.value;
        1963        jstring str = NULL;
        1964
        1965        // Take care of resolving the found resource to its final value.
        1966        ssize_t block = res.resolveReference(&value, bag->stringBlock, NULL);
        1967        if (kThrowOnBadId) {
        1968            if (block == BAD_INDEX) {
        1969                jniThrowException(env, "java/lang/IllegalStateException", "Bad resource!");
        1970                return array;
        1971            }
        1972        }
        1973        if (value.dataType == Res_value::TYPE_STRING) {
        1974            const ResStringPool* pool = res.getTableStringBlock(block);
        1975            const char* str8 = pool->string8At(value.data, &strLen);
        1976            if (str8 != NULL) {
        1977                str = env->NewStringUTF(str8);
        
    • android.display线程
      • debuggerd native堆栈

        "android.display" sysTid=761
         #00 pc 000173e4  /system/lib/libc.so (syscall+28)
         #01 pc 000476a5  /system/lib/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tbPK8timespec+176)
         #02 pc 00093975  /system/lib/libandroid_runtime.so
         #03 pc 732e17e5  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
        
      • 查看对应的文件名和行号等信息

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroid_runtime.so 00093975

        android_content_AssetManager_applyStyle
        frameworks/base/core/jni/android_util_AssetManager.cpp:1430
        
      • 相应的函数code

        1340static jboolean android_content_AssetManager_applyStyle(JNIEnv* env, jobject clazz,
        1341                                                        jlong themeToken,
        1342                                                        jint defStyleAttr,
        1343                                                        jint defStyleRes,
        1344                                                        jlong xmlParserToken,
        1345                                                        jintArray attrs,
        1346                                                        jintArray outValues,
        1347                                                        jintArray outIndices)
        1348{
        1349    if (themeToken == 0) {
        1350        jniThrowNullPointerException(env, "theme token");
        1351        return JNI_FALSE;
        1352    }
        1353    if (attrs == NULL) {
        1354        jniThrowNullPointerException(env, "attrs");
        1355        return JNI_FALSE;
        1356    }
        1357    if (outValues == NULL) {
        1358        jniThrowNullPointerException(env, "out values");
        1359        return JNI_FALSE;
        1360    }
        1361
        1362    if (kDebugStyles) {
        1363    ALOGI("APPLY STYLE: theme=0x%" PRIx64 " defStyleAttr=0x%x defStyleRes=0x%x "
        1364          "xml=0x%" PRIx64, themeToken, defStyleAttr, defStyleRes,
        1365          xmlParserToken);
        1366    }
        1367
        1368    ResTable::Theme* theme = reinterpret_cast<ResTable::Theme*>(themeToken);
        1369    const ResTable& res = theme->getResTable();
        1370    ResXMLParser* xmlParser = reinterpret_cast<ResXMLParser*>(xmlParserToken);
        1371    ResTable_config config;
        1372    Res_value value;
        1373
        1374    const jsize NI = env->GetArrayLength(attrs);
        1375    const jsize NV = env->GetArrayLength(outValues);
        1376    if (NV < (NI*STYLE_NUM_ENTRIES)) {
        1377        jniThrowException(env, "java/lang/IndexOutOfBoundsException", "out values too small");
        1378        return JNI_FALSE;
        1379    }
        1380
        1381    jint* src = (jint*)env->GetPrimitiveArrayCritical(attrs, 0);
        1382    if (src == NULL) {
        1383        return JNI_FALSE;
        1384    }
        1385
        1386    jint* baseDest = (jint*)env->GetPrimitiveArrayCritical(outValues, 0);
        1387    jint* dest = baseDest;
        1388    if (dest == NULL) {
        1389        env->ReleasePrimitiveArrayCritical(attrs, src, 0);
        1390        return JNI_FALSE;
        1391    }
        1392
        1393    jint* indices = NULL;
        1394    int indicesIdx = 0;
        1395    if (outIndices != NULL) {
        1396        if (env->GetArrayLength(outIndices) > NI) {
        1397            indices = (jint*)env->GetPrimitiveArrayCritical(outIndices, 0);
        1398        }
        1399    }
        1400
        1401    // Load default style from attribute, if specified...
        1402    uint32_t defStyleBagTypeSetFlags = 0;
        1403    if (defStyleAttr != 0) {
        1404        Res_value value;
        1405        if (theme->getAttribute(defStyleAttr, &value, &defStyleBagTypeSetFlags) >= 0) {
        1406            if (value.dataType == Res_value::TYPE_REFERENCE) {
        1407                defStyleRes = value.data;
        1408            }
        1409        }
        1410    }
        1411
        1412    // Retrieve the style class associated with the current XML tag.
        1413    int style = 0;
        1414    uint32_t styleBagTypeSetFlags = 0;
        1415    if (xmlParser != NULL) {
        1416        ssize_t idx = xmlParser->indexOfStyle();
        1417        if (idx >= 0 && xmlParser->getAttributeValue(idx, &value) >= 0) {
        1418            if (value.dataType == value.TYPE_ATTRIBUTE) {
        1419                if (theme->getAttribute(value.data, &value, &styleBagTypeSetFlags) < 0) {
        1420                    value.dataType = Res_value::TYPE_NULL;
        1421                }
        1422            }
        1423            if (value.dataType == value.TYPE_REFERENCE) {
        1424                style = value.data;
        1425            }
        1426        }
        1427    }
        1428
        1429    // Now lock down the resource object and start pulling stuff from it.
        1430    res.lock();
        
    • HeapTaskDaemon线程
          "HeapTaskDaemon" sysTid=733
             #00 pc 000173e4  /system/lib/libc.so (syscall+28)
             #01 pc 0033bcf1  /system/lib/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+540)
             #02 pc 0033c2f7  /system/lib/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+374)
             #03 pc 0016eb45  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector11ScopedPauseC2EPS2_+32)
             #04 pc 001737dd  /system/lib/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+160)
             #05 pc 0016e555  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector3RunENS0_7GcCauseEb+248)
             #06 pc 00191a6d  /system/lib/libart.so (_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEb+2364)
             #07 pc 0019734d  /system/lib/libart.so (_ZN3art2gc4Heap12ConcurrentGCEPNS_6ThreadEb+68)
             #08 pc 0019be83  /system/lib/libart.so (_ZN3art2gc4Heap16ConcurrentGCTask3RunEPNS_6ThreadE+18)
             #09 pc 001b4243  /system/lib/libart.so (_ZN3art2gc13TaskProcessor11RunAllTasksEPNS_6ThreadE+30)
             #10 pc 72ec101f  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
      

    死锁现象总结

    • GC线程向各线程发送suspend request,并等待各线程suspend
    • main线程进入到JNI,拿到某模块一个mutex,并已suspend
    • android.display线程调用到JNI, 等待main线程释放mutex


      fastjni_deadlock1.png

    最新复现

    新的堆栈

    • system_server main线程

      • debuggerd native堆栈

        "system_server" sysTid=720
            #00 pc 000173e4  /system/lib/libc.so (syscall+28)
            #01 pc 000b62bd  /system/lib/libart.so (_ZN3art17ConditionVariable16WaitHoldingLocksEPNS_6ThreadE+92)
            #02 pc 00279a03  /system/lib/libart.so (_ZN3art3JNI12NewStringUTFEP7_JNIEnvPKc+214)
            #03 pc 00091ee5  /system/lib/libandroid_runtime.so
            #04 pc 72b51e99  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
        
      • 查看对应的文件名和行号等信息

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroid_runtime.so 00091ee5

        android_content_AssetManager_getArrayStringResource
        frameworks/base/core/jni/android_util_AssetManager.cpp:1977
        
    • Binder线程

      • debuggerd native堆栈

          "Binder:720_5" sysTid=944
          #00 pc 000173e4  /system/lib/libc.so (syscall+28)
          #01 pc 000476a5  /system/lib/libc.so (_ZL33__pthread_mutex_lock_with_timeoutP24pthread_mutex_internal_tbPK8timespec+176)
          #02 pc 0001cb49  /system/lib/libandroidfw.so (_ZN7android8ResTable13setParametersEPKNS_15ResTable_configE+24)
          #03 pc 00013993  /system/lib/libandroidfw.so (_ZNK7android12AssetManager26updateResourceParamsLockedEv+86)
          #04 pc 000138f9  /system/lib/libandroidfw.so (_ZN7android12AssetManager15setLocaleLockedEPKc+240)
          #05 pc 00013add  /system/lib/libandroidfw.so (_ZN7android12AssetManager16setConfigurationERKNS_15ResTable_configEPKc+60)
          #06 pc 00092b4f  /system/lib/libandroid_runtime.so
          #07 pc 72b5537d  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
        
      • 查看对应的文件名和行号等信息

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroidfw.so 00013993

        android::AssetManager::updateResourceParamsLocked() const
        /proc/self/cwd/frameworks/base/libs/androidfw/AssetManager.cpp:744
        

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroidfw.so 000138f9

        android::AssetManager::setLocaleLocked(char const*)
        /proc/self/cwd/frameworks/base/libs/androidfw/AssetManager.cpp:414
        

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroidfw.so 00013add

        android::AssetManager::setConfiguration(android::ResTable_config const&, char const*)
        /proc/self/cwd/frameworks/base/libs/androidfw/AssetManager.cpp:441
        

        $ arm-linux-androideabi-addr2line -C -f -e ~/symbols/system/lib/libandroid_runtime.so 00092b4f

        android_content_AssetManager_setConfiguration
        frameworks/base/core/jni/android_util_AssetManager.cpp:721
        
    • HeapTaskDaemon线程

      "HeapTaskDaemon" sysTid=731
         #00 pc 000173e4  /system/lib/libc.so (syscall+28)
         #01 pc 0033aa59  /system/lib/libart.so (_ZN3art10ThreadList18SuspendAllInternalEPNS_6ThreadES2_S2_b+516)
         #02 pc 0033b13f  /system/lib/libart.so (_ZN3art10ThreadList10SuspendAllEPKcb+374)
         #03 pc 0016e445  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector11ScopedPauseC2EPS2_+32)
         #04 pc 001730dd  /system/lib/libart.so (_ZN3art2gc9collector9MarkSweep9RunPhasesEv+160)
         #05 pc 0016de55  /system/lib/libart.so (_ZN3art2gc9collector16GarbageCollector3RunENS0_7GcCauseEb+248)
         #06 pc 0019136d  /system/lib/libart.so (_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEb+2364)
         #07 pc 00196c4d  /system/lib/libart.so (_ZN3art2gc4Heap12ConcurrentGCEPNS_6ThreadEb+68)
         #08 pc 0019b783  /system/lib/libart.so (_ZN3art2gc4Heap16ConcurrentGCTask3RunEPNS_6ThreadE+18)
         #09 pc 001b39e3  /system/lib/libart.so (_ZN3art2gc13TaskProcessor11RunAllTasksEPNS_6ThreadE+30)
         #10 pc 7273101f  /data/dalvik-cache/arm/system@framework@boot.oat (offset 0x2830000)
      

    新的死锁

    • 死锁如下
    fastjni_deadlock2.png

    Root Cause

    • 前面多次复现的JNI都跟frameworks/base下面的asset manager里面的JNI函数有关。

    • 对比了Android 6.0 vs. Android 7.0关于gAssetManagerMethods[]的差异,7.0比6.0在部分JNI函数的signature上增加了"!", 表示是fast JNI, 目的是提高性能。

      commit 46cfd93aa22b5a4acab4626c140415eef922445c
      Author: Jeff Sharkey <jsharkey@android.com>
      Date:   Mon Nov 2 18:39:45 2015 -0800
      
      Let's sprinkle some FastJNI into Resources.
      
      Before:
      
                     benchmark   us linear runtime
                      GetColor 14.9 ===========
                    GetInteger 19.9 ===============
          GetLayoutAndTraverse 38.4 =============================
                     GetString 38.5 ==============================
      
      After:
      
                     benchmark   us linear runtime
                      GetColor 13.9 ==========
                    GetInteger 18.8 ==============
          GetLayoutAndTraverse 38.1 =============================
                     GetString 38.2 ==============================
      
      Change-Id: I8c20e14182d2645bc62a0e7fc6345e298b11933c
      
      diff --git a/core/jni/android_util_AssetManager.cpp b/core/jni/android_util_AssetManager.cpp
      index c94bc64..55b7e7e 100644
      --- a/core/jni/android_util_AssetManager.cpp
      +++ b/core/jni/android_util_AssetManager.cpp
      @@ -2156,9 +2156,9 @@ static const JNINativeMethod gAssetManagerMethods[] = {
      ... ....
      -    { "applyStyle","(JIIJ[I[I[I)Z",
      +    { "applyStyle","!(JIIJ[I[I[I)Z",
           (void*) android_content_AssetManager_applyStyle },
      -    { "setConfiguration", "(IILjava/lang/String;IIIIIIIIIIIIII)V",
      +    { "setConfiguration", "!(IILjava/lang/String;IIIIIIIIIIIIII)V",
           (void*) android_content_AssetManager_setConfiguration },
      ... ...
      
    • 但ART在处理fast JNI时,不会进行状态切换

      extern uint32_t JniMethodStart(Thread* self) {
        JNIEnvExt* env = self->GetJniEnv();
        DCHECK(env != nullptr);
        uint32_t saved_local_ref_cookie = env->local_ref_cookie;
        env->local_ref_cookie = env->locals.GetSegmentState();
        ArtMethod* native_method = *self->GetManagedStack()->GetTopQuickFrame();
        if (!native_method->IsFastNative()) {
          // When not fast JNI we transition out of runnable.
          self->TransitionFromRunnableToSuspended(kNative);
        }
        return saved_local_ref_cookie;
      }
      

    解决方案

    • Revert fast JNI optimization

      • 验证
        • 基于出问题的W16.50.4版本验证两轮monkey测试: 20台机器运行48小时没有复现
        • 合入patch后, 类似的定屏和Camera2 ANR问题不再出现.
    • AOSP

      Remove FastJNI optimization on AssetManager to avoid dead lock.

      https://android-review.googlesource.com/#/c/340143/

    相关文章

      网友评论

        本文标题:FastJNI死锁问题攻关总结

        本文链接:https://www.haomeiwen.com/subject/qnnhhxtx.html