结论:不要在服务端oneway接口中处理耗时操作,一旦用于高频调用,服务端又处理耗时,再偶尔碰上cpu负荷高,很可能会发生其他关键调用偶现失败的隐蔽问题,而且这个调用失败不会导致app crash严重问题,只会导致业务逻辑的缺失,如果错误log输出不完善,会让人莫名奇妙,而且很难复现,隐患就一直在那儿了。
最近遇到一个偶现问题,ROM刚OTA后,某个系统应用(App A)在重启过程中bind另一系统服务(服务B),后面却一直没有收到onServiceConnected回调,开始App A很合理的认为是服务B的问题(服务B最近修改过,会导致服务B启动慢,偶尔ANR)。
分析Log后发现,可能是由于OTA重启过程中在做大量的odex编译优化操作,导致CPU占用高,其启动过程比正常启动情况下慢很多(onCreate执行耗时长),但是虽然服务B启动较慢,但是也在ANR时间内完成启动,其他app也bind服务B成功,及时收到了onServiceConnected回调。App A并没有指定onServiceConnected在其他线程执行,于是怀疑App A主线程被他自己的操作block,导致onServiceConnected回调无法执行,但是在几分钟以后的log中发现了App A主线程有log输出,App A也并没有发生ANR,说明App A主线程并没有block,只是单纯的没有onServiceConnected回调。
好在其他App在绑定服务B的onServiceConnected回调中加了log,很明显的可以看到,其他App集中收到了回调,于是在其他App收到回调附近log仔细查看,发现这样的warnning log:
12-26 11:02:44.107 1095 2101 W ActivityManager: Failure sending service ComponentInfo{com.android.**/com.android.**.**} to connection android.os.BinderProxy@a3981ee (in com.**)
12-26 11:02:44.107 1095 2101 W ActivityManager: android.os.DeadObjectException: Transaction failed on small parcel; remote process probably died
12-26 11:02:44.107 1095 2101 W ActivityManager: at android.os.BinderProxy.transactNative(Native Method)
12-26 11:02:44.107 1095 2101 W ActivityManager: at android.os.BinderProxy.transact(Binder.java:764)
12-26 11:02:44.107 1095 2101 W ActivityManager: at android.app.IServiceConnection$Stub$Proxy.connected(IServiceConnection.java:95)
12-26 11:02:44.107 1095 2101 W ActivityManager: at com.android.server.am.ActiveServices.publishServiceLocked(ActiveServices.java:1526)
12-26 11:02:44.107 1095 2101 W ActivityManager: at com.android.server.am.ActivityManagerService.publishService(ActivityManagerService.java:18828)
12-26 11:02:44.107 1095 2101 W ActivityManager: at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:627)
12-26 11:02:44.107 1095 2101 W ActivityManager: at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3055)
12-26 11:02:44.107 1095 2101 W ActivityManager: at android.os.Binder.execTransact(Binder.java:697)
很显然,这是ActivityManagerService向App A回调onServiceConnected接口时抛出的异常,应该是App A进程有问题,但是根据后面的log看,App A其他work线程有正常log输出,主线程也有log输出,进程运行的很好,怎么会抛这个异常呢?
直觉上觉得可能是Binder出问题了,但Binder是成熟的不能再成熟的IPC机制了,又不太可能出问题,联系到Binder是通过linux驱动来实现的,于是先去kernel log中对应时间点看看有没有什么线索,然后就看到了很多类似下面kernel中的错误log:
12-26 11:02:44.102 0 0 I k : binder: 1095:2101 transaction failed 29201/-28, size 200-8 line 3142
注意 1095:2101是指binder client端的进程号和线程号,与上面Warning log一致。再找到对应该log的kernel代码。
aosp_kernel/common/drivers/android/binder.c
binder_debug(BINDER_DEBUG_FAILED_TRANSACTION,
"%d:%d transaction failed %d/%d, size %lld-%lld line %d\n",
proc->pid, thread->pid, return_error, return_error_param,
(u64)tr->data_size, (u64)tr->offsets_size,
return_error_line);
可以看到log中其他信息分别是,return_error:29201, return_error_param: -28,这个-28是通用的linux错误码:
errno:/usr/include/asm-generic/errno-base.h
#define EPERM 1 /* Operation not permitted */
#define ENOENT 2 /* No such file or directory */
#define ESRCH 3 /* No such process */
#define EINTR 4 /* Interrupted system call */
#define EIO 5 /* I/O error */
#define ENXIO 6 /* No such device or address */
#define E2BIG 7 /* Argument list too long */
#define ENOEXEC 8 /* Exec format error */
#define EBADF 9 /* Bad file number */
#define ECHILD 10 /* No child processes */
#define EAGAIN 11 /* Try again */
#define ENOMEM 12 /* Out of memory */
#define EACCES 13 /* Permission denied */
#define EFAULT 14 /* Bad address */
#define ENOTBLK 15 /* Block device required */
#define EBUSY 16 /* Device or resource busy */
#define EEXIST 17 /* File exists */
#define EXDEV 18 /* Cross-device link */
#define ENODEV 19 /* No such device */
#define ENOTDIR 20 /* Not a directory */
#define EISDIR 21 /* Is a directory */
#define EINVAL 22 /* Invalid argument */
#define ENFILE 23 /* File table overflow */
#define EMFILE 24 /* Too many open files */
#define ENOTTY 25 /* Not a typewriter */
#define ETXTBSY 26 /* Text file busy */
#define EFBIG 27 /* File too large */
#define ENOSPC 28 /* No space left on device */
#define ESPIPE 29 /* Illegal seek */
#define EROFS 30 /* Read-only file system */
#define EMLINK 31 /* Too many links */
#define EPIPE 32 /* Broken pipe */
#define EDOM 33 /* Math argument out of domain of func */
#define ERANGE 34 /* Math result not representable */
可以看到错误是-ENOSPC,表示device上没有空间了,而binder在Android上,就是以设备驱动的形式存在,然后顺藤摸瓜,只有在下面的这个函数中找到返回了-ENOSPC:
aosp_kernel/common/drivers/android/binder_alloc.candroid/binder_alloc.c
static struct binder_buffer *binder_alloc_new_buf_locked(
struct binder_alloc *alloc,
size_t data_size,
size_t offsets_size,
size_t extra_buffers_size,
int is_async)
{
}
往上跟踪,最终错误返回到:
aosp_kernel/common/drivers/android/binder.c 所在函数为binder_transaction,由函数名可以看出所有binder的传输,应该都有该函数处理(纯属猜测)
t->buffer = binder_alloc_new_buf(&target_proc->alloc, tr->data_size,
tr->offsets_size, extra_buffers_size,
!reply && (t->flags & TF_ONE_WAY));
if (IS_ERR(t->buffer)) {
/*
* -ESRCH indicates VMA cleared. The target is dying.
*/
return_error_param = PTR_ERR(t->buffer);
return_error = return_error_param == -ESRCH ?
BR_DEAD_REPLY : BR_FAILED_REPLY;
return_error_line = __LINE__;
t->buffer = NULL;
goto err_binder_alloc_buf_failed;
}
这里的PTR_ERR(t->buffer)就转换回了-ENOSPC然后跳转到了err_binder_alloc_buf_failed:打印出了kernel中的那行log
err_binder_alloc_buf_failed:
err_bad_extra_size:
if (secctx)
security_release_secctx(secctx, secctx_sz);
err_get_secctx_failed:
kfree(tcomplete);
binder_stats_deleted(BINDER_STAT_TRANSACTION_COMPLETE);
err_alloc_tcomplete_failed:
kfree(t);
binder_stats_deleted(BINDER_STAT_TRANSACTION);
err_alloc_t_failed:
err_bad_todo_list:
err_bad_call_stack:
err_empty_call_stack:
err_dead_binder:
err_invalid_target_handle:
if (target_thread)
binder_thread_dec_tmpref(target_thread);
if (target_proc)
binder_proc_dec_tmpref(target_proc);
if (target_node) {
binder_dec_node(target_node, 1, 0);
binder_dec_node_tmpref(target_node);
}
binder_debug(BINDER_DEBUG_FAILED_TRANSACTION,
"%d:%d transaction failed %d/%d, size %lld-%lld line %d\n",
proc->pid, thread->pid, return_error, return_error_param,
(u64)tr->data_size, (u64)tr->offsets_size,
return_error_line);
再继续仔细分析抛出异常之前的log, App A有大量的“接收”数据的log输出,App A刚修改过一个sdk的配置,导致云端(另一个进程)可能大量给App A发数据,可能就是这个操作占满了Binder空间(不到1MB):
android/frameworks/native/libs/binder/ProcessState.cpp
#define BINDER_VM_SIZE ((1 * 1024 * 1024) - sysconf(_SC_PAGE_SIZE) * 2)
//......
// mmap the binder, providing a chunk of virtual address space to receive transactions.
mVMStart = mmap(0, BINDER_VM_SIZE, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, mDriverFD, 0);
继续看其他kernel中的错误log:
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 5872-0 line 3142
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 208-0 line 3142
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.694 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 140-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I chatty : uid=0(root) logd identical 2 lines
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 140-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 140-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 140-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 140-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.695 0 0 I k : binder: 828:957 transaction failed 29201/-28, size 136-0 line 3142
12-26 11:02:43.694
这个时间点是出现第一条错误的log,进程828,线程957执行的binder调用没有成功,可以看到在1ms内调用了好多次,可以怀疑前面没出错时这个线程也在一直进行binder调用导致了binder服务端处理不过来binder驱动缓存没有及时释放导致后面的调用分配空间失败。再去logcat中查找828进程的log:
12-26 11:02:43.694 828 957 W IDataCallback: transact data error! cmd=320, errorCode=-2147483646
12-26 11:02:43.694 828 957 W IDataCallback: transact data error! cmd=318, errorCode=-2147483646
12-26 11:02:43.694 828 957 W IDataCallback: transact data error! cmd=142, errorCode=-2147483646
12-26 11:02:43.694 828 957 W IDataCallback: transact data error! cmd=318, errorCode=-2147483646
在和kernel log出错相同时间点,相同的进程、线程输出了传输数据错误的log。到这基本确认就是binder传输太频繁导致服务端进程无法再接收其他binder调用(App版拒绝服务攻击??),包括AMS向App A调用onServiceConnected。
于是准备写个demo来复现这个错误:
验证demo
IRemoteService.aidl
// IRemoteService.aidl
package com.example.mycardemo;
// Declare any non-default types here with import statements
interface IRemoteService {
/**
* Demonstrates some basic types that you can use as parameters
* and return values in AIDL.
*/
void remoteCallbackTest();
}
RemoteService.java
package com.example.mycardemo;
import android.app.Service;
import android.content.Intent;
import android.os.IBinder;
import android.os.RemoteException;
import android.util.Log;
import androidx.annotation.Nullable;
public class RemoteService extends Service {
private static final String TAG = "RemoteService";
private BinderService mService;
@Override
public void onCreate() {
super.onCreate();
mService = new BinderService();
}
@Nullable
@Override
public IBinder onBind(Intent intent) {
return mService;
}
private volatile int mCount = 0;
private class BinderService extends IRemoteService.Stub {
@Override
public void remoteCallbackTest() throws RemoteException {
Log.d(TAG, "remoteCallbackTest before wait " + (mCount++));
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
}
Log.d(TAG, "remoteCallbackTest after wait " + (--mCount));
}
}
}
RemoteService2.java
package com.example.mycardemo;
import android.app.Service;
import android.content.ComponentName;
import android.content.Context;
import android.content.Intent;
import android.content.ServiceConnection;
import android.os.IBinder;
import android.os.RemoteException;
import android.util.Log;
import androidx.annotation.Nullable;
public class RemoteService2 extends Service {
private static final String TAG = "RemoteService2";
private IRemoteService mRemoteService;
private volatile int mCallCount = 0;
private Runnable mRemoteCall = new Runnable() {
@Override
public void run() {
try {
Log.d(TAG, "RemoteService2 client call remote " + (mCallCount++));
mRemoteService.remoteCallbackTest();
Log.d(TAG, "RemoteService2 client call remote end " + (--mCallCount));
} catch (RemoteException e) {
e.printStackTrace();
}
}
};
private ServiceConnection mConnection = new ServiceConnection() {
@Override
public void onServiceConnected(ComponentName componentName, IBinder iBinder) {
Log.d(TAG, "onServiceConnected :" + componentName);
mRemoteService = IRemoteService.Stub.asInterface(iBinder);
for(int i = 0; i < 500; i++) {
new Thread(mRemoteCall).start();
}
}
@Override
public void onServiceDisconnected(ComponentName componentName) {
}
};
@Override
public void onCreate() {
super.onCreate();
Intent intent = new Intent();
intent.setPackage("com.example.mycardemo");
intent.setAction("action.test.remote.call");
bindService(intent, mConnection, Context.BIND_AUTO_CREATE);
}
@Nullable
@Override
public IBinder onBind(Intent intent) {
return null;
}
}
AndroidManifest.xml
<application
android:allowBackup="true"
android:icon="@mipmap/ic_launcher"
android:label="@string/app_name"
android:roundIcon="@mipmap/ic_launcher_round"
android:supportsRtl="true"
android:theme="@style/AppTheme">
<activity android:name=".MainActivity">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
<service
android:name=".RemoteService"
android:process=":remote">
<intent-filter>
<action android:name="action.test.remote.call"/>
</intent-filter>
</service>
<service android:name=".RemoteService2"
android:process=":remote2">
<intent-filter>
<action android:name="action.test.remote.call2"/>
</intent-filter>
</service>
</application>
信心满满运行(截取log):
01-09 14:18:02.990 1985 2742 D RemoteService2: RemoteService2 client call remote 497
01-09 14:18:02.992 1985 2743 D RemoteService2: RemoteService2 client call remote 498
01-09 14:18:02.993 1985 2745 D RemoteService2: RemoteService2 client call remote 499
01-09 14:18:11.475 1992 2037 D RemoteService: remoteCallbackTest after wait 15
01-09 14:18:11.475 1992 2037 D RemoteService: remoteCallbackTest before wait 15
01-09 14:18:11.475 1985 2044 D RemoteService2: RemoteService2 client call remote end 499
01-09 14:18:11.490 1992 2028 D RemoteService: remoteCallbackTest after wait 15
01-09 14:18:11.490 1985 2045 D RemoteService2: RemoteService2 client call remote end 498
01-09 14:18:11.490 1992 2028 D RemoteService: remoteCallbackTest before wait 15
01-09 14:18:11.491 1992 2029 D RemoteService: remoteCallbackTest after wait 15
01-09 14:18:11.492 1992 2029 D RemoteService: remoteCallbackTest before wait 15
01-09 14:18:11.492 1985 2050 D RemoteService2: RemoteService2 client call remote end 497
01-09 14:18:11.496 1992 2058 D RemoteService: remoteCallbackTest after wait 15
01-09 14:18:11.497 1992 2058 D RemoteService: remoteCallbackTest before wait 15
01-09 14:18:11.497 1985 2052 D RemoteService2: RemoteService2 client call remote end 496
01-09 14:18:11.504 1992 2059 D RemoteService: remoteCallbackTest after wait 15
01-09 14:18:11.504 1992 2059 D RemoteService: remoteCallbackTest before wait 15
居然没问题,井然有序的执行完了,疑惑不已,再和另一个同事讨论中,同事提醒onServiceConnected是oneway调用,会不会和这个有关,于是修改aidl:
oneway interface IRemoteService {
/**
* Demonstrates some basic types that you can use as parameters
* and return values in AIDL.
*/
void remoteCallbackTest();
}
再执行:
01-09 14:22:00.924 3290 3883 D RemoteService2: RemoteService2 client call remote 2
01-09 14:22:00.924 3290 3883 D RemoteService2: RemoteService2 client call remote end 2
01-09 14:22:00.925 3290 3884 D RemoteService2: RemoteService2 client call remote 2
01-09 14:22:00.925 3290 3884 D RemoteService2: RemoteService2 client call remote end 2
01-09 14:22:00.931 3290 3885 D RemoteService2: RemoteService2 client call remote 2
01-09 14:22:00.931 3290 3886 D RemoteService2: RemoteService2 client call remote 3
01-09 14:22:00.931 3290 3886 D RemoteService2: RemoteService2 client call remote end 3
01-09 14:22:00.931 3290 3885 D RemoteService2: RemoteService2 client call remote end 2
01-09 14:22:09.673 3295 3325 D RemoteService: remoteCallbackTest after wait 0
01-09 14:22:09.673 3295 3325 D RemoteService: remoteCallbackTest before wait 0
01-09 14:22:19.675 3295 3325 D RemoteService: remoteCallbackTest after wait 0
01-09 14:22:19.675 3295 3325 D RemoteService: remoteCallbackTest before wait 0
01-09 14:22:29.680 3295 3325 D RemoteService: remoteCallbackTest after wait 0
01-09 14:22:29.683 3295 3323 D RemoteService: remoteCallbackTest before wait 0
01-09 14:22:39.686 3295 3323 D RemoteService: remoteCallbackTest after wait 0
01-09 14:22:39.686 3295 3323 D RemoteService: remoteCallbackTest before wait 0
01-09 14:22:49.686 3295 3323 D RemoteService: remoteCallbackTest after wait 0
01-09 14:22:49.687 3295 3323 D RemoteService: remoteCallbackTest before wait 0
还是没问题T_T,不过意外的是,oneway方法服务端是依次执行,可以看到每个服务端的被调用都是隔10s调用一次,因为服务端接口中sleep了10s,同事和我在一块尝试复现,在修改调用次数到5000的时候,再执行:
01-09 14:31:11.258 15103 18251 E JavaBinder: !!! FAILED BINDER TRANSACTION !!! (parcel size = 84)
01-09 14:31:11.258 15103 18236 W System.err: android.os.DeadObjectException: Transaction failed on small parcel; remote process probably died
01-09 14:31:11.258 15103 18236 W System.err: at android.os.BinderProxy.transactNative(Native Method)
01-09 14:31:11.258 15103 18236 W System.err: at android.os.BinderProxy.transact(Binder.java:764)
01-09 14:31:11.259 15103 18250 D RemoteService2: RemoteService2 client call remote 227
01-09 14:31:11.259 15103 18236 W System.err: at com.example.mycardemo.IRemoteService$Stub$Proxy.remoteCallbackTest(IRemoteService.java:98)
01-09 14:31:11.259 15103 18236 W System.err: at com.example.mycardemo.RemoteService2$1.run(RemoteService2.java:26)
01-09 14:31:11.259 15103 18236 W System.err: at java.lang.Thread.run(Thread.java:764)
01-09 14:31:11.259 15103 18253 E JavaBinder: !!! FAILED BINDER TRANSACTION !!! (parcel size = 84)
01-09 14:31:11.259 15103 18252 E JavaBinder: !!! FAILED BINDER TRANSACTION !!! (parcel size = 84)
01-09 14:31:11.259 15103 18250 E JavaBinder: !!! FAILED BINDER TRANSACTION !!! (parcel size = 84)
01-09 14:31:11.260 15103 18251 W System.err: android.os.DeadObjectException: Transaction failed on small parcel; remote process probably died
01-09 14:31:11.260 15103 18251 W System.err: at android.os.BinderProxy.transactNative(Native Method)
01-09 14:31:11.260 15103 18251 W System.err: at android.os.BinderProxy.transact(Binder.java:764)
01-09 14:31:11.260 15103 18249 D RemoteService2: RemoteService2 client call remote 228
01-09 14:31:11.262 15103 18251 W System.err: at com.example.mycardemo.IRemoteService$Stub$Proxy.remoteCallbackTest(IRemoteService.java:98)
01-09 14:31:11.262 15103 18251 W System.err: at com.example.mycardemo.RemoteService2$1.run(RemoteService2.java:26)
01-09 14:31:11.262 15103 18251 W System.err: at java.lang.Thread.run(Thread.java:764)
复现了!和上面ActivityManagerService回调onServiceConnected一样的错误。
再改回非oneway接口测试,OOM,crash了,线程太多导致栈溢出了。但是binder调用还是正常。
01-09 14:27:42.286 9465 9465 E AndroidRuntime: java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at java.lang.Thread.nativeCreate(Native Method)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at java.lang.Thread.start(Thread.java:733)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at com.example.mycardemo.RemoteService2$2.onServiceConnected(RemoteService2.java:40)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.app.LoadedApk$ServiceDispatcher.doConnected(LoadedApk.java:1652)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.app.LoadedApk$ServiceDispatcher$RunConnection.run(LoadedApk.java:1681)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.os.Handler.handleCallback(Handler.java:790)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:99)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.os.Looper.loop(Looper.java:164)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at android.app.ActivityThread.main(ActivityThread.java:6494)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)
01-09 14:27:42.286 9465 9465 E AndroidRuntime: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:807)
然后去官网看了看oneway接口的说明:
FLAG_ONEWAY
Added in API level 1
public static final int FLAG_ONEWAY
Flag to transact(int, Parcel, Parcel, int): this is a one-way call, meaning that the caller returns immediately, without waiting for a result from the callee. Applies only if the caller and callee are in different processes.
The system provides special ordering semantics for multiple oneway calls being made to the same IBinder object: these calls will be dispatched in the other process one at a time, with the same order as the original calls. These are still dispatched by the IPC thread pool, so may execute on different threads, but the next one will not be dispatched until the previous one completes. This ordering is not guaranteed for calls on different IBinder objects or when mixing oneway and non-oneway calls on the same IBinder object.
These are still dispatched by the IPC thread pool, so may execute on different threads, but the next one will not be dispatched until the previous one completes.
的确binder驱动对于oneway的调用是类似于handler sendmessage那样的,挨个处理,所以如果服务端的oneway接口处理太慢而客户端调用太多的话,来不及处理的调用会占满binder驱动的缓存,导致其他调用抛出上面的transaction failed!这个有点坑啊,oneway binder调用岂不是成了不可靠调用??照上面的分析,只有oneway调用会导致这个问题,回到实际的那个问题,根据log tag 找到了IDataCallback.aidl
oneway interface IDataCallback {
void onReceiveBytes(int groupId, in byte[] data);
}
果然是个oneway的远程回调接口,到这应该能确定,就是oneway的使用导致了上面的那个偶现问题。
这个坑太隐蔽,本来oneway就是为了客户端不阻塞快速返回,一旦用于高频调用,服务端又处理耗时,再偶尔碰上cpu负荷高,很可能会发生其他关键调用偶现失败的隐蔽问题,而且这个调用失败不会导致app crash严重问题,只会导致业务逻辑的缺失,如果错误log输出不完善,会让人莫名奇妙,而且很难复现,隐患就一直在那儿了。
网友评论