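Background
The main project is a media app. It has a player service that lives in the main process and runs as a foreground service, plus an ordinary push process that uses a regular service.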
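After one release we suddenly got a flood of crash reports saying the player service had been started without calling startForeground. But we had long been posting the notification in both onCreate and onStartCommand, so in theory this should not have been possible.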
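Looking at the logs more carefully, the crash turned out to be thrown in the push process, not the main process. Only the main process ever starts the player service, so why would the error show up in the push process?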
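There was nothing for it but to read the source code of the service start-up flow. I'll skip the step-by-step and go straight to the findings:
ActiveServices.class (only the key code is shown here)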
...
// From this we can see that a foreground service that has not called startForeground() within 10 seconds of being started triggers the error and an ANR
// How long the startForegroundService() grace period is to get around to
// calling startForeground() before we ANR + stop it.
static final int SERVICE_START_FOREGROUND_TIMEOUT = 10*1000;
...
// This is where the error message comes from
void serviceForegroundCrash(ProcessRecord app, CharSequence serviceRecord) {
    mAm.crashApplication(app.uid, app.pid, app.info.packageName, app.userId,
            "Context.startForegroundService() did not then call Service.startForeground(): "
                + serviceRecord);
}
AppErrors.class
// This is where the crash is delivered. It first tries to match the pid; if that pid
// can no longer be found, the last process with the same package name gets the crash.
void scheduleAppCrashLocked(int uid, int initialPid, String packageName, int userId,
        String message) {
    ProcessRecord proc = null;
    // Figure out which process to kill. We don't trust that initialPid
    // still has any relation to current pids, so must scan through the
    // list.
    synchronized (mService.mPidsSelfLocked) {
        for (int i=0; i<mService.mPidsSelfLocked.size(); i++) {
            ProcessRecord p = mService.mPidsSelfLocked.valueAt(i);
            if (uid >= 0 && p.uid != uid) {
                continue;
            }
            if (p.pid == initialPid) {
                proc = p;
                break;
            }
            if (p.pkgList.containsKey(packageName)
                    && (userId < 0 || p.userId == userId)) {
                proc = p;
            }
        }
    }
    if (proc == null) {
        Slog.w(TAG, "crashApplication: nothing for uid=" + uid
                + " initialPid=" + initialPid
                + " packageName=" + packageName
                + " userId=" + userId);
        return;
    }
    proc.scheduleCrash(message);
}
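So the conclusion is: after the service was started, the main process must have been killed without leaving a crash log. With the main process's pid gone, the crash was delivered to another surviving process of the same package, which here was the push process.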
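To make the fallback concrete, here is a small plain-Java model of the selection loop. It is only a sketch, not the AOSP code: ProcModel, CrashTargetModel, the package name com.example.media, and the uid/pid values are all made up for illustration, and the userId check is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for ProcessRecord; keeps only the fields the loop reads. */
final class ProcModel {
    final int uid;
    final int pid;
    final String packageName;
    final String processName;

    ProcModel(int uid, int pid, String packageName, String processName) {
        this.uid = uid;
        this.pid = pid;
        this.packageName = packageName;
        this.processName = processName;
    }
}

final class CrashTargetModel {
    /**
     * Simplified version of the loop in scheduleAppCrashLocked:
     * an exact pid match wins, otherwise the last process of the same package is used.
     */
    static ProcModel pickCrashTarget(List<ProcModel> running, int uid, int initialPid,
            String packageName) {
        ProcModel proc = null;
        for (ProcModel p : running) {
            if (uid >= 0 && p.uid != uid) {
                continue; // belongs to a different app
            }
            if (p.pid == initialPid) {
                proc = p; // the process that was supposed to crash is still alive
                break;
            }
            if (p.packageName.equals(packageName)) {
                proc = p; // fallback candidate; later matches overwrite earlier ones
            }
        }
        return proc;
    }

    public static void main(String[] args) {
        ProcModel main = new ProcModel(10001, 100, "com.example.media", "com.example.media");
        ProcModel push = new ProcModel(10001, 200, "com.example.media", "com.example.media:push");

        // Main process still alive: the crash goes to the process that started the service.
        List<ProcModel> bothAlive = new ArrayList<>();
        bothAlive.add(main);
        bothAlive.add(push);
        System.out.println(pickCrashTarget(bothAlive, 10001, 100, "com.example.media").processName);
        // prints "com.example.media"

        // Main process already gone: the push process takes the crash instead.
        List<ProcModel> mainGone = new ArrayList<>();
        mainGone.add(push);
        System.out.println(pickCrashTarget(mainGone, 10001, 100, "com.example.media").processName);
        // prints "com.example.media:push"
    }
}
```

The second call is the situation from our crash reports: the pid that started the foreground service no longer exists, so the package-name fallback picks the push process.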
My first guess was exitProcess(0). A global search found one in MainActivity's onDestroy; it was written that way for business and historical reasons.
abstract class BaseActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        AppCompatDelegate.setDefaultNightMode(AppCompatDelegate.MODE_NIGHT_NO)
        savedInstanceState.checkVersion()
        super.onCreate(savedInstanceState)
        StatusBarUtil.init(this)
    }
    ...
}
To verify the guess, I then wrote a demo:
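import android.app.Service
import android.content.Intent
import android.os.Handler
import android.os.IBinder
import android.os.Looper
import android.util.Log
import kotlin.system.exitProcess

class TestService : Service() {
    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        Log.e("fat", "onStartCommanding")
        // Deliberately never call startForeground(), and kill the process
        // just before the 10-second grace period runs out.
        Handler(Looper.getMainLooper()).postDelayed({
            // stopSelf()
            exitProcess(0)
        }, 9900)
        return START_NOT_STICKY
    }

    // Not a bound service.
    override fun onBind(intent: Intent?): IBinder? = null
}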
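Starting it reproduced the problem, as expected (you need to start a second process as well, so that there is something to receive the crash).
But this still does not identify the real culprit. The conclusion so far is only that the process was killed. If a crash had killed it, the report should in theory have been uploaded on the next launch, yet crash monitoring showed nothing. So the other likely possibility is an ANR. Since the ANR detection tools currently available don't seem to be very good, I wrote my own: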
// TimeLength, logvInDebug, ifDebugging, CaughtException and throwInDebugMode are
// project-specific helpers that are not shown here.
object ANRShooting {
    private val anrThreshold = TimeLength.ONE_SECOND * 3
    private var disabledForMismatch = false
    private val reportHandler = run {
        val thread = HandlerThread("ANRReporting")
        thread.start()
        Handler(thread.looper)
    }
    private lateinit var mainThread: Thread

    @MainThread
    fun start() {
        mainThread = Thread.currentThread()
        Looper.getMainLooper().setMessageLogging { msg ->
            if (disabledForMismatch) {
                return@setMessageLogging
            }
            when {
                msg.startsWith(">>>>> Dispatching to ") -> onMessageStart(msg)
                msg.startsWith("<<<<< Finished to ") -> onMessageEnd(msg)
                else -> disabledForMismatch = true // Probably a modified ROM.
            }
        }
    }

    private fun onMessageStart(msg: String) {
        logvInDebug { "msg=$msg" }
        reportHandler.removeCallbacksAndMessages(null)
        reportHandler.postDelayed({
            ifDebugging {
                if (Debug.isDebuggerConnected()) { // Do not interrupt debugging.
                    return@postDelayed
                }
            }
            val e = CaughtException("Main thread seems to be stuck. msg=$msg")
            e.stackTrace = mainThread.stackTrace
            // This goes through our custom error logging and upload mechanism.
            throwInDebugMode(e)
        }, anrThreshold.millis)
    }

    private fun onMessageEnd(msg: String) {
        logvInDebug { "msg=$msg" }
        reportHandler.removeCallbacksAndMessages(null)
    }
}
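The core of this detector is just "arm a delayed report when a message starts, disarm it when the message ends". That idea can be tried off-device with a plain-JVM sketch like the one below. This is an illustration under assumptions, not the code we shipped: StallWatchdog, sleepQuietly, and the 100 ms threshold are invented for the example, and a ScheduledExecutorService stands in for the HandlerThread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical JVM-only model of the start/cancel watchdog pattern. */
final class StallWatchdog {
    // Single daemon thread that plays the role of the "ANRReporting" HandlerThread.
    private final ScheduledExecutorService reporter =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "StallReporter");
                t.setDaemon(true);
                return t;
            });
    private final Thread watched;
    private final long thresholdMillis;
    private final AtomicInteger reports = new AtomicInteger();
    private volatile ScheduledFuture<?> pending;

    StallWatchdog(Thread watched, long thresholdMillis) {
        this.watched = watched;
        this.thresholdMillis = thresholdMillis;
    }

    /** Arm a report that fires only if the message is still running after the threshold. */
    void onMessageStart(String label) {
        cancelPending();
        pending = reporter.schedule(() -> {
            // Capture where the watched thread is stuck at the moment the report fires.
            StackTraceElement[] stack = watched.getStackTrace();
            reports.incrementAndGet();
            System.out.println("Stalled in " + label + ", top frame: "
                    + (stack.length > 0 ? stack[0] : "<none>"));
        }, thresholdMillis, TimeUnit.MILLISECONDS);
    }

    /** The message finished in time, so disarm the report. */
    void onMessageEnd() {
        cancelPending();
    }

    int reportCount() {
        return reports.get();
    }

    private void cancelPending() {
        ScheduledFuture<?> f = pending;
        if (f != null) {
            f.cancel(false);
        }
    }

    /** Simulates work on the watched thread. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StallWatchdog dog = new StallWatchdog(Thread.currentThread(), 100);

        dog.onMessageStart("fast message");
        sleepQuietly(10);   // well under the threshold, no report
        dog.onMessageEnd();

        dog.onMessageStart("slow message");
        sleepQuietly(400);  // well over the threshold, one report
        dog.onMessageEnd();

        System.out.println("reports = " + dog.reportCount());
    }
}
```

As in the Android version, the report carries the watched thread's stack at the moment the threshold passes, which is what points at the slow call.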
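This did catch the real culprit: the ANR happened while fetching the User-Agent. Checking the requirements for this release, sure enough, a UA field had been added to the startup log. And since our browser engine is Tencent's X5 kernel, that is very likely the cause.
That's the whole case (who could have seen that coming!)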