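Problem description
The YARN service reported a fault. In service management, one NodeManager instance showed an abnormal status.
Analysis
1. First check the start/stop script log: the NodeManager was stopped with stop type HEATH_CHECK_STOP.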
2019-06-19 15:13:29 | INFO | PID-16052 | start to stop nodemanager | yarn-start-stop.sh
2019-06-19 15:13:29 | INFO | PID-16052 | stop type: HEATH_CHECK_STOP. | yarn-start-stop.sh
2. Check the NodeManager runtime log: it consists entirely of "delete app log dir" messages until RECEIVED SIGNAL 15 arrives at the end and the process is killed.
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11333211_DEL_1559995833078 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_10625928_DEL_1559995968558 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_12077960_DEL_1560384533291 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11315373_DEL_1559996652333 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11035836_DEL_1559996652333 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11127905_DEL_1559996105413 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11274943_DEL_1559996241710 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1547246203054_1961416_DEL_1550657777851 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1547246203054_2114204_DEL_1550657777851 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_10699626_DEL_1559996379568 | ResourceLocalizationService.java:1474
2019-06-19 15:13:26,899 | INFO | main | delete app log dir,application_1550654406365_11842934_DEL_1560261203480 | ResourceLocalizationService.java:1474
2019-06-19 15:13:29,650 | ERROR | SIGTERM handler | RECEIVED SIGNAL 15: SIGTERM | LogAdapter.java:69
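3. Based on the above, the NodeManager was starting normally, but on startup it had to clean up information for a large number of applications. The health check failed before the cleanup finished, so the NodeManager was stopped and restarted.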
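The log check in step 2 can be scripted. The following is only a minimal sketch, assuming the pipe-separated log layout shown in the excerpts. The function name summarize_nm_log is hypothetical, and the log file path is whatever your installation uses.

```shell
#!/usr/bin/env bash
# Summarize how much of a NodeManager log is startup cleanup output.
# Assumption: the log uses the pipe-separated layout shown in the excerpts.
# summarize_nm_log is a hypothetical helper name, not part of any product.
summarize_nm_log() {
    local log_file="$1"
    local deletes sigterm
    # Count cleanup messages. grep -c prints 0 but exits non-zero when
    # nothing matches, so "|| true" keeps the count in that case.
    deletes=$(grep -c 'delete app log dir' "$log_file" || true)
    # Timestamp (first pipe-separated field) of the last SIGTERM, if any.
    sigterm=$(grep 'RECEIVED SIGNAL 15' "$log_file" | tail -n 1 | cut -d'|' -f1 | sed 's/ *$//')
    echo "delete_lines=${deletes}"
    echo "last_sigterm=${sigterm:-none}"
}
```

Call it as summarize_nm_log followed by the path to the NodeManager log. A large delete_lines count together with a SIGTERM timestamp shortly after the last cleanup line matches the pattern described above.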
Solution
1. First manually clean up the NodeManager logs: rm -rf /srv/BigData/hadoop/data*/nm
2. Restart the NodeManager.
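Because rm -rf with a glob is irreversible, it can help to preview what the pattern matches before running it. This is a sketch only: preview_cleanup is a hypothetical helper, and it only lists directories and counts their contents without deleting anything.

```shell
#!/usr/bin/env bash
# List each existing directory passed in and how many entries it holds.
# Nothing is deleted. preview_cleanup is a hypothetical helper name.
preview_cleanup() {
    local dir
    for dir in "$@"; do
        # Skip arguments that are not directories (e.g. an unmatched glob).
        if [ -d "$dir" ]; then
            printf '%s\t%s entries\n' "$dir" "$(find "$dir" -mindepth 1 | wc -l | tr -d ' ')"
        fi
    done
}
```

For the path used in step 1, the call would be preview_cleanup /srv/BigData/hadoop/data*/nm. If the listed directories are the expected ones, proceed with the rm -rf command and the restart.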