美文网首页
记一次Hive任务hang住的问题(2)

记一次Hive任务hang住的问题(2)

作者: 井地儿 | 来源:发表于2020-03-21 09:14 被阅读0次

    1 背景

    线上Hive任务偶尔出现hang住的现象.
    经排查确认是触发了Hive的bug, 该bug在Hive-10569(https://issues.apache.org/jira/browse/HIVE-10569中已修复.

    2 现象

    用户beeline任务运行至Ended Job之后,一直未结束,查看yarn日志,该job已经运行成功.

    INFO  : 2020-03-20 20:25:21,401 Stage-36 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec
    INFO  : 2020-03-20 20:25:30,224 Stage-36 map = 100%,  reduce = 100%, Cumulative CPU 9.44 sec
    INFO  : MapReduce Total cumulative CPU time: 9 seconds 440 msec
    INFO  : Ended Job = job_1582793079899_13676666INFO  : 2020-03-20 20:25:21,401 Stage-36 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec
    INFO  : 2020-03-20 20:25:30,224 Stage-36 map = 100%,  reduce = 100%, Cumulative CPU 9.44 sec
    INFO  : MapReduce Total cumulative CPU time: 9 seconds 440 msec
    INFO  : Ended Job = job_1582793079899_13676666
    

    3 排查步骤

    3.1 查看hiveserver2日志

    找到该job在hive server端的日志,没有发现任何异常.

    3.2 抓取现场

    分别通过jstack -l <pid>jmap -dump:[live,]format=b,file=<filename> <pid>获取server端的堆栈信息.

    3.3 定位排查

    • 首先, 确定session ID.
      根据 job id 定位到该MR对应的hive session id(session id会在hive conf中记录);
    • 然后, 确定线程ID.
      通过MAT分析工具打开之前获取到的heap文件, 根据session id查找到对应的HiveSessionImpl对象, 进而追踪到对应的线程ID;
    • 最后, 确定线程栈.
      在jstack文件中找到对应的线程,从而定位到具体的线程栈.

    本案例中线程栈如下:

    "HiveServer2-Background-Pool: Thread-10663039" #10663039 prio=5 os_prio=0 tid=0x00007fcd94c87000 nid=0x213ed in Object.wait() [0x00007fcbec2fb000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at org.apache.hadoop.hive.ql.DriverContext.pollFinished(DriverContext.java:118)
            - locked <0x00007fd29eff91d0> (a org.apache.hadoop.hive.ql.DriverContext)
            at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1574)
            at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
            at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1113)
            at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1108)
            at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:157)
            at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
            at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:209)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1923)
            at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:221)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    

    3.4 源码剖析

    查看源码位置, 分析可知runner.isRunning状态未更新会导致hang住不结束.

      /**
       * Polls running tasks to see if a task has ended.
       *
       * @return The result object for any completed/failed task
       */
      public synchronized TaskRunner pollFinished() throws InterruptedException {
        while (!shutdown) {
          Iterator<TaskRunner> it = running.iterator();
          while (it.hasNext()) {
            TaskRunner runner = it.next();
            if (runner != null && !runner.isRunning()) {
              it.remove();
              return runner;
            }
          }
          wait(SLEEP_TIME);
        }
        return null;
      }
    

    4 解决方案

    查找社区issue发现, 该bug已经在1.3版本中进行了修复.(https://issues.apache.org/jira/browse/HIVE-10569
    后续 backport 该patch即可.

    相关文章

      网友评论

          本文标题:记一次Hive任务hang住的问题(2)

          本文链接:https://www.haomeiwen.com/subject/hnneyhtx.html