ExitCodeException exitCode=1

Author: 焉知非鱼 · Published 2018-09-05 17:58 · 438 reads

    After submitting a Spark Streaming job, the Spark UI showed only the driver under Executors, and the Streaming tab just kept queueing batches as if the job were stuck. Opening the driver's logs -> stdout page showed many entries like these:

    18/09/04 19:10:45 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    18/09/04 19:10:51 org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66) WARN YarnAllocator: Container marked as failed: container_e13_1534402443030_0354_02_000125 on host: WMBigdata6. Exit status: 1. Diagnostics: Exception from container-launch.
    Container id: container_e13_1534402443030_0354_02_000125
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
        at org.apache.hadoop.util.Shell.run(Shell.java:507)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    

    A post titled 《YARN调度报错Stack trace: ExitCodeException exitCode=1解决方式》 suggested this was a permissions problem, so I went to the RESOURCEMANAGER log on the CDH cluster:

    vi /var/log/hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-WMBigdata0.log.out 
    
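    Paging through the whole file in vi is slow; grepping the same path for warnings and errors narrows things down faster. A convenience sketch, using the log path from the vi command above:

```shell
# Pull the most recent WARN/ERROR lines from the ResourceManager log
# (same path as opened in vi above).
grep -nE 'WARN|ERROR' \
  /var/log/hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-WMBigdata0.log.out \
  | tail -n 50
```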

    There I found entries like these:

    2018-08-30 12:58:36,062 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1534402443030_0230_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
    2018-08-30 12:58:36,062 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on 8030, call org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 10.0.201.124:56400 Call#2491 Retry#0
    org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1534402443030_0230_000001 doesn't exist in ApplicationMasterService cache.
            at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:442)
            at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
            at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
            at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
            at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)
    

    I also noticed a dr.who entry. (dr.who is the default static user that Hadoop's web UIs act as when HTTP authentication is not enabled, so this line by itself is usually harmless:)

    2018-08-31 10:52:34,563 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: dr.who is accessing unchecked http://WMBigdata3:42229/api/v1/applications/application_1534402443030_0277/allexecutors which is the app master GUI of application_1534402443030_0277 owned by root
    

    Searching around for these errors turned up nothing useful. Then another warning caught my eye:

    Slow ReadProcessor read fields took 30001ms (threshold=30000ms);
    

    The post 《reduce100%卡死故障排除》 pointed me at the DataNode logs:

    vi /var/log/hadoop-hdfs/hadoop-cmf-hdfs-DATANODE-bigdata3.log.out 
    

    Searching for the keyword "error" turned up:

    2018-08-31 16:14:45,695 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: bigdata3:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.0.166.172:45462 dst: /10.0.166.172:50010
    java.io.IOException: Premature EOF from inputStream
            at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:203)
            at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
            at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
            at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
            at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
            at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:901)
            at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:808)
            at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
            at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
            at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
            at java.lang.Thread.run(Thread.java:748)
    

    Googling led to the following fix. First, raise the per-user open-file and process limits in /etc/security/limits.conf:

    /etc/security/limits.conf
    # End of file
    *               -      nofile          1000000
    *               -      nproc           1000000
    
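    Note that limits.conf changes only apply to fresh login sessions, and the DataNode process has to be restarted under them. A quick sanity check, run as the user that owns the Hadoop daemons after re-logging in:

```shell
# Print the effective per-process limits for the current session.
# With the limits.conf entries above, both should report 1000000.
ulimit -n   # max open file descriptors (nofile)
ulimit -u   # max user processes (nproc)
```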

    Then, in the CDH UI, change dfs.datanode.max.transfer.threads from its value of 4096 to 8192 and restart the HDFS service:

    <property> 
        <name>dfs.datanode.max.transfer.threads</name> 
        <value>8192</value> 
        <description> 
            Specifies the maximum number of threads to use for transferring data
            in and out of the DN. 
        </description>
    </property>
    
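    After the restart, it is worth confirming that the new value is actually in effect. `hdfs getconf` reads the deployed configuration files, so run this on a node where CDH has already pushed the updated config:

```shell
# Should print 8192 once the new configuration is deployed and HDFS restarted.
hdfs getconf -confKey dfs.datanode.max.transfer.threads
```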

    After a while, the Spark Streaming UI recovered and batches stopped piling up. Facepalm: I had simply forgotten to raise the max-open-files limit beforehand.



          Link: https://www.haomeiwen.com/subject/ihczwftx.html