Hadoop YARN: NodeManager failure on a worker node

Author: vickeex | Posted: 2020-03-14 11:44

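    I had not looked after the Hadoop service for a while, and when I logged in today I found that the worker nodes had gone down.
    After I restarted the services, one worker node was running only the DataNode process and no NodeManager. Its log showed the following error.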

    2020-03-14 11:26:37,654 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
    org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
            at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:138)
            at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
            at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.createServer(ResourceLocalizationService.java:412)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.serviceStart(ResourceLocalizationService.java:388)
            at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
            at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:668)
            at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
            at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
            at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:937)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
    Caused by: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
            at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
            at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:736)
            at org.apache.hadoop.ipc.Server.bind(Server.java:621)
            at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:1185)
            at org.apache.hadoop.ipc.Server.<init>(Server.java:3067)
            at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1005)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:426)
            at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:347)
            at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:846)
            at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:172)
            at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:131)
            ... 12 more
    Caused by: java.net.BindException: Address already in use
            at sun.nio.ch.Net.bind0(Native Method)
            at sun.nio.ch.Net.bind(Net.java:433)
            at sun.nio.ch.Net.bind(Net.java:425)
            at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:220)
    

    Key point: Problem binding to [0.0.0.0:8040] ... Address already in use
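
When the same failure shows up across several long logs, it can help to pull out just the address that could not be bound. The following is a minimal sketch, not part of the original troubleshooting; `bind_failures` is a hypothetical helper name, and it assumes a `grep` that supports `-o` (GNU and BSD grep both do). As far as I know, 8040 is the default port of `yarn.nodemanager.localizer.address`, which would match the `ResourceLocalizationService` frames in the trace above.

```shell
# Hypothetical helper: print each distinct host:port that a Hadoop log
# reports as "Problem binding to [host:port]".
# Reads the files given as arguments, or stdin when there are none.
bind_failures() {
    grep -o 'Problem binding to \[[^]]*\]' "$@" \
        | sed 's/.*\[\(.*\)\]/\1/' \
        | sort -u
}
```

It could be called as `bind_failures /path/to/nodemanager.log`, where the log path is whatever your installation uses.
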
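    $ netstat -anp | grep 8040
    The port turned out not to be in use. Had it been occupied, the next step would have been kill -9 {PROCESS_ID} on the process holding it.
    $ ./sbin/start-yarn.sh
    Running this script on that node alone to restart the service succeeded.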
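
A bare `grep 8040` in the port check above also matches ports such as 18040, remote ports, or a PID that happens to contain 8040. As a rough sketch, the hypothetical helper `listeners_on_port` below narrows `netstat -anp` output to sockets that are listening on exactly the given port. It assumes the Linux net-tools column layout (local address in column 4, state in column 6).

```shell
# Hypothetical helper: filter `netstat -anp`-style output on stdin down to
# sockets in LISTEN state whose local address ends in exactly ":<port>".
listeners_on_port() {
    port="$1"
    awk -v port="$port" '$4 ~ (":" port "$") && $6 == "LISTEN"'
}
```

Usage would look like `netstat -anp 2>/dev/null | listeners_on_port 8040`. Empty output means nothing is listening on the port at that moment. Note that netstat shows the PID column for other users' processes only when run with sufficient privileges.
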

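    With the problem fixed, I traced it back and found that this node frequently reports: pdsh@blockchain-004: blockchain-003: ssh exited with exit code 1
    My guess: the SSH connection between the node and the master dropped while the node's local NodeManager process was still running. Once the master could reach the node over SSH again, it tried to start the NodeManager service a second time, and the new instance collided with the old one, which still held port 8040. By the time I looked, the old NodeManager had already exited on its own, so the "Address already in use" error looked like a false alarm.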
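
If the guess above is right, one way to avoid the collision is to check for a surviving NodeManager JVM before starting a new one. The sketch below is an illustration under that assumption, not something the original fix used. Both function names are hypothetical, and the input is assumed to be `jps`-style output with the PID first and the main class second.

```shell
# Hypothetical helper: read `jps` or `jps -l` output on stdin and print the
# PIDs of JVMs whose main class is NodeManager.
nodemanager_pids() {
    awk '$2 == "NodeManager" || $2 ~ /\.NodeManager$/ { print $1 }'
}

# Hypothetical guard: succeed only when the listing on stdin contains no
# NodeManager JVM, so a second instance is not started on top of the first.
safe_to_start_nodemanager() {
    pids=$(nodemanager_pids)
    if [ -n "$pids" ]; then
        echo "NodeManager already running with PID(s): $pids" >&2
        return 1
    fi
    return 0
}
```

On the affected node this could be combined as `jps | safe_to_start_nodemanager && yarn --daemon start nodemanager`. The `yarn --daemon` form is the per-node start command in Hadoop 3.x. Whether this cluster runs 3.x is an assumption based on the pdsh message.
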

    Article link: https://www.haomeiwen.com/subject/owzmshtx.html