After the NameNode went down due to a network problem, the DataNodes and other services in the cluster went down one after another. Once the network was repaired and the cluster restarted, two DataNodes still failed to start. Their logs showed the following errors:
2019-07-20 20:35:10,432 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=1366158422;bpid=BP-854081761-10.0.0.163-1563623298061;lv=-56;nsInfo=lv=-63;cid=CID-2fdc1483-a28e-4e5e-b95c-6333c3d148fe;nsid=1366158422;c=0;bpid=BP-854081761-10.0.0.163-1563623298061;dnuuid=997e0e1d-5895-4945-abc0-489d4292180a
2019-07-20 20:35:10,450 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory [DISK]file:/mnt/disk1/hdfs/ has already been used.
2019-07-20 20:35:10,485 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-854081761-10.0.0.163-1563623298061
2019-07-20 20:35:10,485 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-854081761-10.0.0.163-1563623298061
java.io.IOException: BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /mnt/disk1/hdfs/current/BP-854081761-10.0.0.163-1563623298061
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:212)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:244)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:395)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:748)
2019-07-20 20:35:10,488 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-854081761-10.0.0.163-1563623298061 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /mnt/disk1/hdfs/current/BP-854081761-10.0.0.163-1563623298061
2019-07-20 20:35:10,511 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk2/hdfs/in_use.lock acquired by nodename 33950@owner-node1
2019-07-20 20:35:10,511 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /mnt/disk2/hdfs is in an inconsistent state: cluster Id is incompatible with others.
2019-07-20 20:35:10,511 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to owner-node4/10.0.0.163:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:748)
2019-07-20 20:35:10,511 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to owner-node3/10.0.0.162:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 1, volumes configured: 2, volumes failed: 1, volume failures tolerated: 0
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:285)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1371)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:748)
2019-07-20 20:35:10,513 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to owner-node3/10.0.0.162:8020
2019-07-20 20:35:10,512 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to owner-node4/10.0.0.163:8020
2019-07-20 20:35:10,614 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2019-07-20 20:35:12,615 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2019-07-20 20:35:12,617 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2019-07-20 20:35:12,619 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at owner-node1/10.0.0.160
************************************************************/
Analysis showed that the DataNode could not start because disk1 and disk2 held storage metadata from different cluster incarnations:
Failed to add storage for block pool: BP-854081761-10.0.0.163-1563623298061 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /mnt/disk1/hdfs/current/BP-854081761-10.0.0.163-1563623298061
InconsistentFSStateException: Directory /mnt/disk2/hdfs is in an inconsistent state: cluster Id is incompatible with others.
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to owner-node4/10.0.0.163:8020. Exiting.
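When digging through a long DataNode log, the decisive lines can be pulled out with a simple filter. A minimal sketch, wrapped in a function so the log path stays up to you (the path in the usage comment is an assumption for this installation):

```shell
# Pull the fatal / storage-failure lines out of a DataNode log file.
# Usage: scan_dn_log <logfile>
scan_dn_log() {
  grep -E 'FATAL|InconsistentFSStateException|Failed to add storage' "$1"
}

# Typical usage (log path is an assumption; adjust to your cluster):
# scan_dn_log /var/log/hadoop-hdfs/hadoop-hdfs-datanode-owner-node1.log
```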
Further inspection showed that the two disks' VERSION files record different clusterIDs:
cat /mnt/disk2/hdfs/current/VERSION
#Sat Jul 20 11:29:40 CST 2019
storageID=DS-255b4fcb-7ade-433f-a14a-4bead35c0902
clusterID=CID-ec926c36-cc13-4b06-a239-21854f9b7221
cTime=0
datanodeUuid=8f1cc8a2-e5c0-4346-8339-b4d1b5e6a2d6
storageType=DATA_NODE
layoutVersion=-56
whereas cat /mnt/disk1/hdfs/current/VERSION gives:
#Sat Jul 01 20:35:10 CST 2019
storageID=DS-ce80d916-3fad-4c2d-aa0c-f72ac56beb25
clusterID=CID-2fdc1483-a28e-4e5e-b95c-6333c3d148fe
cTime=0
datanodeUuid=997e0e1d-5895-4945-abc0-489d4292180a
storageType=DATA_NODE
layoutVersion=-56
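Comparing VERSION files by eye gets error-prone once a node has many data disks. A quick sketch that prints the clusterID recorded in each configured data directory and fails if they disagree (the /mnt/diskN/hdfs mount points in the usage comment are this cluster's layout, an assumption for yours):

```shell
# Print the clusterID recorded in each data directory's VERSION file.
# Usage: check_cluster_ids <dir>...   Exits non-zero if the IDs disagree.
check_cluster_ids() {
  ids=""
  for d in "$@"; do
    # Extract the value after "clusterID=" in <dir>/current/VERSION.
    id=$(sed -n 's/^clusterID=//p' "$d/current/VERSION")
    echo "$d: $id"
    ids="$ids$id\n"
  done
  # All directories must report exactly one distinct clusterID.
  [ "$(printf '%b' "$ids" | sort -u | wc -l)" -eq 1 ]
}

# Typical usage on this cluster's layout:
# check_cluster_ids /mnt/disk1/hdfs /mnt/disk2/hdfs
```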
Fix: following the hints in the log, back up disk2's VERSION file and overwrite it with disk1's, so both volumes report the clusterID the NameNode expects. (Strictly, only the clusterID field has to agree; editing just that one line in disk2's VERSION is the more conservative fix, since copying the whole file also duplicates the per-volume storageID.)
cp /mnt/disk2/hdfs/current/VERSION /mnt/disk2/hdfs/current/VERSION.bak
cat /mnt/disk1/hdfs/current/VERSION > /mnt/disk2/hdfs/current/VERSION
# Then, on the master node, run:
start-dfs.sh
# or start the DataNode directly on the affected node:
# hadoop-daemon.sh start datanode
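After restarting, the simplest confirmation is that the live-node count reported by hdfs dfsadmin -report is back to normal. A small sketch that extracts that count from the report text (the "Live datanodes (N):" line format matches Hadoop 2.x output):

```shell
# Read an `hdfs dfsadmin -report` dump on stdin and print the number of
# live DataNodes, taken from the "Live datanodes (N):" header line.
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p'
}

# Typical usage (requires a running cluster):
# hdfs dfsadmin -report | live_datanodes
```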
I still do not fully understand why this ID changed. The later guess is that a colleague ran hdfs namenode -format on the cluster more than once: each format generates a fresh clusterID, so DataNode volumes written under the old ID would no longer match and would fail exactly this way.
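For cases where a reformat is truly unavoidable, the namenode command accepts an explicit cluster ID (hdfs namenode -format [-clusterid cid]), which avoids inventing a new one. A sketch that reads the existing ID from a NameNode metadata directory's VERSION file and prints, rather than runs, the matching format command (the metadata path in the usage comment is an assumption):

```shell
# Build (but do not run) a reformat command that reuses the existing
# clusterID, read from a NameNode metadata directory's VERSION file.
# Usage: print_format_cmd <dfs.namenode.name.dir>
print_format_cmd() {
  cid=$(sed -n 's/^clusterID=//p' "$1/current/VERSION")
  echo "hdfs namenode -format -clusterid $cid"
}

# Typical usage (the dfs.namenode.name.dir path is an assumption):
# print_format_cmd /mnt/disk1/dfs/name
```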