A customer running our Spark offering (specifically the Spark Thrift Server) reported that errors start appearing after about a day of use. Restarting the Spark Thrift Server makes the errors go away for a while, but that only treats the symptom; the root cause still has to be dug out and fixed.
The error messages are shown in the following screenshots:
- 1.jpg
- 2.jpg
- 3.jpg
Since the error in the log points to HDFS's BlockManager, I looked at that class's chooseTarget4NewBlock method in the source code:
/**
 * Choose target datanodes for creating a new block.
 *
 * @throws IOException
 *           if the number of targets < minimum replication.
 * @see BlockPlacementPolicy#chooseTarget(String, int, Node,
 *      Set, long, List, BlockStoragePolicy)
 */
public DatanodeStorageInfo[] chooseTarget4NewBlock(final String src,
    final int numOfReplicas, final Node client,
    final Set<Node> excludedNodes,
    final long blocksize,
    final List<String> favoredNodes,
    final byte storagePolicyID) throws IOException {
  List<DatanodeDescriptor> favoredDatanodeDescriptors =
      getDatanodeDescriptors(favoredNodes);
  final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(storagePolicyID);
  final DatanodeStorageInfo[] targets = blockplacement.chooseTarget(src,
      numOfReplicas, client, excludedNodes, blocksize,
      favoredDatanodeDescriptors, storagePolicy);
  if (targets.length < minReplication) {
    throw new IOException("File " + src + " could only be replicated to "
        + targets.length + " nodes instead of minReplication (="
        + minReplication + "). There are "
        + getDatanodeManager().getNetworkTopology().getNumOfLeaves()
        + " datanode(s) running and "
        + (excludedNodes == null? "no": excludedNodes.size())
        + " node(s) are excluded in this operation.");
  }
  return targets;
}
- From this method it is clear that the error is an HDFS block placement problem: not enough datanodes could be chosen as targets for new blocks. So I ran hdfs dfsadmin -report and found that two machines had critically low DFS Remaining and DFS Remaining%. (Some clues can also be found by checking the DataNode logs.)
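The same check can be done programmatically. Below is a minimal sketch (the class name DfsRemainingReport is my own; it assumes a reachable HDFS cluster with core-site.xml/hdfs-site.xml on the classpath, and depending on the cluster it may require HDFS superuser permission) that prints the capacity and DFS Remaining of every DataNode, roughly what hdfs dfsadmin -report shows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DfsRemainingReport {
  public static void main(String[] args) throws Exception {
    // Load core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // getDataNodeStats() returns one DatanodeInfo per DataNode,
      // the same data that backs hdfs dfsadmin -report.
      for (DatanodeInfo dn : dfs.getDataNodeStats()) {
        System.out.printf("%-30s capacity=%dGB remaining=%dGB (%.1f%%)%n",
            dn.getHostName(),
            dn.getCapacity() / (1024L * 1024 * 1024),
            dn.getRemaining() / (1024L * 1024 * 1024),
            dn.getRemainingPercent());
      }
    }
  }
}

Any DataNode whose remaining space is close to zero is a likely cause of the error above, because the block placement policy will refuse to choose it as a target.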
- Ways to fix the problem:
1. Add new DataNode nodes.
2. Add disks to the DataNodes that are running out of space (it is also possible that the machine has enough disk space, but the disks were never successfully mounted into HDFS; a quick way to verify this is sketched below).
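For the second point, a disk can have plenty of free space and still be invisible to HDFS if its mount point is not listed in dfs.datanode.data.dir, or is listed but not accessible. The following is a small sketch, run on a DataNode host, that walks the configured data directories and flags any that are missing or not writable; the class name DataDirCheck is hypothetical, and it assumes hdfs-site.xml is on the classpath:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class DataDirCheck {
  public static void main(String[] args) {
    // HdfsConfiguration loads hdfs-site.xml in addition to core-site.xml.
    Configuration conf = new HdfsConfiguration();
    String dataDirs = conf.get("dfs.datanode.data.dir", "");
    for (String entry : dataDirs.split(",")) {
      // Entries may carry a storage-type prefix and URI scheme,
      // e.g. [DISK]file:///data1/dfs/dn, so strip both before checking the path.
      String path = entry.trim()
          .replaceFirst("^\\[.*?\\]", "")
          .replaceFirst("^file://", "");
      File dir = new File(path);
      System.out.printf("%-40s exists=%b writable=%b%n",
          path, dir.isDirectory(), dir.canWrite());
    }
  }
}

A directory that shows exists=false, or that sits on the root filesystem instead of the intended data disk, means the DataNode is not actually using that disk even though the machine itself has free space.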