BlockManager is the main class Spark uses to store blocks. Similarity to HDFS:
when storing three replicas, they go to the local machine, the local rack, and another machine.
Difference from HDFS:
data is written to memory first; if the StorageLevel specifies disk, it spills to disk when memory is insufficient.
The main classes involved:
1. BlockInfoManager manages the mapping from BlockId to BlockInfo:
private[this] val infos = new mutable.HashMap[BlockId, BlockInfo]
It also provides read/write locks that protect operations on blocks.
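The lock protocol can be sketched as follows. This is a simplified, illustrative model, not Spark's actual BlockInfoManager: each block's info carries a reader count and an exclusive-writer flag, so many readers may share a block but a writer needs exclusive access.

```scala
import scala.collection.mutable

// Simplified model of BlockInfoManager (names and fields are illustrative):
// per-block lock state is a reader count plus an exclusive writer flag.
case class BlockInfo(size: Long, var readerCount: Int = 0, var writerLocked: Boolean = false)

class SimpleBlockInfoManager {
  private val infos = new mutable.HashMap[String, BlockInfo]

  def register(blockId: String, size: Long): Unit =
    infos(blockId) = BlockInfo(size)

  // Acquire a shared read lock; fails if a writer holds the block.
  def lockForReading(blockId: String): Boolean =
    infos.get(blockId) match {
      case Some(info) if !info.writerLocked =>
        info.readerCount += 1; true
      case _ => false
    }

  // Acquire an exclusive write lock; fails if any reader or writer is active.
  def lockForWriting(blockId: String): Boolean =
    infos.get(blockId) match {
      case Some(info) if !info.writerLocked && info.readerCount == 0 =>
        info.writerLocked = true; true
      case _ => false
    }
}

val mgr = new SimpleBlockInfoManager
mgr.register("rdd_0_0", 1024L)
```

With a reader holding the lock, a write-lock attempt on the same block fails until the reader releases it.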
2. BlockManagerMasterEndpoint manages all BlockManagers in the cluster.
Among other state, it maintains a BlockManagerId => BlockManagerInfo mapping,
where a BlockManagerInfo describes a single BlockManager and the blocks stored on it:
// Mapping from block manager id to the block manager's information.
private val blockManagerInfo = new mutable.HashMap[BlockManagerId, BlockManagerInfo]
// Mapping from executor ID to block manager ID.
private val blockManagerIdByExecutor = new mutable.HashMap[String, BlockManagerId]
// Mapping from block id to the set of block managers that have the block.
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
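How these three maps cooperate can be sketched with a simplified model (illustrative names only; the real endpoint keeps far richer state): registering an executor's BlockManager fills the executor mapping, and reporting a block fills the location set.

```scala
import scala.collection.mutable

// Simplified model of BlockManagerMasterEndpoint's bookkeeping.
// Ids are plain strings here; Spark uses BlockManagerId / BlockId classes.
class SimpleMaster {
  // executor ID -> block manager ID
  private val blockManagerIdByExecutor = new mutable.HashMap[String, String]
  // block ID -> set of block managers holding the block
  private val blockLocations = new mutable.HashMap[String, mutable.HashSet[String]]

  def registerBlockManager(executorId: String, blockManagerId: String): Unit =
    blockManagerIdByExecutor(executorId) = blockManagerId

  // Record that blockManagerId now holds blockId.
  def updateBlockInfo(blockManagerId: String, blockId: String): Unit =
    blockLocations.getOrElseUpdate(blockId, new mutable.HashSet[String]) += blockManagerId

  def getLocations(blockId: String): Set[String] =
    blockLocations.get(blockId).map(_.toSet).getOrElse(Set.empty)
}

val master = new SimpleMaster
master.registerBlockManager("exec-1", "bm-exec-1-host1")
master.updateBlockInfo("bm-exec-1-host1", "rdd_0_0")
```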
Inside the BlockManagerInfo class:
// Mapping from block id to its status.
private val _blocks = new JHashMap[BlockId, BlockStatus]
// Cached blocks held by this BlockManager. This does not include broadcast blocks.
private val _cachedBlocks = new mutable.HashSet[BlockId]
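The comment on _cachedBlocks is worth noting: broadcast blocks are tracked in _blocks but excluded from the cached-block set. A simplified sketch (illustrative, not Spark's real update logic):

```scala
import scala.collection.mutable

// Simplified model of BlockManagerInfo's per-manager state:
// block id -> status, plus the set of cached (non-broadcast) blocks.
case class BlockStatus(memSize: Long, diskSize: Long)

class SimpleBlockManagerInfo {
  private val _blocks = new mutable.HashMap[String, BlockStatus]
  private val _cachedBlocks = new mutable.HashSet[String]

  def updateBlockInfo(blockId: String, status: BlockStatus): Unit = {
    _blocks(blockId) = status
    // Broadcast blocks are recorded in _blocks but kept out of _cachedBlocks.
    if (!blockId.startsWith("broadcast_")) _cachedBlocks += blockId
  }

  def cachedBlocks: Set[String] = _cachedBlocks.toSet
}

val info = new SimpleBlockManagerInfo
info.updateBlockInfo("rdd_0_0", BlockStatus(memSize = 512L, diskSize = 0L))
info.updateBlockInfo("broadcast_1_piece0", BlockStatus(memSize = 128L, diskSize = 0L))
```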
Finally there are memoryStore and diskStore, with memoryStore preferred. This memory-first design is a key reason Spark outperforms MapReduce: in MapReduce (at least V1), the map side writes its output to disk, and after the shuffle the reduce side performs an external sort.
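The memory-first policy can be sketched like this (a simplified model, not Spark's MemoryStore/DiskStore): a put tries memory first, and only when memory is insufficient and the storage level allows disk does the block spill there.

```scala
import scala.collection.mutable

// Simplified model of the memory-first storage policy.
// useDisk stands in for StorageLevel.useDisk; all names are illustrative.
class SimpleStores(memoryCapacity: Long) {
  private var memoryUsed = 0L
  val memoryStore = new mutable.HashMap[String, Array[Byte]]
  val diskStore = new mutable.HashMap[String, Array[Byte]]

  def putBlock(blockId: String, data: Array[Byte], useDisk: Boolean): Boolean =
    if (memoryUsed + data.length <= memoryCapacity) {
      memoryStore(blockId) = data; memoryUsed += data.length; true
    } else if (useDisk) {
      diskStore(blockId) = data; true   // memory full: spill to disk
    } else {
      false                             // MEMORY_ONLY and no room: block is dropped
    }
}

val stores = new SimpleStores(memoryCapacity = 8L)
stores.putBlock("a", Array.fill(6)(1.toByte), useDisk = true)  // fits in memory
stores.putBlock("b", Array.fill(6)(2.toByte), useDisk = true)  // spills to disk
```

In real Spark code the equivalent choice is made via the persist storage level, e.g. MEMORY_ONLY versus MEMORY_AND_DISK.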