HDFS Replica Placement Mechanism

Author: handsomemao666 | Published 2021-03-21 16:39

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
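The unique-racks idea above can be sketched in a few lines. This is a simplified illustration, not HDFS's actual implementation: the `datanodes_by_rack` mapping is a hypothetical stand-in for the rack ids the NameNode resolves through its configured topology mapping.

```python
import random

def place_on_unique_racks(datanodes_by_rack, replication=3):
    """Naive policy sketch: every replica lands on a different rack.

    `datanodes_by_rack` is a hypothetical dict mapping rack id -> list of
    DataNode names. Real HDFS derives rack ids via Hadoop Rack Awareness.
    """
    # Pick `replication` distinct racks, then one node from each.
    racks = random.sample(list(datanodes_by_rack), k=replication)
    return [random.choice(datanodes_by_rack[rack]) for rack in racks]
```

Because every replica crosses to a different rack, a write pipeline under this policy pays inter-rack transfer cost for each replica, which is exactly the write overhead the paragraph above describes.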

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
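The default 3-replica layout described above can be sketched as follows. This is a minimal illustration under assumptions (the writer runs on a DataNode, and `topology` is a hypothetical rack-to-nodes dict); the real `BlockPlacementPolicyDefault` also weighs node load, free space, and other constraints.

```python
import random

def default_three_replica_targets(writer_node, topology):
    """Sketch of the default HDFS placement for replication factor 3.

    `topology` is a hypothetical dict mapping rack id -> list of node names.
    Replica 1: the writer's own node (local rack).
    Replicas 2 and 3: two different nodes on one randomly chosen remote rack,
    so only two racks are involved and inter-rack write traffic is cut.
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = random.choice([r for r in topology if r != local_rack])
    second, third = random.sample(topology[remote_rack], k=2)
    return [writer_node, second, third]
```

Note how the sketch mirrors the trade-off in the text: a block ends up on two unique racks instead of three, so a rack loss still leaves a live replica, but reads draw on the bandwidth of only two racks.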

The current, default replica placement policy described here is a work in progress.

Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
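"Closest" here is distance in a tree-shaped topology (data center / rack / node). A small sketch of that idea, assuming locations are written as Hadoop-style topology paths like `/dc1/rack1/node1` (the path strings are illustrative, not required by any API):

```python
def network_distance(a, b):
    """Tree distance between two topology paths, e.g. '/d1/rack1/node2'.

    Counts the hops from each side up to their lowest common ancestor:
    same node -> 0, same rack -> 2, different racks -> 4, and in a
    three-level hierarchy, different data centers -> 6.
    """
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def closest_replica(reader, replicas):
    """Prefer the replica with the smallest network distance to the reader."""
    return min(replicas, key=lambda r: network_distance(reader, r))
```

With this metric, a same-rack replica (distance 2) always beats a remote-rack one (distance 4), and a local-data-center replica beats any cross-data-center replica (distance 6), matching the preference order described above.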



Source: https://www.haomeiwen.com/subject/rbxocltx.html