I. Core Concepts
Basics
Its basic characteristics: a Java-based, open source, NoSQL, non-relational, column-oriented, distributed, fault-tolerant database built on top of the Hadoop Distributed Filesystem (HDFS),
hosting a few large tables of sparse data (billions/trillions of rows by millions of columns; some books also say billions of rows by billions of columns), while allowing for very low latency and near real-time random reads and random writes.
However, note HBase's limitations:
- HBase is not an SQL RDBMS replacement.
- HBase is not a transactional database.
- HBase doesn’t provide an SQL API.
Distributed
There are two distributed modes: pseudo-distributed and fully distributed. Fully distributed mode can only run on HDFS, whereas pseudo-distributed mode can run on the local filesystem or on HDFS.
HBase supports auto-sharding, which implies that tables are dynamically split and distributed by the database when they become too large.
Big Data Store
HBase supports hosting very large tables with billions of rows and millions (or even billions) of columns. HBase can handle petabytes of data.
It is designed for queries of massive data sets and is optimized for read performance.
Flexible Data Model
The basic unit of storage in HBase is a table. A table consists of one or more column families, which in turn consist of columns. Data is stored in rows. A row is a collection of key/value pairs, uniquely identified by a row key. Row keys are created when table data is added; the row keys determine the sort order and are used for data sharding, which is splitting a large table and distributing data across the cluster.
Columns may be added to a column family as required, without predefining the columns. Only the table and its column family/ies are required to be defined in advance. No two rows in a table are required to have the same column(s). All columns in a column family are stored in close proximity.
HBase does not support transactions.
HBase does not have the notion of data types; all data is stored as arrays of bytes.
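Since everything is stored as raw bytes, serialization is the client's responsibility. A minimal sketch using the standard `org.apache.hadoop.hbase.util.Bytes` helper (the values shown are hypothetical):

```java
import org.apache.hadoop.hbase.util.Bytes;

public class BytesExample {
    public static void main(String[] args) {
        // HBase has no data types: row keys, qualifiers, and values are all byte[].
        byte[] rowKey = Bytes.toBytes("user#42");   // String -> bytes
        byte[] count  = Bytes.toBytes(1250L);       // long   -> 8 bytes
        byte[] price  = Bytes.toBytes(19.99d);      // double -> 8 bytes

        // The application must remember the original type to decode correctly.
        System.out.println(Bytes.toString(rowKey)); // user#42
        System.out.println(Bytes.toLong(count));    // 1250
        System.out.println(Bytes.toDouble(price));  // 19.99
    }
}
```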
Scalable
The basic unit of horizontal scalability in HBase is a region. A table's rows are distributed across regions. A region is a sorted set consisting of a range of adjacent rows stored together. A table's data can be stored in one or more regions. When a region becomes too large, it splits at the middle row key into two approximately equal regions.
Roles in Hadoop Big Data Ecosystem
HBase stores data in StoreFiles on HDFS. HBase does not make use of the MapReduce framework of Hadoop but could serve as the source and/or destination of MapReduce jobs.
HBase improves on batch processing systems by enabling streamed access to large data sets.
Differences from an RDBMS
...
- Joins are not supported natively; joins can be implemented with MapReduce.
- Supported data structures: unstructured, semi-structured, and structured.
- Triggers and stored procedures are not supported, but coprocessors can achieve the same purpose.
- Storage model: HBase stores data as store files (HFiles) on the HDFS DataNodes (HFile is the file format for HBase, and a store file is a lightweight wrapper around an HFile), whereas an RDBMS uses tablespaces.
- Query rate: HBase can serve millions of queries per second, whereas an RDBMS serves thousands of queries per second.
Apache HBase and HDFS
In addition to storing table data, HBase also stores the write-ahead logs (WALs), which store data before it is written to HFiles on HDFS.
- HBase provides a Java API for client access. HBase itself is an HDFS client and makes use of the Java class DFSClient. HBase uses this class to connect to the NameNode to get block locations for DataNode blocks and to add data to the DataNode blocks. HBase leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS). HBase requires some configuration at the client side (HBase) and the server side (HDFS).
- Note: HBase on HDFS has some characteristics to be aware of, such as a large number of open files.
- Regarding the read and write paths for producers and consumers in HBase on HDFS, note that writes can also go directly to HDFS.
HBase uses three types of files:
• WALs or HLogs
• Data files (also known as store files or HFiles)
• 0 length files: References (symbolic or logical links)
HBase communicates with both the NameNode and the DataNodes using HDFS client classes such as DFSClient.
HBase write errors, including communication issues between DataNodes, are logged to the HDFS logs and not the HBase logs (DataNode-related logs are still recorded by HDFS itself).
HBase Storage Structure
The basic element of distribution for an HBase table is a region, which in turn comprises one Store per column family. A Store has an in-memory component called the MemStore and a persistent storage component called an HFile or StoreFile.
When data is added to HBase, the following sequence is used to store the data:
- The data is first written to a WAL called the HLog.
- The data is then written to an in-memory MemStore.
- When the MemStore exceeds a certain threshold, the data is flushed to disk as an HFile (also called a StoreFile).
- HBase merges smaller HFiles into larger HFiles with a process called compaction (a client-side sketch of this write path follows).
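From the client's perspective this pipeline is invisible: a write is simply a Put issued through the Java API. A minimal sketch, assuming a table named table1 with a column family cf1 (both hypothetical), followed by an explicit flush that would normally happen automatically once the MemStore threshold is exceeded:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("table1"));
             Admin admin = conn.getAdmin()) {

            // The Put is appended to the WAL (HLog) and then written to the MemStore.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);

            // Force a MemStore flush to an HFile (StoreFile); normally this
            // happens automatically once the flush threshold is exceeded.
            admin.flush(TableName.valueOf("table1"));
        }
    }
}
```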
HBase Components
- Master
- RegionServers
- Regions within a RegionServer
- MemStores and HFiles within a Region
Data Locality
For efficient operation, HBase needs data to be available locally, so it is a best practice to run an HDFS DataNode on each RegionServer. HDFS must be running on the same cluster as HBase for data locality.
HBase Ecosystem
The HBase ecosystem consists of Apache Hadoop HDFS for data durability and reliability (write-ahead log), Apache ZooKeeper for distributed coordination, and built-in support for running Apache Hadoop MapReduce jobs.
HBase architecture services
The RegionServer contains a set of regions and is responsible for handling reads and writes. The region is the basic unit of scalability and contains a subset of a table's data as a contiguous, sorted range of rows stored together. The Master coordinates the HBase cluster, including assigning and balancing the regions. The Master handles all the admin operations, including create/delete/modify of a table. ZooKeeper provides distributed coordination.
The Write Path to Insert Data
Scaling
At startup, each region is assigned to a RegionServer, and the Master may move a region from one RegionServer to another for load balancing. The Master also handles RegionServer failures by assigning the regions handled by the failed RegionServer to another RegionServer. The Region ➤ RegionServer mapping is kept in a system table called .META.. From the .META. table, the client can find which region is responsible for which key.
For Put and Get operations, clients don't have to contact the Master and can directly contact the RegionServer responsible for handling the specified row. For a client scan, the client can directly contact the RegionServers responsible for handling the specified set of keys. The client queries the .META. table to identify a RegionServer. The .META. table is a system table used to track the regions; it contains the RegionServer names, region identifiers (IDs), the table names, and the startKey for each region.
To avoid having to look up the region location again and again, the client keeps a cache of region locations. The cache is refreshed when a region is split or moved to another RegionServer due to balancing or assignment policies; the cache is refreshed by getting updated information from the .META. table.
The .META. table is a table like any other, and the client has to find out from ZooKeeper which RegionServer the .META. table is located on. Since HBase 0.96, the META location is stored in ZooKeeper.
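In practice the meta lookup and location caching are handled inside the client library; application code simply issues a Get or Scan. A minimal sketch (table, family, and row names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("table1"))) {

            // Get: the client resolves the owning region via hbase:meta (cached)
            // and contacts the responsible RegionServer directly, not the Master.
            Result row = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"))));

            // Scan: the client contacts, in turn, each RegionServer whose
            // region overlaps the requested key range.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("row1"))
                .withStopRow(Bytes.toBytes("row9"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```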
II. Data Model
Physical Storage
The filesystem used by Apache HBase is HDFS. HDFS is an abstract filesystem that stores data on the underlying disk filesystem. HBase indexes data into HFiles and stores the data on the HDFS Datanodes.
An HBase client communicates with ZooKeeper and the HRegionServers. The HMaster coordinates the RegionServers. The RegionServers run on the DataNodes; each RegionServer is collocated with a DataNode (the purpose being local data storage).
Column Family and Column Qualifier
Column qualifiers are the column names, also known as column keys
- The row keys form the primary index and the column qualifiers form the per-row secondary index. Both the row keys and the column keys are sorted in ascending lexicographical order.
- Each row can have different column qualifiers.
- Column families exist for performance reasons. Each row in a table has the same column family/ies, although a row does not have to store a value in every column family.
- A column qualifier is the actual column name, providing an index for a column.
A KeyValue consists of a key and a value, with the key comprising the row key plus the column family plus the column qualifier plus the timestamp.
It is recommended to use only a small number of column families, because each column family is stored in its own data file; too many column families can lead to many open data files.
Row Versioning
Timestamps are stored in descending order in an HFile, which implies that the most recent timestamp is stored first. The timestamp identifies a version and must be unique for each cell. A {row, column, version} tuple specifies a cell in a table. A KeyValue consists of a key and a value, with the key comprising row key + column family + column qualifier + timestamp.
Note: the number of versions is configurable per column family. Changing versions is also done per column family, for example: alter 'table1', NAME => 'cf1', VERSIONS => 5
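The same change can also be made through the Java Admin API; a sketch of what the shell command above roughly corresponds to (HBase 2.x builder API; table and family names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class AlterVersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Equivalent of: alter 'table1', NAME => 'cf1', VERSIONS => 5
            ColumnFamilyDescriptor cf1 = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("cf1"))
                .setMaxVersions(5)   // keep up to 5 versions per cell in this family
                .build();
            admin.modifyColumnFamily(TableName.valueOf("table1"), cf1);
        }
    }
}
```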
Logical Storage
The entire cell, including all the structural information, is a KeyValue object (a cell can be thought of as a KeyValue object) and includes {row key, column key {column family:column qualifier}, timestamp, and value}. A KeyValue object is sorted by row key first (primary index) and by column key next (secondary index). A KeyValue could also be called a cell.
Data in a cell is not updated on a table update. Every update creates a new cell.
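To see the {row key, column family, column qualifier, timestamp, value} structure, a client can request multiple versions and iterate over the returned cells. A minimal sketch using the HBase 2.x Cell API (table and column names are hypothetical):

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("table1"))) {

            Get get = new Get(Bytes.toBytes("row1"));
            get.readVersions(3); // return up to 3 versions of each cell

            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                // Each cell carries the full key: row + family + qualifier + timestamp.
                System.out.printf("%s %s:%s @ %d = %s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```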
Architecture
Major Components of a Cluster
In a distributed cluster, the Master is typically located on the same node as the HDFS NameNode, while the RegionServers are located on the same nodes as the HDFS DataNodes. For a small cluster, a ZooKeeper node may be collocated with the NameNode, but for a large cluster, ZooKeeper should run on a separate node.
Master
The Master manages the cluster. The Master assigns Regions to RegionServers on startup and failover, and performs load balancing.
RegionServers
For example, if a table has row keys A, H, N, S, V, Y, Z, row keys [D-G] and [V-Z] could be on RegionServer 1, row keys [A-C] and [R-U] could be on RegionServer 2, and row keys [H-M] and [N-Q] could be on RegionServer 3
The RegionServer runs several background threads, including those for minor/major compactions, MemStore flush to StoreFile, and the WAL. RegionServers are collocated with the DataNodes, providing data locality.
ZooKeeper
The ZooKeeper bootstraps and coordinates the cluster. The ZooKeeper provides shared state information for components of the distributed system. ZooKeeper also provides server failure notifications so that a Master can failover to another RegionServer. The ZooKeeper can be a single node or an ensemble of nodes.
ZooKeeper is also used to store metadata for operations such as the master address and recovery state. The hbase:meta table (previously .META.) stores a list of all regions in the system; the location of hbase:meta is stored in ZooKeeper. An odd number (3, 5, 7) of nodes in a ZooKeeper ensemble is recommended because it tolerates more node failures: an ensemble of 2n+1 nodes tolerates n failures.
To start an HBase cluster, a ZooKeeper should be started first, followed by a Master, followed by RegionServers.
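Because the cluster is bootstrapped through ZooKeeper, the ZooKeeper quorum address is usually the only location a client needs to configure (normally in hbase-site.xml; the programmatic form below is a sketch with hypothetical hostnames):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZooKeeperClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client discovers the Master and RegionServers through ZooKeeper,
        // so only the ensemble address and port are needed here.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```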
Regions
A region has a startKey and an endKey and contains a sorted, contiguous range of rows; each region contains multiple column families.
A complete table is not necessarily stored on the same region or even the same RegionServer.
Each region can be on a different node and may consist of several HDFS files and blocks, each of which is replicated
The hierarchy of components including the region is table ➤ region ➤ store ➤ MemStore and store file ➤ block
Store
A store holds data and is made up of a single MemStore and zero or more StoreFiles. A StoreFile is a façade on an HFile, which is stored on disk in HDFS. StoreFiles are made of blocks. The block size is configured per column family.
Data modifications are first stored in memory (MemStore) and flushed to disk at regular intervals, when a memory threshold is exceeded, or explicitly with a shell command. Each flush generates an HFile.
Region
A region is the unit of horizontal scalability in HBase. A region has a startKey and an endKey and contains a sorted, contiguous range of rows.
Regions are non-overlapping; the same row key is not stored in multiple regions (note, however, that a region's underlying data is still replicated at the HDFS block level).
Each region can be on a different node and may consist of several HDFS files and blocks
Regions are made available to clients by RegionServers
A small number (20-200) of medium-to-large (5-20GB) regions per RegionServer is recommended; around 100 regions is a good number. The reasons for keeping the region count low are:
- The available heap space is a limiting factor in selecting the number of regions. Approximately 2MB is required per MemStore, and there is one MemStore per column family per region. With 100 regions and 3 column families per region, the MemStore heap space requirement is 600MB. Having fewer regions reduces the MemStore heap requirement.
- A large number of regions generates a large number of tiny MemStore flushes; since each flush generates a StoreFile (a façade on an HFile), a large number of StoreFiles are generated, which in turn require more compactions. Also, the MemStore and the StoreFile indexes require more heap space (the relevant flush and region-size settings are sketched after this list).
- A large number of regions puts a load on the Master, because the Master has to assign/reassign regions to RegionServers and also has to move regions around for load balancing.
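These trade-offs are governed mainly by two server-side settings: the region split size and the MemStore flush size. They normally live in hbase-site.xml; the sketch below shows the programmatic form with illustrative values only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionSizingConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // Maximum size of a region before it is split (10GB here).
        conf.setLong("hbase.hregion.max.filesize", 10L * 1024 * 1024 * 1024);

        // MemStore size at which a flush to an HFile is triggered (128MB default).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);

        System.out.println(conf.get("hbase.hregion.max.filesize"));
        System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
    }
}
```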
Compactions
When the number of StoreFiles in a store gets too large, the RegionServer performs a compaction to merge the StoreFiles into a smaller number of StoreFiles.
Failover
With multiple RegionServers serving data, a RegionServer could fail and the regions on the RegionServer could become unavailable. The ZooKeeper detects the RegionServer failure. But, because data is replicated across the cluster, the Master performs a failover to another RegionServer hosting the same set of row keys in a region. Again, regions provide a fundamental unit of the logical data model. Regions are assigned similarly as on startup.
Region Locality
Locality is the closeness of a region to its RegionServer (the region's data residing near the RegionServer). Region locality is achieved with HDFS block replication across the cluster. A client contacts a RegionServer, and if a region is closer to the RegionServer, less network transfer is required for client operations.
Partitioning
Regions provide partitioning of data. Multiple clients accessing a table can use different regions of the table and as a result won’t overload a partition (region) and RegionServer. Having multiple regions reduces the number of disk seeks required to find a row, and data is returned faster (low latency) to a client request.
Region Splitting
Regions split when a threshold is exceeded. Splits are handled by the RegionServer, which splits a region and offlines the split region.
Finding a Row in a Table
The lookup sequence is as follows. The Master is not involved in finding a row in an HBase table. But how does a client find which RegionServer stores the row(s) of data? The Region ➤ RegionServer mapping is stored in the hbase:meta catalog table (also known as the META table). As hbase:meta is also a table, just like any other HBase table, it is stored on a RegionServer. The location of hbase:meta is kept in ZooKeeper, on assignment by the Master. When HBase starts up, the Master assigns the regions to each RegionServer, including the regions for the hbase:meta table.
Block Cache
The block cache caches the following information, which is used in finding a row of data:
- Row data: Each Get or Scan that yields data not already in the block cache adds that data to the block cache.
- Row keys: When a value (key/value) is loaded into the block cache, its key is also cached. It is advantageous to make the keys small so that they occupy less space in the cache.
- The hbase:meta table, which keeps track of the RegionServer ➤ region mappings, is given in-memory priority and is kept in memory for as long as feasible. The hbase:meta table could consume several MB of cache if a large number of regions are defined.
- Block indexes of HFiles are stored in the block cache. Using an index, a client is able to find a row of data without having to read the entire HFile. Index size is a function of the row key size, the block size, and the amount of data stored in an HFile.
Compactions
Each flush of the MemStore generates a StoreFile. The MemStore size at which a flush is performed is set in hbase.hregion.memstore.flush.size, which is 128MB by default. If the number of StoreFiles in a store exceeds certain limits/thresholds, the files are compacted into larger StoreFiles.
Compaction is the process of creating a larger StoreFile (HFile) by merging smaller StoreFiles. Note that compactions do not merge regions. Two types of compactions are performed: minor compaction and major compaction. A major compaction merges all the files.
Compaction is performed on a per-region basis.
Compactions are performed to improve read performance.
Compaction can become necessary when HBase has to scan too many StoreFiles to find a result (or to determine that none exists).
Minor Compactions
A minor compaction merges two or more smaller StoreFiles into one larger StoreFile. Minor compactions do not drop deletes or excess/expired versions.
Major Compactions
A major compaction merges all the StoreFiles in a store into a single StoreFile. In a major compaction, deleted and duplicate key/values are removed. Major compactions run automatically at a frequency set in the configuration property hbase.hregion.majorcompaction, which has a default value of 7 days.
Write Throughput and Compactions
Note: for write-heavy workloads, set the compaction frequency lower. Frequent compactions can affect write throughput, so for write-intensive workloads less-frequent compactions are recommended.
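One common approach for write-intensive workloads is to disable time-based major compactions and trigger them manually during off-peak hours. A sketch (the interval property is in milliseconds, 0 disables automatic major compactions, and it is a server-side setting normally placed in hbase-site.xml; names and values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactionTuningExample {
    public static void main(String[] args) throws Exception {
        // Server-side setting: 0 disables periodic major compactions
        // (shown programmatically here only to illustrate the property).
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.hregion.majorcompaction", 0L);

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Trigger a major compaction explicitly, e.g. from a nightly job.
            admin.majorCompact(TableName.valueOf("table1"));
        }
    }
}
```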
Region Failover
When a RegionServer crashes, all the regions on the RegionServer migrate to another RegionServer. The Master handles RegionServer failures by assigning the regions handled by the failed RegionServer to another RegionServer.
MTTR (Mean Time to Recover) is the average time required to recover from a failed RegionServer. The objective of MTTR for HBase regions metric is to detect failure of a RegionServer and restore access to the failed regions as soon as possible.
The Role of the ZooKeeper
The ZooKeeper has the all-important role of detecting RegionServer crashes and notifying the Master so that the Master may perform the failover to another RegionServer.
The ZooKeeper coordinates, communicates, and shares state between the Master/s and the RegionServer. The ZooKeeper is a client/server system for distributed coordination and it provides an interface similar to a filesystem, consisting of nodes called znodes, which may contain transient data. When a RegionServer starts, it creates a sub-znode for describing its online state.
HBase Resilience
- HBase puts table data in HFiles, which are stored in HDFS. HDFS replicates the blocks of the HFiles, three times by default.
- HBase keeps a commit log called a write-ahead log (WAL), also stored in HDFS and also replicated three times by default.
Phases of Failover
The client contacts the RegionServer directly for a write. The RegionServer is collocated with a datanode. The HBase table data is written to the local datanode and subsequently replicated to other datanodes with three replicas by default. The ZooKeeper keeps a watch on all of the RegionServers.
Failure Detection
Detecting RegionServer failure due to a crash is performed by the ZooKeeper. Each RegionServer is connected to the ZooKeeper and the Master monitors these connections. When a ZooKeeper detects that a RegionServer has crashed, the ZooKeeper ends the RegionServer’s session and notifies the Master about the RegionServer. The Master declares the RegionServer as unavailable by notifying the client. The Master starts the data recovery process and subsequent region reassignment.
Data recovery makes use of the edits stored in the WALs. One logical WAL is created per region; one physical WAL is created per RegionServer.
Data Recovery
The recovery process is slower if not only the RegionServer but also the node (machine) on which it was running has crashed. Since WAL logs are replicated three times, with one of the replicas on the HDFS DataNode on the same node (machine) as the RegionServer, 1/3 (33%) of the replicas become unavailable. During data recovery, 33% of reads go to the failed DataNode first and are redirected to a non-failed DataNode.
Regions Reassignment
The objective is to reassign the regions as fast as possible. ZooKeeper has an important role in the reassignment, which requires synchronization between the Master and the RegionServers through ZooKeeper. Of these phases, failure detection takes about 30-90 seconds, data recovery about 10 seconds, and region reassignment about 10 seconds.
Creating a Column Family
A column consists of a column family and a column qualifier.
Column families must be declared when a table is created, but the column qualifiers may be created on an as-needed basis dynamically.
The maximum number of row versions is configured per column family
All column family members are stored together on disk
All columns within a column family share the same characteristics such as versioning and compression.
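A sketch of creating a table where the column families (and their per-family settings such as versions and compression) are fixed up front, while qualifiers remain free-form (HBase 2.x builder API; all names are hypothetical, and Snappy must be available on the cluster for the compression setting to work):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("table1"))
                // Column families must be declared up front...
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf1"))
                    .setMaxVersions(3)                                 // per-family versioning
                    .setCompressionType(Compression.Algorithm.SNAPPY)  // per-family compression
                    .build())
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("cf2"))
                    .build())
                .build();

            admin.createTable(desc);
            // ...but column qualifiers within cf1/cf2 can be added freely at write time.
        }
    }
}
```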
Schema Design
Region Splitting
A table's row keys are not all stored in the same region; they are distributed across the cluster, stored in different regions on different RegionServers.
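Since splitting governs how row keys spread across RegionServers, a table can also be pre-split at creation time so that its regions are distributed from the start instead of relying only on automatic splits. A sketch (split keys and names are hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("pre_split_table"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf1"))
                .build();

            // Four split keys -> five initial regions:
            // (-inf,g) [g,n) [n,t) [t,z) [z,+inf)
            byte[][] splitKeys = {
                Bytes.toBytes("g"),
                Bytes.toBytes("n"),
                Bytes.toBytes("t"),
                Bytes.toBytes("z")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```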