2018-01-09 Hadoop Platform and A


Author: 鸭鸭学语言 | Published: 2018-01-09 20:25

Overview of Hadoop Stack

HDFS holds the data. YARN is the resource manager. MapReduce is one option for the execution engine, and Spark is another. Tez is also an option in Hadoop 2.0. Applications are layered on top of these engines.

HBase - a scalable, distributed database that supports structured data storage for large tables

Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying

Pig - a high-level data-flow language and execution framework for parallel computation

Spark - a fast and general compute engine for Hadoop data. It covers a wide range of applications: ETL, machine learning, stream processing, and graph analytics.


Cloudera Setup:


HDFS and HDFS2

Concept:

    Scalable distributed filesystem

    Distribute data on local disks on several nodes

    Low cost commodity hardware

Design goals:

    Resilience - recover when nodes or components within a node fail

    Scalability - spread the data as blocks over many nodes; scale namespace capacity

    Application Locality - the data is large but the application is not, so the application is shipped to each compute node and tasks run on the node that holds the data

    Portability - runs on commodity hardware across a wide range of OS types, with few changes needed
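The application-locality goal can be illustrated with a small sketch. This is not Hadoop's scheduler. The function name `pick_node`, the block IDs, and the node names are all hypothetical; the code only shows the idea of preferring a node that already stores the block.

```python
from typing import Dict, List, Set


def pick_node(block_id: str,
              block_locations: Dict[str, List[str]],
              free_nodes: Set[str]) -> str:
    """Pick a free node for a task, preferring one that holds the block.

    block_locations maps a (hypothetical) block ID to the nodes storing
    a replica of it. free_nodes is the set of nodes with spare capacity.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    # First choice: a free node that already stores the block (data-local).
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node
    # Fallback: any free node; the block then has to travel over the network.
    return sorted(free_nodes)[0]
```

With replicas of `blk_1` on `node-a` and `node-b`, and only `node-b` and `node-d` free, the task lands on `node-b`, so no data needs to move.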

Architecture:

    Single NameNode 

        Metadata is information about the filesystem state: block information, edit and transaction info, and locks

    Multiple DataNodes - data is spread as blocks across many nodes

        Manage storage - blocks of data (downward)

        Serve read/write requests from clients (upward)

        Block creation, deletion, replication (horizontally) - the default replication factor is 3
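A minimal sketch of how a file turns into blocks and how each block gets several replicas. The round-robin placement here is an assumption made for illustration; real HDFS placement also takes racks into account. The function names and node names are hypothetical.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block of a file."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, rest = divmod(file_size, block_size)
    # All blocks are full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int,
                   nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes starting at a block-dependent offset, so the
        # replicas of one block never share a node.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement
```

For a 300-byte file with a 128-byte block size, `split_into_blocks(300, 128)` gives two full blocks and one 44-byte block, and `place_replicas` then maps each of the three blocks to three different nodes.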

    From Hadoop 2.0 (Federation):

        Multiple NameNodes instead of a single one. Multiple namespaces provide scalability. Each namespace has its own block pool, the set of blocks that belong to that namespace. Block pools are spread out over all DataNodes.

        A standby NameNode takes snapshots, but failover is handled manually by default.

        Heterogeneous Storage - ARCHIVE, SSD, RAM_DISK
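One way to picture federation from the client side is a mount table that maps path prefixes to the NameNode owning that namespace. The sketch below is only a toy longest-prefix lookup; the function name `resolve_namenode`, the paths, and the NameNode labels are hypothetical.

```python
from typing import Dict


def resolve_namenode(path: str, mount_table: Dict[str, str]) -> str:
    """Return the NameNode responsible for `path` (longest matching prefix wins)."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything below it, but not siblings
        # that merely share leading characters (e.g. /data vs /database).
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[best]
```

With `/data` mounted on one NameNode and `/data/logs` on another, a lookup for `/data/logs/2018/01.log` resolves to the second, more specific one.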


MapReduce Framework

Basic idea: (1) A job splits the data into chunks, and the framework dispatches map tasks to (2) the compute nodes to process the chunks. Once the chunks have been processed, the framework sorts the map output. Reduce tasks take the sorted map output as input and perform the reduction operations.

Typically, compute and data nodes are the same, so MapReduce tasks and HDFS are running on the same nodes.
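The map, sort, and reduce steps can be sketched in plain Python with a word count. This runs in a single process and only mimics the data flow; the function names are hypothetical and nothing here uses the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple


def map_phase(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map task: emit (word, 1) for every word in one chunk of input."""
    for word in chunk.split():
        yield (word.lower(), 1)


def reduce_phase(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce task: sum all counts seen for one key."""
    return (key, sum(values))


def run_job(chunks: List[str]) -> List[Tuple[str, int]]:
    """Run map on every chunk, sort the map output by key, then reduce."""
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    # The framework sorts map output so that equal keys become adjacent.
    mapped.sort(key=itemgetter(0))
    return [reduce_phase(key, (count for _, count in group))
            for key, group in groupby(mapped, key=itemgetter(0))]
```

Calling `run_job(["hadoop yarn", "yarn spark yarn"])` treats each string as one chunk and returns the per-word totals in key order.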

Before YARN was born in Hadoop 2.0:

Single master JobTracker (1) - schedules and monitors tasks, and re-executes failed ones. It is the main daemon in Hadoop. It starts TaskTrackers on the slave nodes (compute nodes / data nodes).

One slave TaskTracker per cluster node (2) - executes tasks at the JobTracker's request (with an HDFS handler).


YARN

Split out of MapReduce. Main idea: separate resource management from job scheduling / monitoring.

Overall coordination -- ResourceManager : runs on the master node. It receives job requests from clients, node status from the NodeManagers (which resources are available), and application status from the ApplicationMasters.

Resource management part -- NodeManager : runs on each node, launches containers, and reports resource usage. Container allocation is decided by the ResourceManager's scheduler, such as the Capacity Scheduler or Fair Scheduler, which assigns resources to jobs based on capacity and queues.

Job scheduling / monitoring part -- ApplicationMaster : one per application, running on one of the nodes. Together, the ApplicationMasters take over that part of the original single JobTracker.

So YARN does the work of MapReduce's (1) part, but it schedules jobs at a finer granularity, down to the container level.
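A toy model of the split described above: nodes report free memory, and a central manager hands out containers on request. The class `ToyResourceManager`, its methods, and the first-fit allocation rule are assumptions made for illustration; they are not YARN's API or its scheduling policy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Container:
    node: str
    memory_mb: int


class ToyResourceManager:
    """Tracks free memory per node and grants containers first-fit."""

    def __init__(self) -> None:
        self.free_mb: Dict[str, int] = {}

    def node_heartbeat(self, node: str, free_mb: int) -> None:
        """A node manager reports how much memory it currently has free."""
        self.free_mb[node] = free_mb

    def allocate(self, count: int, memory_mb: int) -> List[Container]:
        """An application master asks for `count` containers of `memory_mb` each.

        Returns as many containers as fit; the list may be shorter than asked.
        """
        granted: List[Container] = []
        for node in sorted(self.free_mb):
            while len(granted) < count and self.free_mb[node] >= memory_mb:
                self.free_mb[node] -= memory_mb
                granted.append(Container(node, memory_mb))
        return granted
```

If `node-a` reports 2048 MB free and `node-b` reports 1024 MB, a request for three 1024 MB containers gets two on `node-a` and one on `node-b`, and a further request returns an empty list.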

YARN has features below also:

    High Availability ResourceManager in the newest Hadoop release - one standby RM.

    Timeline server - stores and traces application history, such as how many map/reduce tasks ran and how many resources were used.

    cgroups - manage the resources used by containers. YARN also supports secure containers, restricted to particular users.

    RESTful APIs providing web services for cluster access.
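As a small example of the REST access, the ResourceManager exposes cluster information under `/ws/v1/cluster/info` on its web port (8088 by default). The sketch below assumes that default port and a reachable ResourceManager; the host name used in the note is hypothetical and the helper names are my own.

```python
import json
from urllib.request import urlopen


def cluster_info_url(rm_host: str, rm_port: int = 8088) -> str:
    """Build the URL of the ResourceManager's cluster-info endpoint."""
    return f"http://{rm_host}:{rm_port}/ws/v1/cluster/info"


def fetch_cluster_info(rm_host: str, rm_port: int = 8088,
                       timeout: float = 5.0) -> dict:
    """Fetch and decode the cluster-info JSON (needs a running cluster)."""
    with urlopen(cluster_info_url(rm_host, rm_port), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_cluster_info("rm.example.com")` would only work against a live cluster at that (made-up) address; `cluster_info_url` on its own is enough to see which URL to open in a browser or with `curl`.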


Lesson 3 Slides
