2018-01-05 Hadoop Platform and A

Author: 鸭鸭学语言 | Published: 2018-01-05 21:23

    What is Hadoop

    Hadoop was created in 2005 by Doug Cutting (who worked at Yahoo and is now Chief Architect at Cloudera) and Mike Cafarella.

    "Hadoop" is the name of a toy elephant that belonged to Doug's son.


    Hadoop is (1) Apache open-source software and (2) a framework for (3) storage and (4) large-scale processing of data sets on (5) clusters of commodity hardware.

    (2) The framework is provided by MapReduce: a shared, integrated foundation to which additional tools can be added.

    (4) Large-scale processing runs on (3) cheap compute and storage. (5) Scalability is the core idea.

    A new way to (4) ingest and analyze data: schema-on-read style, where the schema is applied while reading the raw data instead of being defined before the data is loaded (schema-on-write). More granularity; complex analytics on small amounts of data.
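
    The schema-on-read idea above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the raw text, the field names, and the read_with_schema helper are all made up for the example. The point is that the raw data is stored untouched and a schema is chosen only when it is read.

```python
import csv
import io

# Raw data is stored as-is; no schema is enforced when it is written.
RAW_LOG = "2018-01-05,alice,42\n2018-01-05,bob,17\n"

# Hypothetical schema (field name, type), chosen only at read time.
SCHEMA = [("date", str), ("user", str), ("clicks", int)]


def read_with_schema(raw_text, schema):
    """Parse raw lines and apply the schema to each record while reading."""
    records = []
    for row in csv.reader(io.StringIO(raw_text)):
        record = {name: cast(value) for (name, cast), value in zip(schema, row)}
        records.append(record)
    return records


records = read_with_schema(RAW_LOG, SCHEMA)
total_clicks = sum(r["clicks"] for r in records)
print(total_clicks)
```

    A different analysis could read the same raw text with a different schema, which is what schema-on-write does not allow without reloading the data.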


    Hadoop MapReduce is derived from Google's MapReduce.

    Hadoop HDFS (Hadoop Distributed File System) is derived from the Google File System (GFS).


    Framework Basic Modules

    Hadoop Common

    Base libraries and tools needed by the other modules.


    Hadoop Distributed File System

    Stores data on commodity machines across the entire cluster.

    Written in Java.

    1 Hadoop cluster = 1 NameNode + 1 HDFS cluster of DataNodes.

    NameNode = a primary NameNode plus a Secondary NameNode, which builds snapshots of the primary's metadata.
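
    As a rough mental model of the roles above, here is a toy Python sketch. It is not the HDFS API: the class names, the round-robin block placement, and the block naming are assumptions made for illustration (real HDFS also replicates each block). It only shows that the NameNode holds metadata about where blocks live, and that the secondary keeps a snapshot of that metadata.

```python
import copy


class ToyNameNode:
    """Holds metadata only: which DataNode stores each block of each file."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # file path -> list of (block id, DataNode name)

    def add_file(self, path, num_blocks):
        # Assumption: simple round-robin placement, no replication.
        self.block_map[path] = [
            (f"{path}#blk{i}", self.data_nodes[i % len(self.data_nodes)])
            for i in range(num_blocks)
        ]

    def locate(self, path):
        return self.block_map[path]


class ToySecondaryNameNode:
    """Keeps a point-in-time copy of the primary's metadata."""

    def __init__(self):
        self.snapshot = {}

    def checkpoint(self, primary):
        self.snapshot = copy.deepcopy(primary.block_map)


primary = ToyNameNode(["dn1", "dn2", "dn3"])
primary.add_file("/logs/a.txt", 4)

secondary = ToySecondaryNameNode()
secondary.checkpoint(primary)

# Changes made after the checkpoint are not in the snapshot.
primary.add_file("/logs/b.txt", 2)
print(primary.locate("/logs/a.txt"))
```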


    Hadoop YARN

    Manages compute resources in the cluster and uses them to schedule users' applications.


    Hadoop MapReduce

    Programming model 

    Scales data processing across many different processes.

    Engine:

        The JobTracker dispatches a job's tasks to the TaskTrackers in the cluster.
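
    The programming model can be illustrated with the classic word count, written here as plain single-process Python rather than with the Hadoop API. The function names (map_phase, shuffle, reduce_phase, run_job) are made up for this sketch; on a real cluster the map and reduce steps would run as separate tasks on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["hadoop stores data", "hadoop processes data"])
print(counts)
```

    Because each line is mapped independently and each key is reduced independently, both steps can be spread across many processes.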

    Zoo (Google's original stack)

    Bigtable : the basis for HBase; handles massive data tables.

    MySQL Gateway : adapted to allow querying the data.

    Sawzall : high-level access to MapReduce in the cluster, used to submit jobs.

    Evenflow : chains together complex workflows and coordinates events and services.

    Dremel : a metadata manager, able to process very large amounts of unstructured data.

    Chubby : coordinates all of the above.


    Cloudera Stack

    Ecosystem

    Core components

    Sqoop 

    Transfers bulk data between Hadoop and structured datastores such as relational databases.

    A CLI tool that imports tables or whole databases into HDFS.


    HBASE

    Based on Google's Bigtable.

    Handles massive data tables with billions of rows and millions of columns.


    Hive

    Data warehouse software that facilitates querying and managing large datasets residing in distributed storage. It projects structure on top of the data and lets us use SQL-like queries.

    SQL-like language : HiveQL


    Pig

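    A scripting language, Pig Latin, for creating MapReduce programs on Hadoop.

    It works with other languages in both directions: Pig scripts can call user-defined functions written in languages such as Java or Python, and Pig can be embedded in programs written in those languages.

    Excels at describing data analysis problems as data flows.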


    Oozie

    A workflow scheduler system that manages Hadoop jobs.

    Supports job scheduling for MapReduce, Pig, Hive, Sqoop, etc.
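
    A workflow of dependent jobs can be pictured as a directed acyclic graph. The sketch below does not use Oozie (which defines workflows in its own XML format); it is a plain Python illustration using the standard library's graphlib (Python 3.9+), and the three job names are hypothetical. It shows how a scheduler can derive a valid execution order from job dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
}


def execution_order(dag):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

    Here the import must finish before the cleaning script runs, and the report query runs last.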


    Zookeeper

    It provides a distributed configuration service, a synchronization service (so that all these jobs can be synchronized), and a naming registry for the entire distributed system.

    Distributed applications use ZooKeeper to store and mediate updates to important configuration information on the cluster itself.


    Flume

    Collects, aggregates, and moves large amounts of log data.


    Other components

    Impala

    Massively parallel processing SQL query engine


    Spark

    A scalable data analytics platform that incorporates primitives for in-memory computing. 

    Written in Scala.

    Strong support for machine learning libraries.

    URL

    https://hadoop.apache.org/

    Lesson1 Slides
