2018-01-10 Hadoop Platform and A

Author: 鸭鸭学语言 | Published: 2018-01-10 19:00

    YARN

    It supports the classic MapReduce framework.

    It also supports other open source and commercial applications, such as Impala and Storm, which can run on it without any changes.

    It also supports user-developed applications.

    It also enables frameworks such as Tez and Spark.


    Execution Frameworks: YARN, Tez, Spark

        Support DAGs (directed acyclic graphs) of tasks.

        In-memory caching of data.


    MapReduce

    Application engine.

    Applications that fit the MapReduce paradigm: the input can be split into distributed data chunks that are processed independently of each other (the map phase), and a shuffle process then groups the intermediate data and feeds it into the reduce process.
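
    The map, shuffle, and reduce steps described above can be sketched in plain Python. This is only an illustration of the paradigm on a single machine, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    # Each chunk is processed on its own, with no knowledge of other chunks.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    # Group all intermediate values by key so each key reaches one reducer.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values that share a key into a single result.
    return key, sum(values)


def word_count(chunks):
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```

    Because map_phase never looks at more than one chunk, the chunks could be handled by different machines in any order, which is the property a job needs in order to fit this paradigm.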

    Applications that do not fit the MapReduce paradigm:

        Interactive data exploration - data should be loaded into memory once, to avoid reloading it from disk again and again.

        Iterative data processing - for example, machine learning algorithms.


    Tez

    Application engine.

    Features:

        Handles dataflow graphs through an expressive API.

        Supports custom data types and custom application logic, so it is not restricted to the map and reduce structure that the MapReduce framework imposes.

        Can run a complex DAG of tasks.

        The DAG can change dynamically at runtime.

        Reuses resources (containers) to avoid the cost of container startup, which makes it more efficient.
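
    To make "a DAG of tasks" concrete, here is a minimal sketch that runs tasks in dependency order using Python's standard graphlib module (Python 3.9+). It is not the Tez API; run_dag and the task names (load_a, load_b, join, aggregate) are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps):
    """Run every task after the tasks it depends on.

    tasks: task name -> function taking a dict of upstream results
    deps:  task name -> set of upstream task names
    """
    graph = {name: deps.get(name, set()) for name in tasks}
    results = {}
    order = []
    for name in TopologicalSorter(graph).static_order():
        inputs = {upstream: results[upstream] for upstream in graph[name]}
        results[name] = tasks[name](inputs)
        order.append(name)
    return results, order


# Hypothetical four-task graph: two loads feed a join, which feeds an aggregate.
tasks = {
    "load_a": lambda inputs: [1, 2, 3],
    "load_b": lambda inputs: [10, 20],
    "join": lambda inputs: [x + y for x in inputs["load_a"] for y in inputs["load_b"]],
    "aggregate": lambda inputs: sum(inputs["join"]),
}
deps = {"join": {"load_a", "load_b"}, "aggregate": {"join"}}

if __name__ == "__main__":
    results, order = run_dag(tasks, deps)
    print(order, results["aggregate"])
```

    Each task receives the results of its upstream tasks directly as inputs, so the whole graph runs as one job rather than as a chain of separate map and reduce jobs.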


    Comparing MapReduce and Tez:

        Use case: 

            SELECT a.vendor, COUNT(*), AVG(c.cost) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemid = c.itemid) GROUP BY a.vendor

        (Figure: execution of this query on MapReduce vs. Tez)

    Spark

    Application engine.

    It can run on top of HDFS directly, without needing YARN. It can also run on other storage systems.

    Features:

        Advanced DAG execution engine - data can be shared across DAGs and reused between iterations, which is why it can be much faster than other DAG engines.

        Supports cyclic data flow.

        In-memory computing. When the data does not fit in memory, it spills over to disk gracefully.

        Can be accessed from Java, Scala, Python, and R.

        Existing optimized libraries
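
    The benefit of keeping data in memory between iterations can be illustrated with a small plain-Python sketch. This is not Spark's API: the Dataset class, its cache and collect methods, and refine_estimate are hypothetical stand-ins that only count how often the data is reloaded from its (simulated) source.

```python
class Dataset:
    """Toy dataset that counts how often it is loaded from its source."""

    def __init__(self, loader):
        self._loader = loader
        self._cached = None
        self._use_cache = False
        self.loads = 0

    def cache(self):
        # Ask for the data to be kept in memory after the first load.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1  # simulates an expensive read from disk
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def refine_estimate(dataset, iterations):
    # Iterative algorithm: every pass needs the full dataset again.
    estimate = 0.0
    for _ in range(iterations):
        points = dataset.collect()
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


if __name__ == "__main__":
    uncached = Dataset(lambda: [2.0, 4.0, 6.0])
    cached = Dataset(lambda: [2.0, 4.0, 6.0]).cache()
    refine_estimate(uncached, 5)
    refine_estimate(cached, 5)
    print(uncached.loads, cached.loads)
```

    Both runs compute the same estimate, but the uncached dataset is reloaded on every iteration while the cached one is loaded once, which is the pattern that makes iterative and interactive workloads a poor fit for disk-based MapReduce.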


    Hadoop Resource Scheduling

    Schedulers:

        FIFO (default)

        Fairshare - balances resources between applications. The default resource is memory, but CPU can be added as a resource.

            Balances out resource allocation among apps over time.

            Can organize apps into queues/sub-queues

            Guarantees minimum shares

            Weighted app priorities

        Capacity - guarantees resources for each application

            Queues and sub-queues

            Capacity Guarantee with elasticity

            ACLs for security

            Runtime changes/draining apps

            Resource based scheduling
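
    The idea behind weighted fair sharing can be sketched as a small allocation routine. This is a simplified illustration, not the actual YARN scheduler code: it handles a single resource (say, memory in GB), ignores queues and minimum shares, and the function name weighted_fair_share is made up for this example.

```python
def weighted_fair_share(capacity, demands, weights):
    """Split one resource among apps in proportion to their weights.

    No app receives more than it demands; capacity left over by small
    apps is redistributed among the apps that still want more.
    """
    alloc = {app: 0.0 for app in demands}
    active = {app for app, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[app] for app in active)
        satisfied = set()
        used = 0.0
        for app in sorted(active):
            offer = remaining * weights[app] / total_weight
            grant = min(offer, demands[app] - alloc[app])
            alloc[app] += grant
            used += grant
            if alloc[app] >= demands[app] - 1e-9:
                satisfied.add(app)
        remaining -= used
        if not satisfied:
            break  # every app took its full offer, so nothing is left over
        active -= satisfied
    return alloc


if __name__ == "__main__":
    print(weighted_fair_share(100, {"a": 20, "b": 60, "c": 60}, {"a": 1, "b": 1, "c": 1}))
    print(weighted_fair_share(90, {"x": 100, "y": 100}, {"x": 2, "y": 1}))
```

    In the first call, app a only needs 20 units, so the share it does not use is split between b and c. In the second call, x has twice the weight of y and therefore receives twice as much.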


        Lesson 4 Slides
