
Spark Core Concepts

Author: shone_shawn | Published 2019-08-16 13:54

    Application: an application built on Spark = 1 driver + a set of executors

        User program built on Spark.
        Consists of a driver program and executors on the cluster.
        e.g. spark0402.py, or an interactive pyspark/spark-shell session
    

    Driver program

        The process running the main() function of the application
        and creating the SparkContext
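
    A minimal sketch of what the Application and Driver program definitions above describe, assuming PySpark is installed; the file name app.py and the sample data are placeholders, not part of the original notes:

        # app.py -- minimal PySpark application: 1 driver + executors
        from pyspark import SparkConf, SparkContext

        if __name__ == "__main__":
            # the driver process runs main() and creates the SparkContext
            conf = SparkConf().setAppName("spark0402")
            sc = SparkContext(conf=conf)

            # the transformations/action below run as tasks on the executors
            counts = sc.parallelize(["hello", "world", "hello"]) \
                       .map(lambda w: (w, 1)) \
                       .reduceByKey(lambda a, b: a + b) \
                       .collect()
            print(counts)

            sc.stop()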
    

    Cluster manager // acquires and manages resources on the cluster

        An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
        spark-submit --master local[2] / spark://hadoop000:7077 / yarn
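
    A sketch of how the master URL picks the cluster manager, assuming PySpark; spark://hadoop000:7077 is the standalone master address already used above, and the app name is a placeholder:

        from pyspark import SparkConf, SparkContext

        # the master URL decides which cluster manager supplies the resources
        # (equivalent to spark-submit --master ...)
        conf = SparkConf().setAppName("master-demo")
        conf.setMaster("local[2]")                   # local mode with 2 threads
        # conf.setMaster("spark://hadoop000:7077")   # Spark standalone manager
        # conf.setMaster("yarn")                     # YARN

        sc = SparkContext(conf=conf)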
    

    Deploy mode

        Distinguishes where the driver process runs.
            In "cluster" mode, the framework launches the driver inside of the cluster (on YARN it runs inside the ApplicationMaster).
            In "client" mode, the submitter launches the driver outside of the cluster, on the node where the job was submitted.
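
    A sketch of the two deploy modes, following the spark-submit form already used above; spark0402.py is the example script named earlier:

        # client mode: the driver starts on the submitting node
        spark-submit --master yarn --deploy-mode client spark0402.py

        # cluster mode: the driver starts inside the cluster (in the YARN ApplicationMaster)
        spark-submit --master yarn --deploy-mode cluster spark0402.py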
    

    Worker node

        Any node that can run application code in the cluster
        standalone: the slave nodes listed in the slaves configuration file
        yarn: the NodeManager nodes
    

    Executor // a process running on a worker node: it runs tasks, caches data, and each application has its own independent set of executors

        A process launched for an application on a worker node
        runs tasks 
        keeps data in memory or disk storage across them
        Each application has its own executors.
    

    Resources are requested through the cluster manager; you can specify yarn, standalone, local, etc., and run the driver either in client mode or in cluster mode (the deploy mode).
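
    A sketch of requesting executor resources through the cluster manager at submit time; the flag values are placeholders, and --num-executors applies to YARN:

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --num-executors 4 \
          --executor-memory 2g \
          --executor-cores 2 \
          spark0402.py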

    Task // a unit of work, issued by the driver and sent over the network to an executor for execution

        A unit of work that will be sent to one executor    
    

    Job // a parallel computation made up of multiple tasks

        A parallel computation consisting of multiple tasks that
        gets spawned in response to a Spark action (e.g. save, collect);  // transformations are lazy: only when an action is hit does the computation become a job and run on the cluster
        you'll see this term used in the driver's logs.
        One action corresponds to one job.
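
    A sketch of lazy transformations with a job triggered only by an action, assuming an existing SparkContext sc; the input path is a placeholder:

        lines = sc.textFile("file:///tmp/input.txt")          # placeholder path
        words = lines.flatMap(lambda line: line.split())      # transformation: lazy, no job yet
        pairs = words.map(lambda w: (w, 1))                   # transformation: lazy, no job yet

        pairs.count()      # action -> triggers job 1
        pairs.collect()    # action -> triggers job 2 (one action, one job)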
    

    Stage

        Each job gets divided into smaller sets of tasks called stages
        that depend on each other
        (similar to the map and reduce stages in MapReduce);
        you'll see this term used in the driver's logs.
        A stage's boundary typically starts where data is read in and ends where a shuffle finishes.
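
    A sketch of one job splitting into stages at the shuffle, assuming an existing SparkContext sc; the input path is a placeholder:

        lines = sc.textFile("file:///tmp/input.txt")           # placeholder path
        pairs = lines.flatMap(lambda l: l.split()) \
                     .map(lambda w: (w, 1))                    # narrow operations: stay in stage 1
        counts = pairs.reduceByKey(lambda a, b: a + b)         # shuffle here: stage boundary, stage 2 begins
        counts.collect()                                       # 1 action -> 1 job -> 2 stages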
    

    An application consists of 1 driver and multiple executors. The executors run on worker nodes, and each executor runs a set of tasks sent over from the driver. Those tasks are triggered when a job is created, and a job is created when an action is encountered. A job is split into smaller task sets called stages, and a task is the smallest unit of execution. At runtime you pick a cluster manager (local, standalone, yarn, ...) and a deploy mode (client or cluster) that decides where the driver runs.

    A job is triggered by an action; a job may contain one or more stages; each stage contains a set of tasks, and those tasks run inside executors.

        Executors run on worker nodes.
    

    Spark Cache
    rdd.cache() / rdd.persist(StorageLevel)

    cache, like a transformation, is lazy: no job is submitted to Spark until an action is encountered.
    
    If an RDD may be reused in later computations, caching it is recommended.
    
    Under the hood, cache calls persist with the argument StorageLevel.MEMORY_ONLY,
    i.e. cache = persist(StorageLevel.MEMORY_ONLY)
    
    unpersist: executes immediately (eager)
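
    A sketch of cache/persist/unpersist behaviour, assuming an existing SparkContext sc; the input path is a placeholder:

        from pyspark import StorageLevel

        rdd = sc.textFile("file:///tmp/input.txt")    # placeholder path

        rdd.cache()                                   # lazy; same as persist(StorageLevel.MEMORY_ONLY)
        # rdd.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level instead of cache()

        rdd.count()      # first action: computes the RDD and fills the cache
        rdd.count()      # later actions reuse the cached data

        rdd.unpersist()  # eager: removes the cached data immediately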
    

    Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.

    Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD; this involves a shuffle.

    before the shuffle (2 partitions)                 after the shuffle (grouped by key)
    partition 1: (hello,1) (hello,1) (world,1)        hello -> (hello,1) (hello,1) (hello,1)
    partition 2: (hello,1) (world,1)                  world -> (world,1) (world,1)
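
    A sketch of a narrow vs. a wide dependency, assuming an existing SparkContext sc; the data is a placeholder:

        pairs = sc.parallelize([("hello", 1), ("hello", 1), ("world", 1),
                                ("hello", 1), ("world", 1)], 2)

        doubled = pairs.mapValues(lambda v: v * 2)       # narrow dependency: each parent partition feeds one child partition, no shuffle
        summed = pairs.reduceByKey(lambda a, b: a + b)   # wide dependency: parent partitions feed multiple child partitions, shuffle
        summed.collect()                                 # e.g. [('hello', 3), ('world', 2)]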
    
