The Mission of SparkContext
The sc we use all over our Spark programs is exactly this object; you could call it the soul of the product manager. Like a product manager, it manages all kinds of tasks and data, gathers the team's intelligence and redistributes it, lays off the workers caught slacking, and hands their jobs to the new hires.

sc's main responsibilities
- Start SparkEnv (its structure was diagrammed earlier). It contains a large number of services that take care of distributed state management.
- Start the DAGScheduler, which takes the internal structure of a chain of RDDs and turns it into a blueprint for task execution.
- Start the TaskScheduler, which keeps an eye on the state of every Executor so that all jobs keep running on those workers.
- Maintain all the configuration, and ship the global variables (broadcast variables) and global counters (accumulators) that tasks need to the Executors.
- Maintain the state of RDDs and act as the entry point for creating them, either reading from a file system or relying on third-party packages such as the Spark-Kafka connector to read from other stores. (See the sketch after this list.)
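A minimal sketch of this flow, assuming Spark 2.x+ and Scala; the app name, local master URL and the toy job are illustrative placeholders, not anything from the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ScSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder app name and local master URL; in a real deployment the
    // master URL would point at YARN, Mesos or a standalone cluster.
    val conf = new SparkConf()
      .setAppName("sc-sketch")
      .setMaster("local[2]")

    // Constructing SparkContext also brings up SparkEnv, DAGScheduler and TaskScheduler.
    val sc = new SparkContext(conf)

    // sc is the entry point for RDDs: here from an in-memory collection;
    // sc.textFile("hdfs://...") would read from a file system instead.
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Running an action: DAGScheduler cuts it into stages, TaskScheduler ships the tasks to executors.
    val total = rdd.map(_ * 2).reduce(_ + _)
    println(s"total = $total")

    sc.stop()
  }
}
```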
What sc is made of
- Application Status
  - SparkEnv
  - SparkConf
  - deployment environment (as master URL)
  - application name
  - unique identifier of execution attempt
  - deploy mode
  - default level of parallelism
  - Spark user
  - the time (in milliseconds) when SparkContext was created
  - URL of web UI
  - Spark version
  - storage status
- Setting Configuration
  - master URL
  - Local Properties (Creating Logical Job Groups)
  - Setting Local Properties to Group Spark Jobs
  - Default Logging Level
- Creating Distributed Entities (sketched below)
  - RDDs
  - Accumulators
  - Broadcast variables
- Many services
  - BlockManager
  - ShuffleManager
  - ...
- Running jobs synchronously
- Submitting jobs asynchronously (sketched below, together with job groups and cancellation)
- Cancelling a job
- Cancelling a stage
- Assigning custom Scheduler Backend, TaskScheduler and DAGScheduler
- Closure cleaning
- Accessing persistent RDDs (sketched below)
- Unpersisting RDDs, i.e. marking RDDs as non-persistent
- Registering SparkListener (sketched below)
- Programmable Dynamic Allocation (sketched below)
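A minimal sketch of the "Creating Distributed Entities" bullets, assuming a Spark 2.x+ sc is already in scope; the helper name and sample data are made up for illustration.

```scala
import org.apache.spark.SparkContext

def distributedEntities(sc: SparkContext): Unit = {
  // RDD from a local collection
  val nums = sc.parallelize(1 to 10)

  // Broadcast variable: read-only data shipped once per executor
  val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))

  // Accumulator: a global counter that tasks add to and only the driver reads
  val matched = sc.longAccumulator("matched")

  nums.foreach { n =>
    if (lookup.value.contains(n)) matched.add(1L)
  }

  println(s"matched = ${matched.value}")  // read on the driver after the action
}
```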
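A minimal sketch of job groups, synchronous vs. asynchronous execution and cancellation, again assuming a running sc; the group id, description and the immediate cancel call are only there to show the API surface, not a realistic workflow.

```scala
import org.apache.spark.SparkContext

def jobControl(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(1 to 1000000, 8)

  // Local property: every job submitted from this thread joins the group
  sc.setJobGroup("nightly-report", "nightly report jobs", interruptOnCancel = true)

  // Synchronous: count() blocks until the job finishes
  val n = rdd.count()
  println(s"synchronous count = $n")

  // Asynchronous: countAsync() returns a FutureAction right away
  val pending = rdd.countAsync()

  // From any thread, the whole group (including the pending job) can be cancelled
  sc.cancelJobGroup("nightly-report")

  sc.clearJobGroup()
}
```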
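A minimal sketch of accessing and unpersisting cached RDDs; the helper name and data are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def cacheHousekeeping(sc: SparkContext): Unit = {
  val cached = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
  cached.count()  // materialize the cache

  // sc tracks every persisted RDD by id
  sc.getPersistentRDDs.foreach { case (id, rdd) =>
    println(s"persistent RDD $id storage level: ${rdd.getStorageLevel}")
  }

  cached.unpersist(blocking = true)  // mark it non-persistent again
}
```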
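A minimal sketch of registering a SparkListener; the listener class below is a made-up example that just logs job completions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended with ${jobEnd.jobResult}")
}

def registerListener(sc: SparkContext): Unit =
  sc.addSparkListener(new JobEndLogger)  // addSparkListener is a @DeveloperApi method
```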
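A minimal sketch of programmable dynamic allocation; the executor counts and ids are illustrative, and these @DeveloperApi calls only take effect on coarse-grained cluster managers (for example YARN), not in local mode.

```scala
import org.apache.spark.SparkContext

def scaleExecutors(sc: SparkContext): Unit = {
  // Ask the cluster manager for two more executors
  sc.requestExecutors(numAdditionalExecutors = 2)

  // Give one back by executor id
  sc.killExecutors(Seq("3"))
}
```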