![](https://img.haomeiwen.com/i13764477/63b3c10368cf3818.png)
![](https://img.haomeiwen.com/i13764477/f604f2f258fdeb4a.png)
- The driver core runs on the Java side (as a JVM process).
![](https://img.haomeiwen.com/i13764477/6a69e963aa124a3c.png)
![](https://img.haomeiwen.com/i13764477/1945c9586042b8cb.png)
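- Submission flow: a script command submits the job from the client to the cluster; the resource manager starts the application master, which is a Java process. The Java process requests resources, the resources are allocated and the executors are created and report back to the Java process; the task code is then executed in a distributed fashion on the workers.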
![](https://img.haomeiwen.com/i13764477/d8c48cd8697b2fb1.png)
![](https://img.haomeiwen.com/i13764477/846e616df2f215ba.png)
![](https://img.haomeiwen.com/i13764477/e53914060f23a7fe.png)
![](https://img.haomeiwen.com/i13764477/f4d72a37ee61c2a1.png)
![](https://img.haomeiwen.com/i13764477/eb419203a0184a40.png)
- Action operators determine jobs: each action triggers one job.
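
As a rough illustration of how an action (and not a transformation) triggers a job, here is a pure-Python model. `LazyDataset` and its `jobs_run` counter are hypothetical names invented for this sketch; it is not the Spark API, only a minimal imitation of lazy transformations under that assumption.

```python
class LazyDataset:
    """Toy model: transformations are recorded, actions run a 'job'."""
    jobs_run = 0  # counts how many actions have been executed

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    # Transformations: only record the step, nothing is computed yet.
    def map(self, fn):
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self.source, self.steps + (("filter", pred),))

    def _run_job(self):
        LazyDataset.jobs_run += 1
        data = iter(self.source)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    # Actions: each call executes the recorded steps as one job.
    def collect(self):
        return self._run_job()

    def count(self):
        return len(self._run_job())


ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
jobs_before_action = LazyDataset.jobs_run  # still 0: no action called yet
result = ds.collect()                      # first action, first job
n = ds.count()                             # second action, second job
```

Two transformations were chained, but the job counter only moves when `collect` and `count` are called.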
![](https://img.haomeiwen.com/i13764477/d4bf0a5eecf76214.png)
- Data is repartitioned by key.
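
Repartitioning by key can be sketched as hashing each key modulo the partition count. The snippet below is a pure-Python model under that assumption; `partition_for` and `partition_by_key` are hypothetical helpers, and `zlib.crc32` is used only to get a hash that is stable across runs, not because Spark uses it.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash of the key modulo the number of partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by_key(records, num_partitions):
    """Place every (key, value) record into the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = partition_by_key(records, 3)
```

All records that share a key end up in the same partition, which is what lets a downstream task aggregate that key on its own.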
![](https://img.haomeiwen.com/i13764477/0fa589553b2f14db.png)
![](https://img.haomeiwen.com/i13764477/9236e889adace8c2.png)
- task → reduce task
![](https://img.haomeiwen.com/i13764477/e739c0f9bd272135.png)
- A single worker may produce many small files.
![](https://img.haomeiwen.com/i13764477/907d4a40fc6e3f7f.png)
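- With file consolidation, the number of files on a worker is governed by the number of CPU cores rather than the number of tasks: tasks that run on the same core reuse that core's files and append to them.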
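
The difference in file counts can be made concrete with a little arithmetic. This sketch assumes the usual hash-shuffle accounting (one file per reduce partition for each writer); the function names and the example numbers are made up for illustration.

```python
def hash_shuffle_files(map_tasks: int, reduce_tasks: int) -> int:
    """Without consolidation: every map task writes one file per reduce task."""
    return map_tasks * reduce_tasks


def consolidated_shuffle_files(cores: int, reduce_tasks: int) -> int:
    """With consolidation: each core keeps one file per reduce task and
    later map tasks on that core append to the same files."""
    return cores * reduce_tasks


# Hypothetical worker: 100 map tasks, 4 cores, 10 reduce tasks.
plain = hash_shuffle_files(100, 10)                # 1000 files
consolidated = consolidated_shuffle_files(4, 10)   # 40 files
```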
![](https://img.haomeiwen.com/i13764477/e140b63e6e788b4a.png)
- The in-memory sort by key is skipped.
![](https://img.haomeiwen.com/i13764477/90739316c5d4d944.png)
- `reduceByKey` = `groupByKey` + a map operation
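
The equivalence between `reduceByKey` and `groupByKey` followed by a map can be checked on a small example. This is a pure-Python model, not Spark code; `group_by_key` and `reduce_by_key` are hypothetical stand-ins for the operators.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def group_by_key(records):
    """Collect all values of each key into a list."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(records, func):
    """Fold values of each key as they arrive, never holding the full list."""
    result = {}
    for key, value in records:
        result[key] = func(result[key], value) if key in result else value
    return result


records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey, then a map over each group that reduces its values.
via_group = {k: reduce(add, vs) for k, vs in group_by_key(records).items()}
# reduceByKey in one step.
via_reduce = reduce_by_key(records, add)
```

Both paths give the same result; the one-step version only keeps one running value per key instead of the whole list of values.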
![](https://img.haomeiwen.com/i13764477/d779ee50c7f90ae7.png)
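- Increasing the number of partitions triggers a shuffle: `repartition`
- Decreasing the number of partitions: `coalesce`
- Spark's default is 200 partitions (the `spark.sql.shuffle.partitions` parameter). Ways to change the partitioning: through operators, or by changing the parameter in `SparkConf`.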
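
The contrast between `coalesce` and `repartition` can be modeled in plain Python. The functions below are hypothetical stand-ins that share only their names with the Spark operators: the first merges whole partitions without moving individual records elsewhere, the second redistributes every record (round-robin here, as a simplified picture of a full shuffle).

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions into at most n groups (no shuffle)."""
    n = min(n, len(partitions))  # without a shuffle the count cannot grow
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record across n partitions (models a shuffle)."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


parts = [[1, 2], [3], [4, 5, 6], [7]]
fewer = coalesce(parts, 2)      # whole partitions are glued together
more = repartition(parts, 8)    # records are spread over 8 partitions
unchanged = coalesce(parts, 8)  # asking for more without a shuffle keeps 4
```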
![](https://img.haomeiwen.com/i13764477/01d66b9b60c3179b.png)
![](https://img.haomeiwen.com/i13764477/da6a0393bac94181.png)
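- Shared variables: a broadcast variable distributes a variable from the Java side to the workers, where it is read-only (compare Redis running as a single node or as a cluster); accumulators are writable.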
![](https://img.haomeiwen.com/i13764477/48e074ba3d5484a6.png)
![](https://img.haomeiwen.com/i13764477/c9f48521f3c9a2aa.png)
![](https://img.haomeiwen.com/i13764477/220c73f0f305111f.png)
![](https://img.haomeiwen.com/i13764477/594252f986f3faad.png)
![](https://img.haomeiwen.com/i13764477/0cc0d0748e374508.png)
![](https://img.haomeiwen.com/i13764477/57e1ef1ba8d912dd.png)
- Checkpointing creates a small-files problem; offsets are kept in ZooKeeper; the write-ahead log accesses the NameNode frequently.
![](https://img.haomeiwen.com/i13764477/9887d82a2a0909b3.png)
- Offsets are kept in Kafka.
![](https://img.haomeiwen.com/i13764477/45aacefe85c887b6.png)
- Window functions rely on checkpointing: each result is built on top of the accumulated historical data.
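
Building a window result on top of the previous one, instead of recomputing it, can be sketched as "add the batch that enters, subtract the batch that leaves". The class below is a hypothetical pure-Python model of that idea; the retained `window` and `total` stand for the historical state that a streaming job would need to checkpoint.

```python
from collections import deque


class SlidingWindowSum:
    """Keeps a running sum over the last `window_batches` batches."""

    def __init__(self, window_batches: int):
        self.window_batches = window_batches
        self.window = deque()  # per-batch sums still inside the window
        self.total = 0         # state carried over from earlier batches

    def add_batch(self, batch_sum: int) -> int:
        # Add the new batch on top of the historical total.
        self.window.append(batch_sum)
        self.total += batch_sum
        # Subtract the batch that has slid out of the window.
        if len(self.window) > self.window_batches:
            self.total -= self.window.popleft()
        return self.total


w = SlidingWindowSum(window_batches=3)
totals = [w.add_batch(b) for b in [1, 2, 3, 4, 5]]
```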
![](https://img.haomeiwen.com/i13764477/29b172f563461dc2.png)
![](https://img.haomeiwen.com/i13764477/c2bae45a5c5c6002.png)
- Aggregate by key, then `sortBy`
- Aggregate by key, partition, then sort iteratively.
![](https://img.haomeiwen.com/i13764477/debf9d3d27144c69.png)
![](https://img.haomeiwen.com/i13764477/897e11930638fddc.png)
- Works on the existing partitions without affecting the other partitions.
![](https://img.haomeiwen.com/i13764477/fc7ba5a80ba1c611.png)
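- DataFrame: when memory is insufficient, data spills to disk. RDD: when memory is insufficient, whatever fits stays in memory and the rest is pulled again from the original source.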
![](https://img.haomeiwen.com/i13764477/19e17c13a058b496.png)
![](https://img.haomeiwen.com/i13764477/46c17bd92e146826.png)
![](https://img.haomeiwen.com/i13764477/a44f7d9b409e82b4.png)
- The small table is collected on the Java side and broadcast to the workers, so the join runs locally on each worker.
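
A broadcast join of a small table against a large one can be pictured as each partition of the large table probing a local copy of the small table. This is a pure-Python sketch under that assumption; `broadcast_join` and the sample data are hypothetical, and the `lookup` dict stands in for the broadcast value.

```python
def broadcast_join(large_partitions, small_table):
    """Join every partition of the large table against a local lookup dict."""
    lookup = dict(small_table)  # the copy each worker would receive
    joined = []
    for part in large_partitions:
        # The join is a local dictionary probe; the large table never moves.
        joined.append([(k, v, lookup[k]) for k, v in part if k in lookup])
    return joined


small = [(1, "apple"), (2, "banana")]
large = [[(1, 10), (3, 30)], [(2, 20), (1, 11)]]
result = broadcast_join(large, small)
```

The large table keeps its partitioning, so no shuffle of the large side is needed.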
![](https://img.haomeiwen.com/i13764477/4d37300d5a22a8bb.png)
![](https://img.haomeiwen.com/i13764477/663950e639d38ac8.png)
![](https://img.haomeiwen.com/i13764477/ccfd2c9903836911.png)
- Data skew makes a particular task run slowly.
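
Why skew slows down a single task can be seen by counting records per partition when one key dominates. The snippet is a pure-Python model with made-up data; `partition_sizes` is a hypothetical helper, and `zlib.crc32` is used only as a stable hash.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition when hashing by key."""
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]


# One hot key with 1000 records, plus 100 keys with one record each.
keys = ["hot"] * 1000 + [f"user{i}" for i in range(100)]
sizes = partition_sizes(keys, 4)
```

Every record of the hot key lands in the same partition, so the task that owns it handles at least 1000 of the 1100 records while the other tasks finish quickly.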
- Shut down only after the data has finished processing.