Creating an RDD from a collection: parallelize
![](https://img.haomeiwen.com/i6231724/1e43bec8738bb079.png)
![](https://img.haomeiwen.com/i6231724/eb653392b13f1fb5.png)
Creating an RDD from external data: textFile
![](https://img.haomeiwen.com/i6231724/9ae4c64fcd276431.png)
![](https://img.haomeiwen.com/i6231724/f3205bdecdeb272d.png)
map: applies a function to every element
![](https://img.haomeiwen.com/i6231724/c322a721c42d2f73.png)
![](https://img.haomeiwen.com/i6231724/ccb3cfc47a999e83.png)
filter: keeps only the elements that satisfy a predicate
![](https://img.haomeiwen.com/i6231724/9d3f7544d996113a.png)
![](https://img.haomeiwen.com/i6231724/faa33817370aa9c5.png)
flatMap: maps each element to zero or more elements and flattens the results
![](https://img.haomeiwen.com/i6231724/f337b8b39fb16d67.png)
![](https://img.haomeiwen.com/i6231724/7fb60e80ef1ee72a.png)
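The difference between map and flatMap is easy to miss in screenshots, so here is a minimal plain-Python sketch of the two behaviours. It only models the semantics with ordinary lists; it does not call Spark, and the sample `lines` are made up for illustration.

```python
from itertools import chain

# Hypothetical input: each element is one line of text
lines = ["hello spark", "hello rdd"]

# map-style: exactly one output element per input element,
# so the result is a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-style: each input yields a sequence, and all sequences
# are flattened into a single collection
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With map the number of elements never changes, while flatMap can grow or shrink it, which is why it is the usual choice for splitting lines into words.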
mapPartitions: processes the data of each partition as a whole
![](https://img.haomeiwen.com/i6231724/24fcd4251dfc9e70.png)
![](https://img.haomeiwen.com/i6231724/1f4206234a248f4f.png)
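To make "per partition" concrete, the following plain-Python sketch models an RDD as a list of partitions. This is an assumption made for illustration only: `partitions` and `sum_partition` are hypothetical names, not Spark API.

```python
# Hypothetical data that has already been split into two partitions
partitions = [[1, 2, 3], [4, 5, 6]]

def sum_partition(iterator):
    # Called once per partition; receives an iterator over the whole
    # partition and may yield any number of results
    yield sum(iterator)

# map-style: the function runs once per element (6 calls here)
per_element = [[x * 2 for x in part] for part in partitions]

# mapPartitions-style: the function runs once per partition (2 calls here)
per_partition = [list(sum_partition(iter(part))) for part in partitions]

print(per_element)
print(per_partition)
```

Running the function once per partition is what makes this operator useful when there is a fixed setup cost, such as opening a connection, that should not be paid for every element.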
sample: draws a random sample of the data
![](https://img.haomeiwen.com/i6231724/12caf1e1a980a56b.png)
union: merges two datasets
![](https://img.haomeiwen.com/i6231724/577599a4d4d60ecc.png)
intersection: returns the intersection of two datasets
![](https://img.haomeiwen.com/i6231724/a83fff30f7f6470b.png)
distinct: removes duplicate elements from a dataset
![](https://img.haomeiwen.com/i6231724/b843fe892796b5f5.png)
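One detail worth spelling out: Spark's union keeps duplicates, whereas intersection and distinct do not. The sketch below models that with plain Python lists and sets; the sample values are made up, and the sorting is only there to make the printed output stable.

```python
a = [1, 2, 2, 3]
b = [3, 3, 4]

# union-style: simple concatenation, duplicates are kept
union = a + b

# intersection-style: elements present in both, without duplicates
intersection = sorted(set(a) & set(b))

# distinct-style: each element appears once
distinct = sorted(set(union))

print(union)
print(intersection)
print(distinct)
```

If a set-like union is wanted, the usual pattern is union followed by distinct.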
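groupByKey: groups values by key
groupByKey([numTasks]) groups the values for each key and returns a dataset of (K, Iterable[V]) pairs. Older Spark documentation states that 8 parallel tasks are used by default; in recent versions the default parallelism follows the number of partitions of the parent RDD. Pass numTasks to set the number of tasks explicitly.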
reduceByKey: aggregates the values of each key with a reduce function
![](https://img.haomeiwen.com/i6231724/c4a596dd4f0788ef.png)
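groupByKey and reduceByKey are often confused, so here is a small plain-Python model of what each one computes. The helper names `group_by_key` and `reduce_by_key` are hypothetical and only mimic the semantics on an in-memory list of pairs; they are not Spark calls.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    # Collect every value for a key into one sequence
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return dict(grouped)

def reduce_by_key(func, pairs):
    # Fold the values of each key into a single value with func
    result = {}
    for k, v in pairs:
        result[k] = func(result[k], v) if k in result else v
    return result

grouped = group_by_key(pairs)
reduced = reduce_by_key(lambda x, y: x + y, pairs)

print(grouped)
print(reduced)
```

When only an aggregate such as a sum is needed, reduceByKey is generally preferred in Spark because values are combined within each partition before the shuffle.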
combineByKey: aggregates the data in an RDD by key
![](https://img.haomeiwen.com/i6231724/6d8fc66aa6760095.png)
![](https://img.haomeiwen.com/i6231724/e5df1bae302c7f70.png)
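combineByKey takes three functions (createCombiner, mergeValue, mergeCombiners), and how they interact is clearer in code. The sketch below is a plain-Python model under the assumption that the data is already split into two partitions; `combine_by_key` is a hypothetical helper, not the Spark method, and it computes a per-key average as the example.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Step 1: within each partition, build one combiner per key
    per_partition = []
    for part in partitions:
        combiners = {}
        for k, v in part:
            if k in combiners:
                combiners[k] = merge_value(combiners[k], v)
            else:
                combiners[k] = create_combiner(v)
        per_partition.append(combiners)

    # Step 2: merge the combiners of the same key across partitions
    merged = {}
    for combiners in per_partition:
        for k, c in combiners.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

# Hypothetical (key, value) data in two partitions
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 5), ("b", 4)]]

sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                          # first value of a key: (sum, count)
    lambda c, v: (c[0] + v, c[1] + 1),         # fold another value into (sum, count)
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # merge two (sum, count) pairs
)
averages = {k: s / n for k, (s, n) in sum_count.items()}

print(sum_count)
print(averages)
```

The combiner type, here a (sum, count) tuple, can differ from the value type, which is the main thing combineByKey offers over reduceByKey.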
sortByKey: sorts the data by key
![](https://img.haomeiwen.com/i6231724/f2a2092a98e8a245.png)
join: joins two datasets by key, producing (K, (V, W)) pairs
cogroup: groups two datasets by key, producing (K, (Iterable[V], Iterable[W])) tuples
![](https://img.haomeiwen.com/i6231724/12d8b1b130309be4.png)
![](https://img.haomeiwen.com/i6231724/013ae471ecf33b9b.png)
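The result shapes of join and cogroup differ in a way that matters for keys present on only one side. Here is a plain-Python sketch of both; `cogroup` and `join` below are hypothetical helpers over in-memory lists that only model the semantics, and the sample data is invented.

```python
from collections import defaultdict

# Hypothetical keyed datasets
left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

def cogroup(left, right):
    # For every key on either side, collect the values from each side
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, w in right:
        groups[k][1].append(w)
    return dict(groups)

def join(left, right):
    # Inner join: one (k, (v, w)) pair for every combination of values
    # under a key that appears on both sides
    return [
        (k, (v, w))
        for k, (vs, ws) in cogroup(left, right).items()
        for v in vs
        for w in ws
    ]

cogrouped = cogroup(left, right)
joined = join(left, right)

print(cogrouped)
print(joined)
```

cogroup keeps keys that exist on only one side, with an empty sequence for the other, while join drops them.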
cartesian: computes the Cartesian product of two datasets
subtract: removes from one dataset the elements that also appear in the other
![](https://img.haomeiwen.com/i6231724/3ea02372f60330dd.png)
![](https://img.haomeiwen.com/i6231724/8861599e24e5e7f0.png)
zip: pairs up the elements of two datasets position by position
![](https://img.haomeiwen.com/i6231724/9856a732d379ae58.png)
coalesce(numPartitions): repartitions an RDD without a shuffle by default, which suits reducing the number of partitions; repartition(numPartitions) always performs a shuffle
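The shuffle difference can be pictured with a toy model. The sketch below is a deliberate simplification and an assumption on my part: it treats an RDD as a list of partitions, merges whole partitions for coalesce, and redistributes elements one by one for repartition. Spark's real partition assignment is more involved; only the "whole partitions move together" versus "every element may move" contrast is the point here.

```python
def coalesce(partitions, n):
    # No-shuffle model: whole partitions are merged together,
    # so elements stay with their original partition-mates
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    # Shuffle model: every element is redistributed individually
    # (round-robin here, purely for illustration)
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for x in part:
            out[i % n].append(x)
            i += 1
    return out

# Hypothetical RDD with four partitions
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

coalesced = coalesce(partitions, 2)
repartitioned = repartition(partitions, 2)

print(coalesced)
print(repartitioned)
```

Because coalesce avoids moving individual elements, it is cheaper, but without a shuffle it can only reduce the partition count.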
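reduce: aggregates all elements of the dataset with a binary function, unlike the ByKey operations, which aggregate only the values that share a key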
![](https://img.haomeiwen.com/i6231724/0ce214a0fb051912.png)
takeSample: returns an array of num randomly sampled elements
![](https://img.haomeiwen.com/i6231724/9ac6d86f46837a84.png)
takeOrdered(n, [ordering]): returns an array of the first n elements, using either their natural order or a custom ordering
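Since takeOrdered is about ordering rather than randomness, a short plain-Python analogue may help. It uses `heapq.nsmallest` from the standard library on an in-memory list as a stand-in; the `data` values are made up and no Spark call is involved.

```python
import heapq

# Hypothetical dataset
data = [7, 2, 9, 4, 1]

# Natural order: the 3 smallest elements, in ascending order
smallest = heapq.nsmallest(3, data)

# Custom ordering: negate the key to get the 3 largest instead
largest = heapq.nsmallest(3, data, key=lambda x: -x)

print(smallest)
print(largest)
```

The result is deterministic for a given dataset and ordering, which is the contrast with takeSample.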
countByKey: counts the number of elements for each key
![](https://img.haomeiwen.com/i6231724/46765f05eb388aab.png)
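As a rough picture of what countByKey returns, here is a plain-Python equivalent built on `collections.Counter`. The `pairs` list is invented sample data, and this is only a model of the result, not a Spark call.

```python
from collections import Counter

# Hypothetical (key, value) pairs
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Count how many elements carry each key; the values are ignored
counts = dict(Counter(k for k, _ in pairs))

print(counts)
```

Note that the count is of elements per key, not a sum of their values.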