reference: https://www.tutorialspoint.com/apache_spark/apache_spark_rdd.htm
![](https://img.haomeiwen.com/i6560153/c0b92ed36e0defbb.png)
The bottleneck in MapReduce:
Between consecutive MapReduce jobs, intermediate data is first written out to a stable storage system (HDFS) and then read back in by the next job. Each handoff is effectively a round trip through disk, which is very slow; a commonly cited figure is that Hadoop jobs spend up to 90% of their time on this kind of I/O.
![](https://img.haomeiwen.com/i6560153/8914b834e1e4771a.png)
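To make the bottleneck above concrete, here is a minimal sketch that mimics the MapReduce handoff pattern using Spark's Scala API; the HDFS paths, app name, and the threshold of 100 are all hypothetical. Job 1 writes its result to HDFS, and job 2 must read it back off disk before it can continue:

```scala
import org.apache.spark.sql.SparkSession

object HdfsHandoffDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsHandoffDemo")       // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Job 1: map + reduce, then write the result to stable storage (HDFS).
    sc.textFile("hdfs:///input/logs")                 // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .map { case (w, n) => s"$w\t$n" }
      .saveAsTextFile("hdfs:///tmp/job1-output")      // hypothetical intermediate path

    // Job 2: must read the intermediate result back off disk before it can continue.
    val hot = sc.textFile("hdfs:///tmp/job1-output")
      .map(_.split("\t"))
      .filter(_(1).toInt > 100)                       // hypothetical threshold
    println(hot.count())

    spark.stop()
  }
}
```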
The following figures show the structure of MapReduce:
![](https://img.haomeiwen.com/i6560153/a64fb3d5957cb798.png)
![](https://img.haomeiwen.com/i6560153/296d753bae7129a0.png)
The key abstraction in Spark is the RDD (Resilient Distributed Dataset), which supports in-memory processing: intermediate results are kept in distributed memory and are spilled to disk only when the data is too large to fit.
![](https://img.haomeiwen.com/i6560153/c3db81ce674c12e6.png)
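A minimal sketch of this behavior, again in Spark's Scala API (paths and app name are hypothetical): `persist(StorageLevel.MEMORY_AND_DISK)` keeps the intermediate RDD in distributed memory, spilling partitions to local disk only when they don't fit, so later actions reuse the cached result instead of re-reading HDFS:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddCacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddCacheDemo")          // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///input/logs")    // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Keep the intermediate result in distributed memory; partitions that
    // don't fit are spilled to local disk instead of being written to HDFS.
    counts.persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions below reuse the cached RDD rather than recomputing it
    // from the input files.
    println(counts.count())
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```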