【Spark Java API】Transformation(4

Author: 小飞_侠_kobe | Published: 2016-02-03 19:15 | Views: 651

coalesce

Official documentation:

Return a new RDD that is reduced into `numPartitions` partitions.

Function prototypes:

def coalesce(numPartitions: Int): JavaRDD[T]

def coalesce(numPartitions: Int, shuffle: Boolean): JavaRDD[T]

Source code analysis:

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}

**
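The source shows that when shuffle = false, no shuffle takes place, so the problem reduces to deciding which partitions of the parent RDD can be merged together; the parent partitions are grouped so that the result has the requested numPartitions partitions.
When shuffle = true, a shuffle is performed. Each record in a partition is first turned into a key-value pair: the key is a counter that starts at (new Random(index)).nextInt(numPartitions) and is incremented by 1 for every record (so the first key is that value + 1), and the value is the record itself. Here index is the index of the parent partition and numPartitions is the number of partitions in the CoalescedRDD. Because HashPartitioner takes the key modulo numPartitions, the consecutive keys spread the records evenly across the partitions of the resulting ShuffledRDD. The grouping logic of CoalescedRDD then links the partitions of the ShuffledRDD to its own, and finally the keys are dropped with .values, giving the coalesced result.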
**
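The key assignment on the shuffle = true path can be tried out without a cluster. The following is a minimal plain-Java sketch, not Spark code: the class and method names are hypothetical, `java.util.Random` stands in for the `Random` used in the quoted source, and `Math.floorMod` is assumed to match what HashPartitioner does with an Int key (the key hashes to itself and is reduced modulo the partition count).

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class for illustration; not part of Spark.
class CoalesceShuffleSketch {

    /**
     * Mimics distributePartition followed by HashPartitioner for one parent
     * partition: returns, for each record, the output partition it lands in.
     */
    static int[] targetPartitions(int parentIndex, int recordCount, int numPartitions) {
        // Random starting position, seeded by the parent partition index.
        int position = new Random(parentIndex).nextInt(numPartitions);
        int[] targets = new int[recordCount];
        for (int i = 0; i < recordCount; i++) {
            position = position + 1; // key assigned to this record
            // Assumption: an Int key hashes to itself, then non-negative modulo.
            targets[i] = Math.floorMod(position, numPartitions);
        }
        return targets;
    }

    /** Counts how many records of one parent partition go to each output partition. */
    static int[] countsPerPartition(int parentIndex, int recordCount, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int target : targetPartitions(parentIndex, recordCount, numPartitions)) {
            counts[target]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 7 records from parent partition 0, spread over 2 output partitions.
        System.out.println(Arrays.toString(countsPerPartition(0, 7, 2)));
    }
}
```

Because the keys are consecutive integers, records from one parent partition are dealt out round-robin, so the per-partition counts differ by at most one. The random start only decides which output partition receives the first record.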

Example:

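import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Assumes javaSparkContext is an already-initialized JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
// shuffle defaults to false
JavaRDD<Integer> coalesceRDD = javaRDD.coalesce(2);
// Prints the RDD's description, not its elements.
System.out.println(coalesceRDD);

// shuffle = true: the data is redistributed through a shuffle
JavaRDD<Integer> coalesceRDD1 = javaRDD.coalesce(2, true);
System.out.println(coalesceRDD1);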

Note:

**
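coalesce() can adjust the number of partitions of the parent RDD, for example reducing it from 5 to 3 or increasing it from 5 to 10. Note that when shuffle = false the number of partitions cannot be increased (that is, 5 cannot become 10); the RDD simply keeps its current number of partitions.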
**
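The rule in the note above can be written down in a few lines. This is a simplified plain-Java sketch with hypothetical names, not Spark's implementation: the real CoalescedRDD also takes data locality into account when grouping, while this sketch assumes no locality information and splits the parent partitions into contiguous index ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of Spark.
class CoalescePartitionCountSketch {

    /** Partition count after coalesce(requested, shuffle) on an RDD with `parent` partitions. */
    static int resultingPartitions(int parent, int requested, boolean shuffle) {
        // With a shuffle any target count is reachable; without one,
        // partitions can only be merged, never split.
        return shuffle ? requested : Math.min(parent, requested);
    }

    /**
     * Simplified grouping for shuffle = false, ignoring data locality:
     * parent partition indices are split into contiguous ranges.
     */
    static List<List<Integer>> groupParents(int parent, int requested) {
        int groups = Math.min(parent, requested);
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            int start = (int) ((long) i * parent / groups);
            int end = (int) ((long) (i + 1) * parent / groups);
            List<Integer> group = new ArrayList<>();
            for (int j = start; j < end; j++) {
                group.add(j);
            }
            result.add(group);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resultingPartitions(5, 3, false));  // prints 3
        System.out.println(resultingPartitions(5, 10, false)); // prints 5
        System.out.println(resultingPartitions(5, 10, true));  // prints 10
        System.out.println(groupParents(5, 3)); // prints [[0], [1, 2], [3, 4]]
    }
}
```

In this sketch, going from 5 to 3 partitions without a shuffle merges whole parent partitions, which is why no data needs to move between groups. Asking for 10 leaves each of the 5 parent partitions in its own group.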

repartition

Official documentation:

Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. 
Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider using `coalesce`, which can avoid performing a shuffle.

**
In particular, if you are using repartition to reduce the number of partitions of an RDD, you can use coalesce with shuffle set to false instead, which avoids the shuffle and is more efficient.
**

Function prototype:

def repartition(numPartitions: Int): JavaRDD[T]

Source code analysis:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

**
The source shows that repartition is equivalent to coalesce(numPartitions, shuffle = true).
**

Example:

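import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Assumes javaSparkContext is an already-initialized JavaSparkContext.
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
// Equivalent to coalesce(numPartitions, shuffle = true)
JavaRDD<Integer> repartitionRDD = javaRDD.repartition(2);
System.out.println(repartitionRDD);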


Article link: https://www.haomeiwen.com/subject/ckrekttx.html
