coalease 和 repartition的区别

作者: pcqlegend | 来源:发表于2018-06-08 14:47 被阅读0次

coalease 和 repartition的区别
coalesce()方法和repartition()方法的区别
Spark从入门到精通63:Dataset的typed操作
189、Spark 2.0之Dataset开发详解-typed操
spark中repartition与coalesce的区别
repartition和coalesce
161、Spark内核原理进阶之repartition算子内部实
Spark中repartition和coalesce的用法
（二十）coalesce和repartition
repartition 和 coalesce算子

coalesce 英文翻译是联合合并

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope

官方解释：

返回一个被reduce 成 numPartitions(入参是numPartitions) 数量的partition的新的RDD
这将生成一个窄依赖。举个例子，如果你要将一个RDD从1000个分区转化到100个分区，并不会产生shuffle操作，100个新partition的任何一个需要当前的10个分区。
但是如果你要进行的是一个比较极端的coalesce,比如设置numPartitions = 1,与你希望的不同，这将会导致计算发生在少数的几个节点上（比如 numPartitions = 1的话就只有一个节点）。为了避免这个情况，你可以设置参数传入 shuffle = true，这将会增加一个shuffle的操作，这就意味着，当前的分区将会并行的计算。
注意：设置shuffle = true, 可以合并成一个数量很多的partition，比如你有大量的分区，比如100，但是有一些分区的数据量非常大。调用 coalesce(1000, shuffle = true) 则会使用默认的hash分区器，将所有的分区分发成1000个分区。

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition

创建指定数量分区的RDD，可以增加或者减少这个RDD的并行度，内部就是对数据进行shuffle操作。如果你想要减少这个分区的分区数，你可以考虑使用coalesce 方法，它可以避免一个shuffle操作。

所以 repartition 就是shuffle为true的coalesce

网友评论

本文标题：coalease 和 repartition的区别

本文链接：https://www.haomeiwen.com/subject/nwhasftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

coalease 和 repartition的区别

coalesce 英文翻译是联合合并

官方解释：

repartition

相关文章