How Spark Determines the Number of Reducers in a Shuffle


Author: liuzx32 | Published 2019-01-07 14:36

A Spark shuffle corresponds to the operators that trigger a shuffle at runtime, such as join, repartition, and reduceByKey.

The number of reducers in a shuffle depends on the specific operator. Some operators let you specify the reducer count directly; for example, repartition(100) launches 100 reducer tasks. Others determine it through internal logic; for example, reduceByKey without an explicit count falls back to Spark's default partitioner, which reuses the largest partitioner among the parent RDDs, or failing that uses spark.default.parallelism (or, if that is unset, the largest parent partition count).
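
A minimal sketch of both cases; the local SparkContext setup and the sample data below are illustrative, not from the original article:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("reducer-count").setMaster("local[4]"))

    // A parent RDD with 8 partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 8)

    // Explicit count: repartition(100) shuffles into 100 reducer tasks.
    println(pairs.repartition(100).getNumPartitions)        // 100

    // Explicit count passed to the operator itself.
    println(pairs.reduceByKey(_ + _, 50).getNumPartitions)  // 50

    // Implicit: no count given, so the default partitioner applies; here it
    // inherits the parent's 8 partitions (spark.default.parallelism is unset).
    println(pairs.reduceByKey(_ + _).getNumPartitions)      // 8

The later sketches in this article reuse this sc.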

Which operators in Spark trigger a shuffle? The main ones are listed below:

Deduplication operators

    def distinct(): RDD[T]
    def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
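
Both variants shuffle; the second sizes the reducer side directly. A sketch reusing the sc from above (the nums data is hypothetical):

    val nums = sc.parallelize(1 to 100, 8)
    println(nums.distinct().getNumPartitions)   // 8: distinct() reuses the parent's partition count
    println(nums.distinct(20).getNumPartitions) // 20: explicit reducer count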

Aggregation operators

    def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
    def groupBy[K](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]
    def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
    def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner): RDD[(K, U)]
    def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int): RDD[(K, U)]
    def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
    def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
    def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
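
These variants show the two ways to size the reducer side: pass numPartitions (wrapped into a HashPartitioner internally) or pass a Partitioner, whose numPartitions becomes the reducer count. A sketch reusing the sc from above, with illustrative data:

    import org.apache.spark.HashPartitioner

    val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 8)

    // numPartitions = 10 reducer tasks.
    println(kv.aggregateByKey(0, 10)(_ + _, _ + _).getNumPartitions) // 10

    // The partitioner's numPartitions is the reducer count.
    println(kv.groupByKey(new HashPartitioner(16)).getNumPartitions) // 16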

Sorting operators

    def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]
    def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
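
sortByKey shuffles through a RangePartitioner, so numPartitions sets both the reducer count and the number of sorted output ranges; note that it defaults to the parent's partition count rather than to the default partitioner. A sketch reusing sc, with illustrative data:

    val scores = sc.parallelize((1 to 1000).map(i => (i, i)), 8)
    println(scores.sortByKey().getNumPartitions)        // 8: defaults to self.partitions.length
    println(scores.sortByKey(true, 4).getNumPartitions) // 4: explicit reducer count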

Repartitioning operators

    def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T]
    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
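
Note the shuffle flag on coalesce: with the default shuffle = false it merges partitions through a narrow dependency, so no reducers run at all; repartition(n) is simply coalesce(n, shuffle = true). A sketch reusing sc:

    val wide = sc.parallelize(1 to 100, 100)
    println(wide.coalesce(10).getNumPartitions)                 // 10, no shuffle, no reducers
    println(wide.coalesce(10, shuffle = true).getNumPartitions) // 10 reducer tasks
    println(wide.repartition(10).getNumPartitions)              // same as coalesce(10, shuffle = true)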

Set and table operation operators

    def intersection(other: RDD[T]): RDD[T]
    def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
    def intersection(other: RDD[T], numPartitions: Int): RDD[T]
    def subtract(other: RDD[T], numPartitions: Int): RDD[T]
    def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
    def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
    def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]
    def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]
    def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
    def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
    def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
    def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
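
Joins follow the same rules: an explicit numPartitions or Partitioner wins; otherwise the default partitioner picks the largest partitioner among the inputs, or failing that the larger input's partition count. A sketch reusing sc; the result of 8 below assumes spark.default.parallelism is not set:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)), 4)
    val right = sc.parallelize(Seq(("a", "x"), ("c", "y")), 8)

    println(left.join(right, 12).getNumPartitions) // 12: explicit reducer count
    println(left.join(right).getNumPartitions)     // 8: larger input's partition count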
