【Spark Java API】Action(5)—treeAggregate, treeReduce

Author: 小飞_侠_kobe | Published 2016-08-19 16:16

    treeAggregate


    Official documentation:

    Aggregates the elements of this RDD in a multi-level tree pattern.
    

    Function prototypes:

    def treeAggregate[U](    
        zeroValue: U,    
        seqOp: JFunction2[U, T, U],    
        combOp: JFunction2[U, U, U],
        depth: Int): U 
    def treeAggregate[U](    
        zeroValue: U,    
        seqOp: JFunction2[U, T, U],    
        combOp: JFunction2[U, U, U]): U 
    

    **treeAggregate can be understood as a more elaborate, multi-level version of aggregate.**
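
    For intuition, here is a minimal sketch in the snippet style of the examples below (the JavaSparkContext named javaSparkContext is assumed to already exist, and Java 8 lambdas stand in for Function2 instances) that sums an RDD of integers with treeAggregate; the result is the same as a plain aggregate, only the combine phase is organized as a tree:

    List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
    JavaRDD<Integer> rdd = javaSparkContext.parallelize(nums, 4);
    // zeroValue = 0; seqOp folds one element into a partition's partial sum;
    // combOp merges two partial sums; depth = 2 is the default tree depth.
    Integer sum = rdd.treeAggregate(0,
        (Integer acc, Integer x) -> acc + x,   // seqOp
        (Integer a, Integer b) -> a + b,       // combOp
        2);
    System.out.println(sum);                   // 36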

    Source code analysis:

    def treeAggregate[U: ClassTag](zeroValue: U)(
        seqOp: (U, T) => U,
        combOp: (U, U) => U,
        depth: Int = 2): U = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      if (partitions.length == 0) {
        Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
      } else {
        val cleanSeqOp = context.clean(seqOp)
        val cleanCombOp = context.clean(combOp)
        val aggregatePartition =
          (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
        var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
        var numPartitions = partiallyAggregated.partitions.length
        val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
        // If creating an extra level doesn't help reduce the wall-clock time,
        // we stop tree aggregation.
        while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
          numPartitions /= scale
          val curNumPartitions = numPartitions
          partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
            (i, iter) => iter.map((i % curNumPartitions, _))
          }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
        }
        partiallyAggregated.reduce(cleanCombOp)
      }
    }
    

    **As the source shows, treeAggregate first aggregates each partition locally with Scala's aggregate. It then derives a scale factor from the depth parameter: while there are still too many partitions, each partial result is keyed by i % curNumPartitions and merged by key with reduceByKey under a HashPartitioner. Finally, the remaining partial results are combined with a single reduce. Tuning the depth therefore limits how much work that final reduce has to do; the level arithmetic is sketched below.**
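
    To make the level arithmetic concrete, here is a small standalone sketch (plain Java, not Spark code; the partition count of 64 and the depth of 3 are made-up illustrative numbers) of how scale is derived and how the partition count shrinks level by level:

    int numPartitions = 64;
    int depth = 3;
    // scale = max(ceil(numPartitions^(1 / depth)), 2)  ->  4 for these numbers
    int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
    // Same stopping rule as the source: add a level only while it saves work.
    while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
        numPartitions /= scale;
        System.out.println("next level: " + numPartitions + " partitions");  // prints 16, then 4
    }
    // The remaining partial results are merged by the final reduce(cleanCombOp).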

    Example:

    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,3);
    // transformation: convert each Integer to a String
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
      @Override    
      public String call(Integer v1) throws Exception {        
        return Integer.toString(v1);    
      }
    });
    
    String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {    
      @Override    
      public String call(String v1, String v2) throws Exception {        
        System.out.println(v1 + "=seq=" + v2);        
        return v1 + "=seq=" + v2;    
      }
    }, new Function2<String, String, String>() {    
        @Override    
        public String call(String v1, String v2) throws Exception {        
          System.out.println(v1 + "<=comb=>" + v2);        
          return v1 + "<=comb=>" + v2;    
      }
    });
    System.out.println(result1);
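
    The depth overload from the prototypes above is called the same way; below is a minimal sketch reusing javaRDD1 from this example (the explicit depth of 3 is just an illustrative value):

    String result2 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            return v1 + "=seq=" + v2;      // seqOp: fold one element into the partial result
        }
    }, new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            return v1 + "<=comb=>" + v2;   // combOp: merge two partial results
        }
    }, 3);                                 // depth: maximum number of tree levels
    System.out.println(result2);

    With only three partitions the while condition in the source is never satisfied, so the extra levels change nothing here; on an RDD with many partitions, a larger depth reduces how many partial results the final reduce has to merge.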
    

    treeReduce


    Official documentation:

    Reduces the elements of this RDD in a multi-level tree pattern.
    

    Function prototypes:

    def treeReduce(f: JFunction2[T, T, T], depth: Int): T
    def treeReduce(f: JFunction2[T, T, T]): T
    

    **treeReduce is similar to treeAggregate; it is effectively a treeAggregate whose seqOp and combOp are the same function.**

    Source code analysis:

    def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      val cleanF = context.clean(f)
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF))
        } else {
          None
        }
      }
      val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
      val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
        if (c.isDefined && x.isDefined) {
          Some(cleanF(c.get, x.get))
        } else if (c.isDefined) {
          c
        } else if (x.isDefined) {
          x
        } else {
          None
        }
      }
      partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
        .getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
    

    **As the source shows, treeReduce first reduces each partition with Scala's reduceLeft and wraps each partial result in an Option; it then runs treeAggregate over those partial results, using the same operator for seqOp and combOp and Option.empty as the zero value. In practice, treeReduce can stand in for reduce when a single reduce step would be too expensive: the depth parameter controls how much merging is done per level, as sketched below.**
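
    Because treeReduce takes the same function as reduce, it can be swapped in directly. A minimal sketch (the JavaRDD<Integer> named rdd is a hypothetical RDD with many partitions, and Java 8 lambdas stand in for Function2 instances):

    // reduce: every partition's result is merged on the driver in one pass
    Integer total = rdd.reduce((Integer a, Integer b) -> a + b);
    // treeReduce: partition results are first combined in a tree of depth 3,
    // so the driver only merges the few results left at the top level
    Integer treeTotal = rdd.treeReduce((Integer a, Integer b) -> a + b, 3);
    // total and treeTotal are equal; only the combine topology differs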

    Example:

    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,5);
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
        @Override    
        public String call(Integer v1) throws Exception {        
          return Integer.toString(v1);    
        }
    });
    String result = javaRDD1.treeReduce(new Function2<String, String, String>() {    
        @Override    
        public String call(String v1, String v2) throws Exception {        
          System.out.println(v1 + "=" + v2);        
          return v1 + "=" + v2;    
      }
    });
    System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + treeReduceRDD);
    
