【Spark Java API】Action(5)—treeAggregate, treeReduce

Author: 小飞_侠_kobe | Published 2016-08-19 16:16

    treeAggregate


    Official documentation:

    Aggregates the elements of this RDD in a multi-level tree pattern.
    

    Function prototypes:

    def treeAggregate[U](    
        zeroValue: U,    
        seqOp: JFunction2[U, T, U],    
        combOp: JFunction2[U, U, U],
        depth: Int): U 
    def treeAggregate[U](    
        zeroValue: U,    
        seqOp: JFunction2[U, T, U],    
        combOp: JFunction2[U, U, U]): U 
    

    **treeAggregate can be understood as a more elaborate, multi-level version of aggregate.**
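
    For intuition, here is a minimal sketch in the snippet style of the examples below (the JavaSparkContext named javaSparkContext is assumed to already exist, and Java 8 lambdas stand in for Function2 instances) that sums an RDD of integers with treeAggregate; the result is the same as a plain aggregate, only the combine phase is organized as a tree:

    List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
    JavaRDD<Integer> rdd = javaSparkContext.parallelize(nums, 4);
    // zeroValue = 0; seqOp folds one element into a partition's partial sum;
    // combOp merges two partial sums; depth = 2 is the default tree depth.
    Integer sum = rdd.treeAggregate(0,
        (Integer acc, Integer x) -> acc + x,   // seqOp
        (Integer a, Integer b) -> a + b,       // combOp
        2);
    System.out.println(sum);                   // 36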

    Source code analysis:

    def treeAggregate[U: ClassTag](zeroValue: U)(
        seqOp: (U, T) => U,
        combOp: (U, U) => U,
        depth: Int = 2): U = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      if (partitions.length == 0) {
        Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
      } else {
        val cleanSeqOp = context.clean(seqOp)
        val cleanCombOp = context.clean(combOp)
        val aggregatePartition =
          (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
        var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
        var numPartitions = partiallyAggregated.partitions.length
        val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
        // If creating an extra level doesn't help reduce the wall-clock time,
        // we stop tree aggregation.
        while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
          numPartitions /= scale
          val curNumPartitions = numPartitions
          partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
            (i, iter) => iter.map((i % curNumPartitions, _))
          }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
        }
        partiallyAggregated.reduce(cleanCombOp)
      }
    }
    

    **As the source shows, treeAggregate first aggregates each partition locally with Scala's aggregate. It then derives a scale factor from the depth parameter: while there are still too many partitions, each partial result is keyed by i % curNumPartitions and merged by key with reduceByKey under a HashPartitioner. Finally, the remaining partial results are combined with a single reduce. Tuning the depth therefore limits how much work that final reduce has to do; the level arithmetic is sketched below.**
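
    To make the level arithmetic concrete, here is a small standalone sketch (plain Java, not Spark code; the partition count of 64 and the depth of 3 are made-up illustrative numbers) of how scale is derived and how the partition count shrinks level by level:

    int numPartitions = 64;
    int depth = 3;
    // scale = max(ceil(numPartitions^(1 / depth)), 2)  ->  4 for these numbers
    int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
    // Same stopping rule as the source: add a level only while it saves work.
    while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
        numPartitions /= scale;
        System.out.println("next level: " + numPartitions + " partitions");  // prints 16, then 4
    }
    // The remaining partial results are merged by the final reduce(cleanCombOp).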

    Example:

    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,3);
    // transformation: convert each Integer to a String
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
      @Override    
      public String call(Integer v1) throws Exception {        
        return Integer.toString(v1);    
      }
    });
    
    String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {    
      @Override    
      public String call(String v1, String v2) throws Exception {        
        System.out.println(v1 + "=seq=" + v2);        
        return v1 + "=seq=" + v2;    
      }
    }, new Function2<String, String, String>() {    
        @Override    
        public String call(String v1, String v2) throws Exception {        
          System.out.println(v1 + "<=comb=>" + v2);        
          return v1 + "<=comb=>" + v2;    
      }
    });
    System.out.println(result1);
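
    The depth overload from the prototypes above is called the same way; below is a minimal sketch reusing javaRDD1 from this example (the explicit depth of 3 is just an illustrative value):

    String result2 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            return v1 + "=seq=" + v2;      // seqOp: fold one element into the partial result
        }
    }, new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            return v1 + "<=comb=>" + v2;   // combOp: merge two partial results
        }
    }, 3);                                 // depth: maximum number of tree levels
    System.out.println(result2);

    With only three partitions the while condition in the source is never satisfied, so the extra levels change nothing here; on an RDD with many partitions, a larger depth reduces how many partial results the final reduce has to merge.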
    

    treeReduce


    Official documentation:

    Reduces the elements of this RDD in a multi-level tree pattern.
    

    Function prototypes:

    def treeReduce(f: JFunction2[T, T, T], depth: Int): T
    def treeReduce(f: JFunction2[T, T, T]): T
    

    **treeReduce is similar to treeAggregate; it is effectively a treeAggregate whose seqOp and combOp are the same function.**

    Source code analysis:

    def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      val cleanF = context.clean(f)
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF))
        } else {
          None
        }
      }
      val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
      val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
        if (c.isDefined && x.isDefined) {
          Some(cleanF(c.get, x.get))
        } else if (c.isDefined) {
          c
        } else if (x.isDefined) {
          x
        } else {
          None
        }
      }
      partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
        .getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
    

    **As the source shows, treeReduce first reduces each partition with Scala's reduceLeft and wraps each partial result in an Option; it then runs treeAggregate over those partial results, using the same operator for seqOp and combOp and Option.empty as the zero value. In practice, treeReduce can stand in for reduce when a single reduce step would be too expensive: the depth parameter controls how much merging is done per level, as sketched below.**
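
    Because treeReduce takes the same function as reduce, it can be swapped in directly. A minimal sketch (the JavaRDD<Integer> named rdd is a hypothetical RDD with many partitions, and Java 8 lambdas stand in for Function2 instances):

    // reduce: every partition's result is merged on the driver in one pass
    Integer total = rdd.reduce((Integer a, Integer b) -> a + b);
    // treeReduce: partition results are first combined in a tree of depth 3,
    // so the driver only merges the few results left at the top level
    Integer treeTotal = rdd.treeReduce((Integer a, Integer b) -> a + b, 3);
    // total and treeTotal are equal; only the combine topology differs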

    Example:

    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data,5);
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {    
        @Override    
        public String call(Integer v1) throws Exception {        
          return Integer.toString(v1);    
        }
    });
    String result = javaRDD1.treeReduce(new Function2<String, String, String>() {    
        @Override    
        public String call(String v1, String v2) throws Exception {        
          System.out.println(v1 + "=" + v2);        
          return v1 + "=" + v2;    
      }
    });
    System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + treeReduceRDD);
    
