【Spark Java API】Transformation(8

Author: 小飞_侠_kobe | Source: published 2016-03-08 10:44, read 924 times

    fullOuterJoin


    Official documentation:

    Perform a full outer join of `this` and `other`. For each element (k, v) in `this`, 
    the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in `other`, 
    or the pair (k, (Some(v), None)) if no elements in `other` have key k. Similarly, 
    for each element (k, w) in `other`, the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) 
    for v in `this`, or the pair (k, (None, Some(w))) if no elements in `this` have key k. 
    Uses the given Partitioner to partition the output RDD.
    

    Function prototypes:

    def fullOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (Optional[V], Optional[W])]
    
    def fullOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
    : JavaPairRDD[K, (Optional[V], Optional[W])]
    
    def fullOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
    : JavaPairRDD[K, (Optional[V], Optional[W])]
    

    Source code analysis:

    def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Option[V], Option[W]))] = self.withScope {
      this.cogroup(other, partitioner).flatMapValues {
        case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))
        case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))
        case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))
      }
    }
    

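    As the source shows, fullOuterJoin() works much like join(). It first calls cogroup(), which yields an RDD of <K, (Iterable[V1], Iterable[V2])> pairs. It then uses flatMapValues() to take the Cartesian product of Iterable[V1] and Iterable[V2] for each key and flatten the result. Values on both sides are wrapped in Option: Some(...) when the key exists on that side, and None when it is missing from that side.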
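    The cogroup-then-flatten logic can be tried without a cluster. The following is a minimal plain-Java sketch, not Spark code: the class FullOuterJoinSketch and its helpers kv, cogroup, and row are hypothetical names, java.util.Optional stands in for the Optional type of the Spark Java API, and rows are rendered as strings for easy printing.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, Spark-free illustration of fullOuterJoin semantics.
public class FullOuterJoinSketch {

    public static SimpleEntry<Integer, String> kv(int k, String v) {
        return new SimpleEntry<Integer, String>(k, v);
    }

    // Like cogroup(): key -> [values from left, values from right].
    public static Map<Integer, List<List<String>>> cogroup(
            List<SimpleEntry<Integer, String>> left,
            List<SimpleEntry<Integer, String>> right) {
        Map<Integer, List<List<String>>> grouped = new LinkedHashMap<>();
        for (SimpleEntry<Integer, String> e : left) {
            grouped.computeIfAbsent(e.getKey(), k -> emptyPair()).get(0).add(e.getValue());
        }
        for (SimpleEntry<Integer, String> e : right) {
            grouped.computeIfAbsent(e.getKey(), k -> emptyPair()).get(1).add(e.getValue());
        }
        return grouped;
    }

    private static List<List<String>> emptyPair() {
        List<List<String>> pair = new ArrayList<>();
        pair.add(new ArrayList<String>());
        pair.add(new ArrayList<String>());
        return pair;
    }

    private static String row(Integer k, Optional<String> v, Optional<String> w) {
        return "(" + k + ",(" + v + "," + w + "))";
    }

    // Mirrors the three cases of the Scala pattern match.
    public static List<String> fullOuterJoin(
            List<SimpleEntry<Integer, String>> left,
            List<SimpleEntry<Integer, String>> right) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<List<String>>> g : cogroup(left, right).entrySet()) {
            List<String> vs = g.getValue().get(0);
            List<String> ws = g.getValue().get(1);
            if (ws.isEmpty()) {            // key only on the left side
                for (String v : vs) out.add(row(g.getKey(), Optional.of(v), Optional.<String>empty()));
            } else if (vs.isEmpty()) {     // key only on the right side
                for (String w : ws) out.add(row(g.getKey(), Optional.<String>empty(), Optional.of(w)));
            } else {                       // key on both sides: Cartesian product
                for (String v : vs) {
                    for (String w : ws) out.add(row(g.getKey(), Optional.of(v), Optional.of(w)));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> left = Arrays.asList(kv(1, "a"), kv(2, "b"), kv(2, "c"));
        List<SimpleEntry<Integer, String>> right = Arrays.asList(kv(2, "x"), kv(3, "y"));
        for (String r : fullOuterJoin(left, right)) {
            System.out.println(r);
        }
        // (1,(Optional[a],Optional.empty))
        // (2,(Optional[b],Optional[x]))
        // (2,(Optional[c],Optional[x]))
        // (3,(Optional.empty,Optional[y]))
    }
}
```

    Key 2 appears twice on the left and once on the right, so it contributes 2 x 1 rows. Unlike this sketch, Spark does not guarantee any row order.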

    Example:

    List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
    final Random random = new Random();
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
    // Pair each element with a random value in [0, 10)
    JavaPairRDD<Integer,Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
      }
    });

    // Full outer join with the default partitioner.
    // collect() is needed to print the rows; printing the RDD itself only shows its toString().
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Optional<Integer>>> fullJoinRDD = javaPairRDD.fullOuterJoin(javaPairRDD);
    System.out.println(fullJoinRDD.collect());

    // Full outer join into 2 partitions
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Optional<Integer>>> fullJoinRDD1 = javaPairRDD.fullOuterJoin(javaPairRDD, 2);
    System.out.println(fullJoinRDD1.collect());

    // Full outer join with a custom Partitioner
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Optional<Integer>>> fullJoinRDD2 = javaPairRDD.fullOuterJoin(javaPairRDD, new Partitioner() {
      @Override
      public int numPartitions() { return 2; }
      @Override
      public int getPartition(Object key) {
        // Mask the sign bit so the partition id is never negative
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions();
      }
    });
    System.out.println(fullJoinRDD2.collect());
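    A custom Partitioner must map every key to a partition id between 0 and numPartitions - 1, but Java's % operator keeps the sign of the dividend, so hashCode() % numPartitions() can be negative for keys with a negative hash code. The sketch below is plain Java with hypothetical names (NonNegativePartitionSketch, naivePartition, safePartition) and assumes Java 8 or later for Math.floorMod; it only illustrates the arithmetic and does not extend Spark's Partitioner.

```java
// Hypothetical illustration of why a partition id needs a non-negative modulo.
public class NonNegativePartitionSketch {

    // Naive version: the result has the sign of hashCode(), so it can be negative.
    public static int naivePartition(Object key, int numPartitions) {
        return key.hashCode() % numPartitions;
    }

    // Math.floorMod always returns a value in [0, numPartitions) for a positive divisor.
    public static int safePartition(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        Object key = -7; // Integer.hashCode() is the int value itself
        System.out.println(naivePartition(key, 2)); // prints -1
        System.out.println(safePartition(key, 2));  // prints 1
    }
}
```

    Masking the sign bit with `& Integer.MAX_VALUE` before taking the remainder is an alternative that also works on older Java versions.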
    

    leftOuterJoin


    Official documentation:

    Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, 
    the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, 
    or the pair (k, (v, None)) if no elements in `other` have key k. 
    Uses the given Partitioner to partition the output RDD.
    

    Function prototypes:

    def leftOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (V, Optional[W])]
    
    def leftOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
    : JavaPairRDD[K, (V, Optional[W])]
    
    def leftOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, Optional[W])] 
    

    Source code analysis:

    def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (V, Option[W]))] = self.withScope {
      this.cogroup(other, partitioner).flatMapValues { pair =>
        if (pair._2.isEmpty) {
          pair._1.iterator.map(v => (v, None))
        } else {
          for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
        }
      }
    }
    

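    As the source shows, leftOuterJoin() is similar to fullOuterJoin(). It first calls cogroup(), which yields an RDD of <K, (Iterable[V1], Iterable[V2])> pairs, then uses flatMapValues() to take the Cartesian product of Iterable[V1] and Iterable[V2] for each key and flatten the result. Only the right-hand values (V2) are wrapped in Option: a left value with no match in `other` is paired with None, and keys that appear only in `other` produce no output.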
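    The same result rows can be produced with a simple hash lookup instead of cogroup(). The following plain-Java sketch is an illustration under that assumption, not how Spark implements the join: LeftOuterJoinSketch, kv, and row are hypothetical names, and java.util.Optional stands in for the Optional type of the Spark Java API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, Spark-free illustration of leftOuterJoin semantics.
public class LeftOuterJoinSketch {

    public static <K, V> SimpleEntry<K, V> kv(K k, V v) {
        return new SimpleEntry<K, V>(k, v);
    }

    private static <K, V, W> String row(K k, V v, Optional<W> w) {
        return "(" + k + ",(" + v + "," + w + "))";
    }

    public static <K, V, W> List<String> leftOuterJoin(
            List<SimpleEntry<K, V>> left, List<SimpleEntry<K, W>> right) {
        // Index the right side by key.
        Map<K, List<W>> rightByKey = new HashMap<>();
        for (SimpleEntry<K, W> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), k -> new ArrayList<W>()).add(e.getValue());
        }
        List<String> out = new ArrayList<>();
        for (SimpleEntry<K, V> e : left) {
            List<W> ws = rightByKey.get(e.getKey());
            if (ws == null) {
                // No match on the right: keep the left value, pair it with an empty Optional.
                out.add(row(e.getKey(), e.getValue(), Optional.<W>empty()));
            } else {
                // One output row per matching right value.
                for (W w : ws) {
                    out.add(row(e.getKey(), e.getValue(), Optional.of(w)));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> left = Arrays.asList(kv(1, "a"), kv(2, "b"));
        List<SimpleEntry<Integer, String>> right = Arrays.asList(kv(2, "x"), kv(2, "y"), kv(3, "z"));
        for (String r : leftOuterJoin(left, right)) {
            System.out.println(r);
        }
        // (1,(a,Optional.empty))
        // (2,(b,Optional[x]))
        // (2,(b,Optional[y]))
    }
}
```

    Key 3 exists only on the right side, so it does not appear in the output.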

    Example:

    List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
    final Random random = new Random();
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
    // Pair each element with a random value in [0, 10)
    JavaPairRDD<Integer,Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
      }
    });

    // Left outer join with the default partitioner.
    // collect() is needed to print the rows; printing the RDD itself only shows its toString().
    JavaPairRDD<Integer,Tuple2<Integer,Optional<Integer>>> leftJoinRDD = javaPairRDD.leftOuterJoin(javaPairRDD);
    System.out.println(leftJoinRDD.collect());

    // Left outer join into 2 partitions
    JavaPairRDD<Integer,Tuple2<Integer,Optional<Integer>>> leftJoinRDD1 = javaPairRDD.leftOuterJoin(javaPairRDD, 2);
    System.out.println(leftJoinRDD1.collect());

    // Left outer join with a custom Partitioner
    JavaPairRDD<Integer,Tuple2<Integer,Optional<Integer>>> leftJoinRDD2 = javaPairRDD.leftOuterJoin(javaPairRDD, new Partitioner() {
      @Override
      public int numPartitions() { return 2; }
      @Override
      public int getPartition(Object key) {
        // Mask the sign bit so the partition id is never negative
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions();
      }
    });
    System.out.println(leftJoinRDD2.collect());
    

    rightOuterJoin


    Official documentation:

    Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, 
    the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, 
    or the pair (k, (None, w)) if no elements in `this` have key k. 
    Uses the given Partitioner to partition the output RDD.
    

    Function prototypes:

    def rightOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (Optional[V], W)]
    
    def rightOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
    : JavaPairRDD[K, (Optional[V], W)]
    
    def rightOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (Optional[V], W)]
    

    Source code analysis:

    def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Option[V], W))] = self.withScope {
      this.cogroup(other, partitioner).flatMapValues { pair =>
        if (pair._1.isEmpty) {
          pair._2.iterator.map(w => (None, w))
        } else {
          for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
        }
      }
    }
    

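    As the source shows, rightOuterJoin() is similar to fullOuterJoin(). It first calls cogroup(), which yields an RDD of <K, (Iterable[V1], Iterable[V2])> pairs, then uses flatMapValues() to take the Cartesian product of Iterable[V1] and Iterable[V2] for each key and flatten the result. Only the left-hand values (V1) are wrapped in Option: a value from `other` with no match in `this` is paired with None, and keys that appear only in `this` produce no output.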
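    One way to see the symmetry is that a right outer join yields the same rows as a left outer join with the two inputs swapped and each value tuple flipped back. The plain-Java sketch below illustrates this under stated assumptions: RightOuterJoinSketch, kv, and the local leftOuterJoin helper are hypothetical names, java.util.Optional stands in for the Optional type of the Spark Java API, and this is not how Spark itself implements rightOuterJoin().

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, Spark-free illustration: right outer join as a mirrored left outer join.
public class RightOuterJoinSketch {

    public static <K, V> SimpleEntry<K, V> kv(K k, V v) {
        return new SimpleEntry<K, V>(k, v);
    }

    // Left outer join returning structured rows: (key, (leftValue, Optional<rightValue>)).
    public static <K, A, B> List<SimpleEntry<K, SimpleEntry<A, Optional<B>>>> leftOuterJoin(
            List<SimpleEntry<K, A>> left, List<SimpleEntry<K, B>> right) {
        Map<K, List<B>> index = new HashMap<>();
        for (SimpleEntry<K, B> e : right) {
            index.computeIfAbsent(e.getKey(), k -> new ArrayList<B>()).add(e.getValue());
        }
        List<SimpleEntry<K, SimpleEntry<A, Optional<B>>>> out = new ArrayList<>();
        for (SimpleEntry<K, A> e : left) {
            List<B> matches = index.getOrDefault(e.getKey(), Collections.<B>emptyList());
            if (matches.isEmpty()) {
                SimpleEntry<A, Optional<B>> value =
                        new SimpleEntry<A, Optional<B>>(e.getValue(), Optional.<B>empty());
                out.add(new SimpleEntry<K, SimpleEntry<A, Optional<B>>>(e.getKey(), value));
            } else {
                for (B b : matches) {
                    SimpleEntry<A, Optional<B>> value =
                            new SimpleEntry<A, Optional<B>>(e.getValue(), Optional.of(b));
                    out.add(new SimpleEntry<K, SimpleEntry<A, Optional<B>>>(e.getKey(), value));
                }
            }
        }
        return out;
    }

    // Swap the inputs, then flip each value tuple back to (Optional<V>, W).
    public static <K, V, W> List<String> rightOuterJoin(
            List<SimpleEntry<K, V>> left, List<SimpleEntry<K, W>> right) {
        List<String> out = new ArrayList<>();
        for (SimpleEntry<K, SimpleEntry<W, Optional<V>>> r : leftOuterJoin(right, left)) {
            W w = r.getValue().getKey();
            Optional<V> v = r.getValue().getValue();
            out.add("(" + r.getKey() + ",(" + v + "," + w + "))");
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> left = Arrays.asList(kv(1, "a"), kv(2, "b"));
        List<SimpleEntry<Integer, String>> right = Arrays.asList(kv(2, "x"), kv(3, "y"));
        for (String r : rightOuterJoin(left, right)) {
            System.out.println(r);
        }
        // (2,(Optional[b],x))
        // (3,(Optional.empty,y))
    }
}
```

    Key 1 exists only on the left side, so it does not appear in the output.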

    Example:

    List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
    final Random random = new Random();
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
    // Pair each element with a random value in [0, 10)
    JavaPairRDD<Integer,Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
      @Override
      public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
      }
    });

    // Right outer join with the default partitioner.
    // collect() is needed to print the rows; printing the RDD itself only shows its toString().
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Integer>> rightJoinRDD = javaPairRDD.rightOuterJoin(javaPairRDD);
    System.out.println(rightJoinRDD.collect());

    // Right outer join into 2 partitions
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Integer>> rightJoinRDD1 = javaPairRDD.rightOuterJoin(javaPairRDD, 2);
    System.out.println(rightJoinRDD1.collect());

    // Right outer join with a custom Partitioner
    JavaPairRDD<Integer,Tuple2<Optional<Integer>,Integer>> rightJoinRDD2 = javaPairRDD.rightOuterJoin(javaPairRDD, new Partitioner() {
      @Override
      public int numPartitions() { return 2; }
      @Override
      public int getPartition(Object key) {
        // Mask the sign bit so the partition id is never negative
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions();
      }
    });
    System.out.println(rightJoinRDD2.collect());
    

        Article link: https://www.haomeiwen.com/subject/rikzkttx.html