The Five Main Properties of an RDD

Author: 大数据修行 | Published 2019-05-24 15:44

    1. a list of partitions

    2. a function for computing each split

    3. a list of dependencies on other RDDs

    4. optionally, a partitioner for key-value RDDs

    (for example, repartitioning by the hash of the key)

    5. optionally, a list of preferred locations to compute each split on
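
    Property 4 mentions repartitioning by the hash of the key. The sketch below illustrates that idea in plain Scala with no Spark dependency. The names `HashPartitionSketch` and `partitionFor` are hypothetical and only serve the illustration; the non-negative modulo follows the same idea as Spark's `HashPartitioner`, but this is not Spark's code.

```scala
// Plain-Scala sketch of hash partitioning; all names here are hypothetical.
object HashPartitionSketch {

  // Map a key to a partition index in [0, numPartitions).
  // hashCode may be negative, so the remainder is shifted back into range.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (key == null) {
      0 // assumption: null keys all go to partition 0
    } else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("a" -> 1, "b" -> 2, "c" -> 3, "a" -> 4)
    // Records with the same key always land in the same partition.
    val byPartition = pairs.groupBy { case (k, _) => partitionFor(k, 2) }
    byPartition.toSeq.sortBy(_._1).foreach { case (p, kvs) =>
      println(s"partition $p: $kvs")
    }
  }
}
```

    Because the index depends only on the key, equal keys are always placed together, which is what makes a partitioner useful for key-based operations on key-value RDDs.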

      /**
       * :: DeveloperApi ::
       * Implemented by subclasses to compute a given partition.
       */
      @DeveloperApi
      def compute(split: Partition, context: TaskContext): Iterator[T]
    
      /**
       * Implemented by subclasses to return the set of partitions in this RDD. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       *
       * The partitions in this array must satisfy the following property:
       *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
       */
      protected def getPartitions: Array[Partition]
    
      /**
       * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       */
      protected def getDependencies: Seq[Dependency[_]] = deps
    
      /**
       * Optionally overridden by subclasses to specify placement preferences.
       */
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    
      /** Optionally overridden by subclasses to specify how they are partitioned. */
      @transient val partitioner: Option[Partitioner] = None
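
    To show how the members above fit together, here is a minimal sketch of a custom RDD, assuming Spark core is on the classpath. `RangeSlice`, `IntRangeRDD`, and `IntRangeRDDExample` are hypothetical names made up for this example; only `RDD`, `Partition`, `TaskContext`, and `SparkContext` come from Spark.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: one contiguous slice [start, end) of an integer range.
class RangeSlice(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD with no parents that yields the integers in [0, n),
// spread over numSlices partitions.
class IntRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) { // Nil: no dependencies on other RDDs (property 3)

  // Property 1: the list of partitions; the partition at position i has index i.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      val start = (i.toLong * n / numSlices).toInt
      val end = ((i + 1).toLong * n / numSlices).toInt
      new RangeSlice(i, start, end)
    }

  // Property 2: the function that computes one split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val slice = split.asInstanceOf[RangeSlice]
    (slice.start until slice.end).iterator
  }

  // Property 4: `partitioner` keeps its default of None, since this is not a key-value RDD.

  // Property 5: no placement preference, same as the default.
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

object IntRangeRDDExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "int-range-rdd")
    try {
      val rdd = new IntRangeRDD(sc, 10, 3)
      println(s"partitions: ${rdd.partitions.length}")
      println(s"elements: ${rdd.collect().mkString(",")}")
    } finally {
      sc.stop()
    }
  }
}
```

    Only `getPartitions` and `compute` have to be implemented; the other three members have defaults, which matches the "optionally" wording of properties 4 and 5.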
    
