The Five Main Properties of an RDD

Author: 大数据修行 | Published 2019-05-24 15:44

1. a list of partitions

2. a function for computing each split

3. a list of dependencies on other RDDs

4. optionally, a partitioner for key-value RDDs

(for example, repartitioning by the hash of each key)

5. optionally, a list of preferred locations to compute each split on
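The five properties listed above can be inspected on a live RDD through its public accessors. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `InspectRddProperties`, the app name, and the sample data are illustrative and not part of the original post.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; the name and sample data are assumptions.
object InspectRddProperties {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative choices.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), numSlices = 2)
      // Property 4: repartition by the hash of each key.
      val hashed = pairs.partitionBy(new HashPartitioner(4))

      // Property 1: the list of partitions (4, as requested from the partitioner).
      assert(hashed.partitions.length == 4)
      // Property 3: dependencies on parent RDDs (here, a shuffle dependency on `pairs`).
      println(hashed.dependencies)
      // Property 4: no partitioner before partitionBy, one after.
      assert(pairs.partitioner.isEmpty)
      assert(hashed.partitioner.isDefined)
      // Property 5: preferred locations for each partition (may be empty locally).
      hashed.partitions.foreach { p =>
        println(s"partition ${p.index}: ${hashed.preferredLocations(p)}")
      }
      // Property 2: compute() is not called directly; Spark invokes it per
      // partition when an action such as collect() runs.
      println(hashed.collect().toSeq)
    } finally {
      sc.stop()
    }
  }
}
```

Note that `dependencies` and `preferredLocations` are the public wrappers around the protected `getDependencies` and `getPreferredLocations` hooks that subclasses override.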

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None
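To show how these hooks fit together, here is a minimal sketch of a custom RDD that overrides only the two required members, `getPartitions` and `compute`, and leaves the optional ones at their defaults. It assumes Spark is on the classpath and is compiled as a standalone application; `RangeSlice`, `SimpleRangeRDD`, and `SimpleRangeDemo` are hypothetical names invented for this illustration.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: a half-open range [start, end) of integers.
class RangeSlice(val index: Int, val start: Int, val end: Int) extends Partition

// Hypothetical RDD producing the integers 0 until n, split into numSlices partitions.
// Passing Nil as the dependency list means this RDD has no parents.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {

  // Property 1: the list of partitions; partition i must report index i.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangeSlice(i, i * n / numSlices, (i + 1) * n / numSlices)
    }

  // Property 2: the function that computes one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val s = split.asInstanceOf[RangeSlice]
    (s.start until s.end).iterator
  }

  // Properties 4 and 5 keep their defaults: partitioner = None and
  // getPreferredLocations = Nil.
}

object SimpleRangeDemo {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative choices.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("simple-range"))
    try {
      val rdd = new SimpleRangeRDD(sc, n = 10, numSlices = 3)
      assert(rdd.partitions.length == 3)
      assert(rdd.dependencies.isEmpty)
      assert(rdd.partitioner.isEmpty)
      assert(rdd.collect().toSeq == (0 until 10))
    } finally {
      sc.stop()
    }
  }
}
```

Only `n` and `numSlices` are kept as fields of the RDD, so the class stays serializable when Spark ships it to executors; the `SparkContext` is passed to the superclass constructor and not stored.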

