1.1 Spark RDD Storage Mechanism and Properties


Author: 不羁之后_ | Published 2019-07-05 14:59

    The RDD storage mechanism:

    • An RDD's data is distributed across multiple machines and stored on those servers as blocks. Each Executor starts a BlockManagerSlave that manages a subset of the blocks; the blocks' metadata is kept by the BlockManagerMaster on the Driver, and whenever a BlockManagerSlave produces a block, it registers that block with the BlockManagerMaster. An RDD partition is a logical data block that maps to a corresponding physical block. The overall flow is as follows:


      (Figure: RDD data storage mechanism)
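The registration flow above can be sketched as a toy model. This is an illustrative Python sketch, not Spark's actual classes: a driver-side master holds only block metadata, while per-executor slaves hold the block data and register each block with the master on write. All names here are my own.

```python
class BlockManagerMaster:
    """Driver-side registry: stores only block metadata (block id -> executor)."""
    def __init__(self):
        self.block_locations = {}            # block_id -> executor_id

    def register_block(self, block_id, executor_id):
        self.block_locations[block_id] = executor_id

    def locate(self, block_id):
        return self.block_locations.get(block_id)


class BlockManagerSlave:
    """Executor-side store: holds the actual block data and reports to the master."""
    def __init__(self, executor_id, master):
        self.executor_id = executor_id
        self.master = master
        self.blocks = {}                     # block_id -> data

    def put_block(self, block_id, data):
        self.blocks[block_id] = data
        # register the new block with the master, as described above
        self.master.register_block(block_id, self.executor_id)


master = BlockManagerMaster()
slave = BlockManagerSlave("executor-1", master)
slave.put_block("rdd_0_0", [1, 2, 3])        # partition 0 of RDD 0
print(master.locate("rdd_0_0"))              # -> executor-1
```

Note that the master never sees the data itself, only where each block lives; a task that needs `rdd_0_0` asks the master for its location and fetches it from the owning executor.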

    1. A list of partitions

      /**
       * Implemented by subclasses to return the set of partitions in this RDD. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       *
       * The partitions in this array must satisfy the following property:
       *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
       */
      protected def getPartitions: Array[Partition]
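The invariant in the doc comment (each partition's `index` equals its position in the array) can be illustrated with a minimal Python sketch; the class and method names below are my own, not Spark's API.

```python
class Partition:
    """A logical slice of the RDD, identified by its index."""
    def __init__(self, index):
        self.index = index


class SimpleRDD:
    def __init__(self, data, num_partitions):
        self.data = data
        self.num_partitions = num_partitions

    def get_partitions(self):
        # In real Spark this is computed once and cached; the invariant is
        # that partition.index == its position in the returned array.
        return [Partition(i) for i in range(self.num_partitions)]


rdd = SimpleRDD(list(range(10)), 4)
parts = rdd.get_partitions()
assert all(p.index == i for i, p in enumerate(parts))
```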
    

    2. Dependencies on other RDDs (lineage)

      /**
       * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
       * be called once, so it is safe to implement a time-consuming computation in it.
       */
      protected def getDependencies: Seq[Dependency[_]] = deps
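Lineage can be sketched in a few lines of Python: each derived RDD records a dependency on its parent, so the chain can be walked back to the source. This is an illustrative model with invented names, not Spark's API.

```python
class Dependency:
    """Records that a child RDD was derived from this parent."""
    def __init__(self, parent):
        self.parent = parent


class RDD:
    def __init__(self, deps=()):
        self.deps = list(deps)

    def get_dependencies(self):
        return self.deps

    def map(self):
        # the child RDD depends on self: one new entry in its lineage
        return RDD(deps=[Dependency(self)])


source = RDD()
mapped = source.map().map()

# walk the lineage from the final RDD back to the source
lineage = []
node = mapped
while node.get_dependencies():
    node = node.get_dependencies()[0].parent
    lineage.append(node)
print(len(lineage))  # -> 2 (the intermediate map and the source)
```

This lineage is what lets Spark recompute a lost partition: replay the chain of dependencies from the nearest surviving ancestor.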
    

    3. A compute function applied to each partition

      /**
       * :: DeveloperApi ::
       * Implemented by subclasses to compute a given partition.
       */
      @DeveloperApi
      def compute(split: Partition, context: TaskContext): Iterator[T]
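The idea that one function is evaluated per partition, returning an iterator over that partition's records, can be sketched as follows; this Python model and its names are my own, not Spark's API.

```python
class Partition:
    def __init__(self, index):
        self.index = index


class RangeRDD:
    """A toy RDD where partition i holds the i-th slice of [0, n)."""
    def __init__(self, n, num_partitions):
        self.n = n
        self.num_partitions = num_partitions

    def compute(self, split, context=None):
        # one function evaluated per partition, yielding its slice lazily
        size = self.n // self.num_partitions
        start = split.index * size
        # the last partition absorbs the remainder
        end = self.n if split.index == self.num_partitions - 1 else start + size
        return iter(range(start, end))


rdd = RangeRDD(10, 3)
print(list(rdd.compute(Partition(1))))  # -> [3, 4, 5]
```

Because `compute` returns an iterator rather than a materialized collection, records can flow through a chain of narrow transformations without buffering the whole partition.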
    

    4. A partitioner for key-value RDDs

      /** Optionally overridden by subclasses to specify how they are partitioned. */
      @transient val partitioner: Option[Partitioner] = None
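A hash-based partitioner maps each key to a partition by hashing it modulo the partition count, which mirrors the idea behind Spark's HashPartitioner. The Python sketch below uses my own names and is only illustrative.

```python
class HashPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        # identical keys always map to the same partition
        return hash(key) % self.num_partitions


p = HashPartitioner(4)
pairs = [("a", 1), ("b", 2), ("a", 3)]
placed = [(p.get_partition(k), k, v) for k, v in pairs]
# all values for key "a" end up in the same partition
assert placed[0][0] == placed[2][0]
```

Co-locating equal keys this way is what makes shuffles for `reduceByKey`-style operations possible: each partition can be reduced independently.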
    

    5. A list of preferred locations for each partition

      /**
       * Optionally overridden by subclasses to specify placement preferences.
       * Optional: takes a split (partition) as input and returns a list of
       * preferred node locations, e.g. the locations of an HDFS block.
       */
      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
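Preferred locations express data locality: for an HDFS-backed RDD, each partition's preferred hosts are the nodes storing the corresponding HDFS block, so the scheduler can run the task near the data. The Python sketch below is illustrative; the class names and the host map are made up.

```python
class Partition:
    def __init__(self, index):
        self.index = index


class HdfsBackedRDD:
    def __init__(self, block_hosts):
        # partition index -> hosts storing the corresponding HDFS block
        # (hypothetical hostnames, for illustration only)
        self.block_hosts = block_hosts

    def get_preferred_locations(self, split):
        # empty list means "no preference", like the Nil default above
        return self.block_hosts.get(split.index, [])


rdd = HdfsBackedRDD({0: ["node1", "node2"], 1: ["node3"]})
print(rdd.get_preferred_locations(Partition(0)))  # -> ['node1', 'node2']
```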
    


        Original post: https://www.haomeiwen.com/subject/ipyshctx.html