Spark RDD in Detail

Author: 由木人_番茄 | Published: 2019-03-23 23:01

    Spark RDD

    An RDD in Spark has five main properties:

    • A list of partitions
    • A function (compute) for computing each split (partition)
    • A list of dependencies on other RDDs
    • Optionally, a partitioner for key-value RDDs
    • Optionally, a list of preferred locations to compute each split on

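    Among these, a partition is the basic unit of a data set. Each partition is processed by one compute task, so the number of partitions determines the granularity of parallelism. The number of partitions can be specified when the RDD is created; otherwise the default, spark.default.parallelism, is used. In local mode the default is the n in local[n], in Mesos fine-grained mode it is 8, and in other modes it is max(coreNumber, 2), where coreNumber is the number of CPU cores allocated to the application (virtual cores, not necessarily physical ones).

    compute is the function that computes each partition. It composes iterators, so the result of each intermediate step does not need to be stored.

    Dependencies are the relationships between RDDs; in effect they hold the information about all parent RDDs. These dependencies describe the lineage between RDDs, and Spark uses this information to divide a job into stages.

    The Partitioner is the RDD's partitioning function. Spark currently implements two kinds: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non key-value RDDs its value is None. The Partitioner determines both the number of partitions of the RDD itself and the number of output partitions when the parent RDD is shuffled.

    Preferred locations store the preferred location of each partition. For an HDFS file this list holds the locations of the blocks that back the partition. Following the principle that moving computation is cheaper than moving data, Spark tries to schedule each task on the node that stores the data block it needs.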

    The RDD class has five members that correspond to these properties:

    /**
    * Computes the data of the given partition
    */
    @DeveloperApi
    def compute(split: Partition, context: TaskContext): Iterator[T]
    /**
    * Defines how the data is split into partitions
    */
    protected def getPartitions: Array[Partition]
    /**
    * The dependencies of this RDD, i.e. its parent RDDs
    */
    protected def getDependencies: Seq[Dependency[_]] = deps
    /**
    * The preferred locations of each partition of this RDD
    */
    protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    /**
    * The partitioner of a key-value RDD
    */
    @transient val partitioner: Option[Partitioner] = None
    

    Compute

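    The abstract class RDD requires every subclass to implement compute. The method takes a Partition and a TaskContext and computes the data of that partition.
    Below is BlockRDD's implementation of compute. It simply fetches the data for the partition's block from the blockManager.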

    override def compute(split: Partition, context: TaskContext): Iterator[T] = {
            assertValid()
            val blockManager = SparkEnv.get.blockManager
            val blockId = split.asInstanceOf[BlockRDDPartition].blockId
            blockManager.get[T](blockId) match {
              case Some(block) => block.data.asInstanceOf[Iterator[T]]
              case None =>
                throw new Exception(s"Could not compute split, block $blockId of RDD $id not found")
            }
          }
    

    compute in MapPartitionsRDD is implemented as follows:

    override def compute(split: Partition, context: TaskContext): Iterator[U] =
            f(context, split.index, firstParent[T].iterator(split, context))
        
          override def clearDependencies() {
            super.clearDependencies()
            prev = null
          }
    

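    MapPartitionsRDD's compute calls the iterator method of the RDD's first parent. iterator pulls the data of the corresponding parent partition and returns an Iterator whose elements are that partition's records. The function f is then applied to the partition as a whole (not record by record), and the result is a new iterator over all transformed records, which forms the new partition.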
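
    The way a child's compute wraps its parent's iterator can be illustrated without Spark. The following is a minimal sketch in plain Scala; MiniRDD, SourceRDD and MappedPartitionsRDD are hypothetical stand-ins invented for this example (not Spark classes), and a partition is reduced to an integer index.

```scala
// Toy model only: none of these types are Spark classes.
object MiniRddDemo {
  // A partition is identified by its index; compute returns its records.
  trait MiniRDD[T] {
    def numPartitions: Int
    def compute(split: Int): Iterator[T]
  }

  // Leaf RDD: each partition is backed by an in-memory sequence.
  class SourceRDD[T](data: Seq[Seq[T]]) extends MiniRDD[T] {
    def numPartitions: Int = data.length
    def compute(split: Int): Iterator[T] = data(split).iterator
  }

  // Child RDD: applies f to the parent's iterator for the same partition,
  // in the spirit of f(context, split.index, firstParent[T].iterator(split, context)).
  class MappedPartitionsRDD[T, U](
      parent: MiniRDD[T],
      f: (Int, Iterator[T]) => Iterator[U]) extends MiniRDD[U] {
    def numPartitions: Int = parent.numPartitions
    def compute(split: Int): Iterator[U] = f(split, parent.compute(split))
  }

  val source = new SourceRDD[Int](Seq(Seq(1, 2, 3), Seq(4, 5)))
  val doubled = new MappedPartitionsRDD[Int, Int](source, (_, it) => it.map(_ * 2))
  val evens = new MappedPartitionsRDD[Int, Int](doubled, (_, it) => it.filter(_ % 4 == 0))

  // Materializes every partition; iterators are only consumed here.
  def collect[T](rdd: MiniRDD[T]): Seq[Seq[T]] =
    (0 until rdd.numPartitions).map(i => rdd.compute(i).toList)
}
```

    Calling MiniRddDemo.collect(MiniRddDemo.evens) walks one composed iterator per partition, so the doubled values are never stored as an intermediate collection.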

    Partition

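    A partition (split) is a logical chunk of a large distributed data set.
    The raw data behind an RDD is split into N pieces according to some rule, and each piece corresponds to one Partition of the RDD. The number of partitions determines the number of tasks and therefore the degree of parallelism. Managing data by partition makes parallel processing straightforward and reduces network traffic between executors; Spark usually reads data from nearby nodes.
    getPartitions (def getPartitions: Array[Partition]) defines an RDD's partitions, but it is protected; from user code, use someRDD.partitions to get them and someRDD.partitions.size for their number.
    Spark runs at most one concurrent task per RDD partition, up to the number of cores in the cluster. For a cluster with 50 cores, the RDD should therefore have at least 50 partitions (and probably 2-3 times that, i.e. 100-150).
    The number of partitions also determines how many files are created when the RDD is saved, and the size of each partition is limited by executor memory.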

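    Tip: when sc.textFile reads a gzip-compressed file (file.txt.gz, demo.gz), the resulting RDD has only one partition, because gzip files cannot be split. In that case, call repartition explicitly.

    val rdd = sc.textFile("demo.gz")
    val nrdd = rdd.repartition(100)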
    

    Definition of Partition

    The Spark source defines Partition as follows:

    /**
     * An identifier for a partition in an RDD.
     */
    trait Partition extends Serializable {
      /**
       * Sequence number, starting at 0 and increasing by 1
       * Get the partition's index within its parent RDD
       */
      def index: Int
    
      // A better default implementation of HashCode
      override def hashCode(): Int = index
    
      override def equals(other: Any): Boolean = super.equals(other)
    }
    

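    Partitions and RDDs go hand in hand: each kind of RDD has its own Partition implementation, so studying Partition mostly means studying its subclasses. For example, HadoopPartition is implemented as follows: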

    /**
     * A Spark split class that wraps around a Hadoop InputSplit.
     */
    private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: InputSplit)
      extends Partition {
    
      val inputSplit = new SerializableWritable[InputSplit](s)
    
      override def hashCode(): Int = 31 * (31 + rddId) + index
    
      override def equals(other: Any): Boolean = super.equals(other)
    
      /**
       * Get any environment variables that should be added to the users environment when running pipes
       * @return a Map with the environment variables and corresponding values, it could be empty
       */
      def getPipeEnvVars(): Map[String, String] = {
          ...
      }
    }
    

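    How partitions affect performance

    1. Too few partitions

      Cluster resources are not fully used.

    2. Too many partitions

      Too many tasks are created, which increases serialization and transfer overhead.

      The Spark documentation recommends: Typically you want 2-4 partitions for each CPU in your cluster.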
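
    As a small worked example of the 2-4 partitions per CPU guideline, the sketch below computes the suggested range for a given core count. The object and method names are made up for illustration, and the result is only a starting point for tuning.

```scala
object PartitionCountDemo {
  // Range of partition counts suggested by the "2-4 partitions per CPU" guideline.
  def suggestedPartitions(totalCores: Int): Range = {
    require(totalCores > 0, "totalCores must be positive")
    (2 * totalCores) to (4 * totalCores)
  }
}
```

    For a cluster with 50 cores, PartitionCountDemo.suggestedPartitions(50) spans 100 to 200 partitions.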

    Adjusting partitions

    1. repartition

      repartition is coalesce(numPartitions, shuffle = true). It changes the number of partitions and always triggers a shuffle, in which records are redistributed with a HashPartitioner.

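       /**
          * Returns a new RDD that has numPartitions partitions.
          *
          * Can increase or decrease the parallelism of the RDD; a shuffle is performed accordingly.
          *
          * If you are reducing the number of partitions, coalesce gives better performance.
          */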
       def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
           coalesce(numPartitions, shuffle = true)
       }
    
    2. coalesce

      coalesce lets you choose whether to shuffle. With shuffle = false it can only decrease the number of partitions, not increase it.

       /**
       * Returns a new RDD that has numPartitions partitions.
       * Decreasing the number of partitions does not require a shuffle.
       * To increase the number of partitions, set shuffle = true.
       */
       def coalesce(numPartitions: Int, shuffle: Boolean = false,
                      partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
                     (implicit ord: Ordering[T] = null)
             : RDD[T] = withScope {
           require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
           if (shuffle) {
             /** Distributes elements evenly across output partitions, starting from a random partition. */
             val distributePartition = (index: Int, items: Iterator[T]) => {
               var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
               items.map { t =>
                 // Note that the hash code of the key will just be the key itself. The HashPartitioner
                 // will mod it with the number of total partitions.
                 position = position + 1
                 (position, t)
               }
             } : Iterator[(Int, T)]
       
             // include a shuffle step so that our upstream tasks are still distributed
             new CoalescedRDD(
               new ShuffledRDD[Int, T, T](
                 mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
                 new HashPartitioner(numPartitions)),
               numPartitions,
               partitionCoalescer).values
           } else {
             new CoalescedRDD(this, numPartitions, partitionCoalescer)
           }
         }
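
    The shuffle branch above spreads the records of each input partition round-robin over the output partitions, starting from a pseudo-random position. The sketch below reproduces only that key assignment in plain Scala; the object and method names are invented for the example, and the modulo stands in for what HashPartitioner does with a non-negative Int key.

```scala
import scala.util.Random
import scala.util.hashing

object CoalesceShuffleDemo {
  // Output partition chosen for each of itemCount records of input partition `index`.
  def targets(index: Int, itemCount: Int, numPartitions: Int): Seq[Int] = {
    // Same seeding idea as distributePartition: a start position derived from the index.
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    (0 until itemCount).map { _ =>
      position += 1
      // An Int key hashes to itself and position is non-negative here,
      // so a plain modulo gives the target partition.
      position % numPartitions
    }
  }
}
```

    With 100 records and 4 output partitions, every output partition receives exactly 25 records whatever the starting position is, which is the even distribution the source comment describes.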
       
    

    Dependency

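    Dependency records how partitions evolve as an RDD goes through transformations. The type of a dependency tells Spark how to process the data: by pipelining or by shuffling.
    Spark has two kinds of dependency, Narrow Dependency and Wide Dependency. In a Narrow Dependency, each partition of the parent RDD is used by at most one partition of the child RDD, so no shuffle is needed and the work can be done within a single map task. In a Wide Dependency, one partition of the parent RDD may feed several partitions of the child RDD, so a shuffle is required. Wide dependencies are what the DAGScheduler uses to split a job into stages.

    Definition of Dependency

    The rdd property refers to the parent RDD. A Dependency is essentially a wrapper around the parent RDD, and its type describes how the data is processed by the current transformation.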

        @DeveloperApi
        abstract class Dependency[T] extends Serializable {
          def rdd: RDD[T]
        }
    

    Narrow Dependency

    
    @DeveloperApi
        abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
          /**
           * Returns the parent RDD partition IDs for a given child partition ID
           * @param partitionId a partition ID of the child RDD
           * @return the partitions of the parent RDD that the child partition depends upon
           */
          def getParents(partitionId: Int): Seq[Int]
        
          override def rdd: RDD[T] = _rdd
        }
    
    1. OneToOneDependency

    OneToOneDependency represents a one-to-one relationship between the partitions of the child RDD and the parent RDD: a child partition ID is identical to its parent partition ID.

    @DeveloperApi
    class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
          override def getParents(partitionId: Int): List[Int] = List(partitionId)
        }
    
    2. RangeDependency

    RangeDependency represents a one-to-one relationship between ranges of partitions in the parent and child RDDs.

        /**
         * :: DeveloperApi ::
         * For example, with child RDD partition indices   3 4 5 6
         *           and parent RDD partition indices     9 10 11 12
         * the parent partition for child partition index 4 is
         *  4 - 3 + 9 = 10
         * which corresponds to partitionId - outStart + inStart in the code.
         *
         * @param rdd the parent RDD
         * @param inStart the start of the partition range in the parent RDD
         * @param outStart the start of the partition range in the child RDD
         * @param length the length of the range
         */
        @DeveloperApi
        class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
          extends NarrowDependency[T](rdd) {
        
          override def getParents(partitionId: Int): List[Int] = {
            if (partitionId >= outStart && partitionId < outStart + length) {
              List(partitionId - outStart + inStart)
            } else {
              Nil
            }
          }
        }
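
    The index arithmetic of RangeDependency can be checked in isolation. The sketch below copies the body of getParents into a standalone function (the object name is made up for this example) and uses the numbers from the comment above: child partitions 3 to 6 map to parent partitions 9 to 12.

```scala
object RangeDependencyDemo {
  // Same logic as RangeDependency.getParents, with the constructor
  // arguments passed explicitly so that no Spark classes are needed.
  def getParents(partitionId: Int, inStart: Int, outStart: Int, length: Int): List[Int] =
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
}
```

    With inStart = 9, outStart = 3 and length = 4, child partition 4 resolves to parent partition 10, and a child partition outside the range (such as 7) has no parent in this dependency.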
    

    ShuffleDependency

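    ShuffleDependency is more involved. A shuffle transfers data over the network, so a Serializer is needed to reduce the amount of data sent. Map-side aggregation can be enabled and is controlled by mapSideCombine and aggregator. keyOrdering governs how keys are sorted, and the Partitioner decides how the shuffle output is partitioned. Other fields hold the class names of the key, value, and combiner, plus the shuffleId. The direct relationship between partitions ends at a shuffle, which is why shuffles mark stage boundaries.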

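    /**
         * :: DeveloperApi ::
         * Represents a dependency on the output of a shuffle stage. In the shuffle case the RDD is transient, because it is not needed on the executor side.
         *
         * @param _rdd the parent RDD
         * @param partitioner partitioner used to partition the shuffle output
         * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. Can be configured with spark.serializer
         * @param keyOrdering how the keys of the shuffle result are ordered
         * @param aggregator map/reduce-side aggregator for RDD's shuffle
         * @param mapSideCombine whether to perform map-side aggregation
         */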
        @DeveloperApi
        class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
            @transient private val _rdd: RDD[_ <: Product2[K, V]],
            val partitioner: Partitioner,
            val serializer: Serializer = SparkEnv.get.serializer,
            val keyOrdering: Option[Ordering[K]] = None,
            val aggregator: Option[Aggregator[K, V, C]] = None,
            val mapSideCombine: Boolean = false)
          extends Dependency[Product2[K, V]] {
        
          override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
        
          private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
          private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
          // Note: It's possible that the combiner class tag is null, if the combineByKey
          // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
          private[spark] val combinerClassName: Option[String] =
            Option(reflect.classTag[C]).map(_.runtimeClass.getName)
        
          val shuffleId: Int = _rdd.context.newShuffleId()
    
          val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle( shuffleId, _rdd.partitions.length, this)
         _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
        }
    

    Partitioner

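    The Partitioner takes effect in the shuffle phase. When the map side writes its results, it uses the RDD's Partitioner to write them into different buckets, and each bucket is later consumed by a different reducer.

    Definition of Partitioner

    The abstract class Partitioner has two methods: numPartitions returns the number of partitions, and getPartition returns the ID of the partition an element should be written to, based on the element's key.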

    /**
         * Defines how the elements of a key-value RDD are partitioned by key.
         * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
         *
         * A partitioner must be deterministic: it must return the same partition ID for the same partition key.
         */
        abstract class Partitioner extends Serializable {
          def numPartitions: Int
          def getPartition(key: Any): Int
        }
    

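    The companion object of Partitioner defines defaultPartitioner, which decides how operations such as cogroup choose a Partitioner from their parent RDDs.
    The strategy is as follows. Among the input RDDs that already have a Partitioner, take the one with the most partitions. If its partition count is within one order of magnitude of the largest upstream partition count (or exceeds it), or if it is larger than the default number of partitions, that Partitioner is used. Otherwise a HashPartitioner with the default number of partitions is created.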

    object Partitioner {
        def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
            val rdds = (Seq(rdd) ++ others)
            val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
        
            val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
              Some(hasPartitioner.maxBy(_.partitions.length))
            } else {
              None
            }
    
            val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
              rdd.context.defaultParallelism
            } else {
              rdds.map(_.partitions.length).max
            }
    
            // If the existing max partitioner is an eligible one, or its partitions number is larger
            // than the default number of partitions, use the existing partitioner.
            if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
                defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
              hasMaxPartitioner.get.partitioner.get
            } else {
              new HashPartitioner(defaultNumPartitions)
            }
          }
    
          /**
           * Returns true if the number of partitions of the RDD is either greater than or is less than and
           * within a single order of magnitude of the max number of upstream partitions, otherwise returns
           * false.
           */
          private def isEligiblePartitioner(
             hasMaxPartitioner: RDD[_],
             rdds: Seq[RDD[_]]): Boolean = {
            val maxPartitions = rdds.map(_.partitions.length).max
            log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
          }
    }
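
    The eligibility rule is easier to see with concrete numbers. The sketch below restates the two conditions from defaultPartitioner as plain functions over partition counts; the object and method names are invented for this example and no RDDs are involved.

```scala
object EligiblePartitionerDemo {
  // Mirrors isEligiblePartitioner: the existing partitioner is eligible when its
  // partition count is less than one order of magnitude below the upstream maximum.
  def isEligible(maxUpstreamPartitions: Int, partitionerPartitions: Int): Boolean =
    math.log10(maxUpstreamPartitions) - math.log10(partitionerPartitions) < 1

  // Mirrors the final if-condition: reuse the existing partitioner when it is
  // eligible or when it has more partitions than the default number.
  def reuseExisting(
      maxUpstreamPartitions: Int,
      partitionerPartitions: Int,
      defaultNumPartitions: Int): Boolean =
    isEligible(maxUpstreamPartitions, partitionerPartitions) ||
      defaultNumPartitions < partitionerPartitions
}
```

    With 1000 upstream partitions, a partitioner with 200 partitions is eligible, while one with 50 partitions is not and is only reused if the default partition count is below 50.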
    

    Spark provides two concrete implementations of Partitioner: HashPartitioner and RangePartitioner.

    HashPartitioner

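    numPartitions returns the partition count passed to the constructor, and getPartition takes the key's hashCode modulo numPartitions (adjusted to be non-negative) to obtain the partition ID.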

    class HashPartitioner(partitions: Int) extends Partitioner {
          require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
        
          def numPartitions: Int = partitions
    
          def getPartition(key: Any): Int = key match {
            case null => 0
            case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
          }
    
          override def equals(other: Any): Boolean = other match {
            case h: HashPartitioner =>
              h.numPartitions == numPartitions
            case _ =>
              false
          }
        
          override def hashCode: Int = numPartitions
        }
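
    The reason for a non-negative modulo is that hashCode can be negative, and the % operator on the JVM keeps the sign of the dividend. The sketch below is a standalone illustration; nonNegativeMod here is a local helper written for this example on the assumption that Utils.nonNegativeMod behaves this way, and the object name is made up.

```scala
object HashPartitionDemo {
  // Local helper: maps any Int into [0, mod), unlike plain %, which can be negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Same shape as HashPartitioner.getPartition: null keys go to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }
}
```

    For example, -7 % 4 is -3 on the JVM, whereas HashPartitionDemo.nonNegativeMod(-7, 4) yields 1, a valid partition ID.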
    

    RangePartitioner

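    The core of RangePartitioner is how it chooses the boundaries between partitions.
    Spark uses these boundaries to decide which partition a key belongs to. Keys are unordered within a partition but ordered across partitions.

    1. Sample each partition using reservoir sampling.
    2. Compute weights, and resample partitions that hold too much data (those where fraction * n exceeds sampleSizePerPartition).
    3. Compute the partition boundaries, rangeBounds, from the weighted samples.
    4. Use rangeBounds to derive the number of partitions and the partition of each key.

    /**
      * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in.
      * The actual number of partitions created by the RangePartitioner may differ from the `partitions` parameter when the number of sampled records is less than `partitions`.
      * @param partitions number of partitions
      * @param rdd the RDD
      * @param ascending ascending or descending order
      * @param samplePointsPerPartitionHint number of sample points per partition
      */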
    class RangePartitioner[K : Ordering : ClassTag, V](
            partitions: Int,
            rdd: RDD[_ <: Product2[K, V]],
            private var ascending: Boolean = true,
            val samplePointsPerPartitionHint: Int = 20)
          extends Partitioner {
        
          def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
            this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
          }
        
          // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
          require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")
          require(samplePointsPerPartitionHint > 0,
            s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")
        
          private var ordering = implicitly[Ordering[K]]
    

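    RangePartitioner's numPartitions and getPartition are implemented as follows. getPartition uses a linear scan or a binary search to find the partition for a key; rangeBounds holds the upper bound of every partition except the last, i.e. the boundaries.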

          // Number of partitions; rangeBounds.length + 1 may differ from `partitions`
          def numPartitions: Int = rangeBounds.length + 1
          // Binary search strategy
          private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]
          // Returns the partition index for the given key
          def getPartition(key: Any): Int = {
            val k = key.asInstanceOf[K]
            var partition = 0
            if (rangeBounds.length <= 128) {
              // With at most 128 bounds, use a linear scan; otherwise use binary search
              while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
                partition += 1
              }
            } else {
              // Determine which binary search method to use only once.
              partition = binarySearch(rangeBounds, k)
              // binarySearch either returns the match location or -[insertion point]-1
              if (partition < 0) {
                partition = -partition-1
              }
              if (partition > rangeBounds.length) {
                partition = rangeBounds.length
              }
            }
            // Return the partition ID according to ascending or descending order
            if (ascending) {
              partition
            } else {
              rangeBounds.length - partition
            }
          }
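
    To make the lookup concrete, the sketch below runs both search strategies over a small array of Int bounds. It is a simplified stand-in, not the Spark code: keys are fixed to Int, java.util.Arrays.binarySearch replaces CollectionsUtils.makeBinarySearch, and the object and method names are invented for the example.

```scala
object RangeLookupDemo {
  // Linear scan: advance while the key is greater than the current upper bound.
  def linear(key: Int, rangeBounds: Array[Int]): Int = {
    var partition = 0
    while (partition < rangeBounds.length && key > rangeBounds(partition)) {
      partition += 1
    }
    partition
  }

  // Binary search: returns the match location or -(insertion point) - 1.
  def binary(key: Int, rangeBounds: Array[Int]): Int = {
    var partition = java.util.Arrays.binarySearch(rangeBounds, key)
    if (partition < 0) partition = -partition - 1
    if (partition > rangeBounds.length) partition = rangeBounds.length
    partition
  }

  def getPartition(key: Int, rangeBounds: Array[Int], ascending: Boolean = true): Int = {
    val partition =
      if (rangeBounds.length <= 128) linear(key, rangeBounds) else binary(key, rangeBounds)
    if (ascending) partition else rangeBounds.length - partition
  }
}
```

    With bounds Array(10, 20, 30) there are four partitions: keys up to 10 go to partition 0, keys in (10, 20] to partition 1, and keys above 30 to partition 3. In descending mode the IDs are mirrored, so a key of 5 lands in partition 3.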
    

    The boundaries between partitions are computed as follows:

      // An array of upper bounds for the first (partitions - 1) partitions
          private var rangeBounds: Array[K] = {
            if (partitions <= 1) {
              Array.empty
            } else {
              // Total sample size, capped at 1M
              // Cast to double to avoid overflowing ints or longs
              val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
              // Assume the partitions are roughly balanced and over-sample a little, so that enough samples are collected even with few partitions
              val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
              // Sample each partition
              val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
              if (numItems == 0L) {
                Array.empty
              } else {
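                // If a partition contains far more than the average number of items (records in the partition * fraction > sampleSizePerPartition), re-sample from it
                // to ensure that enough samples are collected from that partition.
                val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
                val candidates = ArrayBuffer.empty[(K, Float)]
                // Track such partitions in imbalancedPartitions and sample them again later, so that large partitions contribute enough samples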
                val imbalancedPartitions = mutable.Set.empty[Int]
                sketched.foreach { case (idx, n, sample) =>
                  if (fraction * n > sampleSizePerPartition) {
                    imbalancedPartitions += idx
                  } else {
                    // The weight is 1 over the sampling probability.
                    val weight = (n.toDouble / sample.length).toFloat
                    for (key <- sample) {
                      candidates += ((key, weight))
                    }
                  }
                }
                if (imbalancedPartitions.nonEmpty) {
                  // Re-sample the imbalanced partitions with the desired sampling probability, adjusting the weight accordingly
                  val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
                  val seed = byteswap32(-rdd.id - 1)
                  val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
                  // Adjust the weight
                  val weight = (1.0 / fraction).toFloat
                  candidates ++= reSampled.map(x => (x, weight))
                }
                // Choose the bounds from the (key, weight) candidates
                RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
              }
            }
          }
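
    The sample-size arithmetic in rangeBounds is worth working through with numbers. The sketch below restates the three formulas as standalone functions; the object and method names are invented for this example, and the default hint of 20 is taken from the constructor shown earlier.

```scala
object SampleSizeDemo {
  // Total number of samples to draw, capped at 1e6.
  def sampleSize(partitions: Int, hint: Int = 20): Double =
    math.min(hint.toDouble * partitions, 1e6)

  // Samples drawn from each input partition, over-sampled by a factor of 3.
  def sampleSizePerPartition(partitions: Int, inputPartitions: Int, hint: Int = 20): Int =
    math.ceil(3.0 * sampleSize(partitions, hint) / inputPartitions).toInt

  // Overall sampling probability given the total number of items.
  def fraction(partitions: Int, numItems: Long, hint: Int = 20): Double =
    math.min(sampleSize(partitions, hint) / math.max(numItems, 1L), 1.0)

  // An input partition with n items is re-sampled when fraction * n > sampleSizePerPartition.
  def isImbalanced(n: Long, numItems: Long, partitions: Int, inputPartitions: Int): Boolean =
    fraction(partitions, numItems) * n > sampleSizePerPartition(partitions, inputPartitions)
}
```

    For 10 target partitions and an input RDD with 4 partitions, sampleSize is 200 and sampleSizePerPartition is 150. If the input holds 10000 items in total, fraction is 0.02, so an input partition with 9000 items (0.02 * 9000 = 180 > 150) is treated as imbalanced, while one with 2000 items is not.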
    

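    The sampling method sketch derives a seed from the RDD's id and the partition index (used to generate random numbers in reservoir sampling) and then calls SamplingUtils.reservoirSampleAndCount to draw the sample.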

    
    private[spark] object RangePartitioner {
          /**
           * Core sampling algorithm: reservoir sampling
           * Sketches the input RDD via reservoir sampling on each partition.
           *
           * @param rdd the input RDD to sketch
           * @param sampleSizePerPartition max sample size per partition
           * @return (total number of items, an array of (partitionId, number of items, sample))
           */
          def sketch[K : ClassTag](
              rdd: RDD[K],
              sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
            val shift = rdd.id
            // val classTagK = classTag[K] 
            // to avoid serializing the entire partitioner object
            val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
              val seed = byteswap32(idx ^ (shift << 16))
              val (sample, n) = SamplingUtils.reservoirSampleAndCount(
                iter, sampleSizePerPartition, seed)
              Iterator((idx, n, sample))
            }.collect()
            val numItems = sketched.map(_._2).sum
            (numItems, sketched)
          }
    

    reservoir sampling

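    1. Fill the reservoir array with the first k records. If there are fewer than k records, return them directly as the sample.
    2. Otherwise create a random number generator from seed and keep iterating. For the l-th record, draw a random index in [0, l); if the index is less than k, the record replaces the sample stored at that index.

          /**
           * Reservoir sampling implementation that also returns the input size.
           *
           * @param input input iterator
           * @param k reservoir size (number of samples per partition)
           * @param seed random seed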
           * @return (samples, input size)
           */
          def reservoirSampleAndCount[T: ClassTag](
              input: Iterator[T],
              k: Int,
              seed: Long = Random.nextLong())
            : (Array[T], Long) = {
            val reservoir = new Array[T](k)
            // Put the first k records into the reservoir; k is the sample size per partition
            var i = 0
            while (i < k && input.hasNext) {
              val item = input.next()
              reservoir(i) = item
              i += 1
            }
        
            // If the partition has fewer than k records, return them directly; otherwise replace entries at random
            if (i < k) {
              // If input size < k, trim the array to return only an array of input size.
              val trimReservoir = new Array[T](i)
              System.arraycopy(reservoir, 0, trimReservoir, 0, i)
              (trimReservoir, i)
            } else {
              // If input size > k, continue the sampling process.
              var l = i.toLong
              val rand = new XORShiftRandom(seed)
              while (input.hasNext) {
                val item = input.next()
                l += 1
                // There are k elements in the reservoir, and the l-th element has been
                // consumed. It should be chosen with probability k/l. The expression
                // below is a random long chosen uniformly from [0,l)
                val replacementIndex = (rand.nextDouble() * l).toLong
                if (replacementIndex < k) {
                  reservoir(replacementIndex.toInt) = item
                }
              }
              (reservoir, l)
            }
          }
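
    The algorithm can be tried outside Spark. The sketch below is a near copy of the method above with one stated substitution: scala.util.Random is used in place of Spark's XORShiftRandom, so the concrete samples will differ from what Spark produces. The object name is made up for this example.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ReservoirDemo {
  // Returns (sample of at most k items, number of items seen).
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T],
      k: Int,
      seed: Long): (Array[T], Long) = {
    val reservoir = new Array[T](k)
    var i = 0
    // Step 1: fill the reservoir with the first k items.
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }
    if (i < k) {
      // Fewer than k items: the whole input is the sample.
      (reservoir.take(i), i.toLong)
    } else {
      // Step 2: the l-th item replaces a random slot with probability k/l.
      var l = i.toLong
      val rand = new Random(seed) // stand-in for XORShiftRandom
      while (input.hasNext) {
        val item = input.next()
        l += 1
        val replacementIndex = (rand.nextDouble() * l).toLong
        if (replacementIndex < k) {
          reservoir(replacementIndex.toInt) = item
        }
      }
      (reservoir, l)
    }
  }
}
```

    Sampling 10 items from (1 to 1000).iterator returns an array of 10 distinct values together with the count 1000, and sampling 10 items from a 3-element input returns all 3 elements with the count 3.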
    

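    determineBounds returns the partition boundaries. It works as follows:
    1. Sort the candidates (Array[(key, weight)]) by key.
    2. Compute the total weight, sumWeights.
    3. Divide the total weight by the number of partitions to get the average weight per partition, step.
    4. Iterate over the sorted candidates in a while loop, accumulating weights into cumWeight; each time the accumulated weight reaches the current target (a multiple of step), take the current key as a boundary.
    5. Return all collected boundaries; this is the value assigned to rangeBounds.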

          /**
           * Computes the partition boundaries.
           * Determines the bounds for range partitioning from candidates with weights indicating how many
           * items each represents. Usually this is 1 over the probability used to sample this candidate.
           *
           * @param candidates unordered candidates with weights
           * @param partitions number of partitions
           * @return selected bounds
           */
          def determineBounds[K : Ordering : ClassTag](
              candidates: ArrayBuffer[(K, Float)],
              partitions: Int): Array[K] = {
            val ordering = implicitly[Ordering[K]]
            // Sort the candidates by key
            val ordered = candidates.sortBy(_._1)
            val numCandidates = ordered.size
            // Compute the total weight
            val sumWeights = ordered.map(_._2.toDouble).sum
            // Average weight per partition
            val step = sumWeights / partitions
            var cumWeight = 0.0
            var target = step
            val bounds = ArrayBuffer.empty[K]
            var i = 0
            var j = 0
            var previousBound = Option.empty[K]
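            // Walk the sorted candidates and accumulate their weights in cumWeight. Each time it reaches
            // the current target (a multiple of step), take the key as a boundary; finally return all boundaries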
            while ((i < numCandidates) && (j < partitions - 1)) {
              val (key, weight) = ordered(i)
              cumWeight += weight
              if (cumWeight >= target) {
                // Skip duplicate values.
                if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
                  bounds += key
                  target += step
                  j += 1
                  previousBound = Some(key)
                }
              }
              i += 1
            }
            bounds.toArray
          }
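
    A small worked example shows how the weights turn into boundaries. The sketch below is a simplified version of determineBounds with keys fixed to Int and the two nested conditions merged into one; the object name is made up for this example.

```scala
import scala.collection.mutable.ArrayBuffer

object DetermineBoundsDemo {
  def determineBounds(candidates: Seq[(Int, Float)], partitions: Int): List[Int] = {
    val ordered = candidates.toVector.sortBy(_._1)
    // Average weight that each output partition should hold.
    val step = ordered.map(_._2.toDouble).sum / partitions
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[Int]
    var i = 0
    // Stop once partitions - 1 bounds are found or the candidates run out.
    while (i < ordered.size && bounds.size < partitions - 1) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      // Take the key as a bound when the target is reached, skipping duplicates.
      if (cumWeight >= target && (bounds.isEmpty || key > bounds.last)) {
        bounds += key
        target += step
      }
      i += 1
    }
    bounds.toList
  }
}
```

    With keys 1 to 9, each of weight 1, and 3 partitions, step is 3 and the bounds are 3 and 6, giving the ranges up to 3, 4 to 6, and above 6. If key 3 instead carries weight 4 among five candidates split into 2 partitions, the single bound is 3, because the cumulative weight first reaches the target of 4 at that key.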
    

    Custom Partitioner

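    The main job of a Partitioner is to repartition data during a shuffle. It has to implement two functions:

    • one that returns the number of partitions after repartitioning
    • one that, for a key-value pair, assigns a partition based on the key

    import org.apache.spark.Partitioner

    // Assigns each String key to a partition based on the key's length.
    class MyPartitioner(partitions: Int) extends Partitioner {
      def numPartitions: Int = partitions
      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[String]
        // length is non-negative, so the result is always in [0, partitions)
        k.length % partitions
      }
    }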
    

    It is used as follows:

    val data = sc.textFile("demo.txt")
    val dataAry = data.flatMap(_.split(",")).map((_,1))
                  .partitionBy(new MyPartitioner(10)).reduceByKey(_ + _)
                  .collect()
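
    To see which partition each key would land in without starting a Spark job, the length-based rule can be applied to an ordinary collection. The sketch below only mimics the getPartition logic of MyPartitioner; the object and method names are invented for this example.

```scala
object LengthPartitionDemo {
  // Same rule as MyPartitioner.getPartition, for String keys only.
  def getPartition(key: String, partitions: Int): Int = key.length % partitions

  // Groups words by the partition ID they would be assigned to.
  def assign(words: Seq[String], partitions: Int): Map[Int, Seq[String]] =
    words.groupBy(w => getPartition(w, partitions))
}
```

    With 3 partitions, the words "a", "bb", "ccc", "dd" and "eeee" have lengths 1, 2, 3, 2 and 4, so "ccc" goes to partition 0, "a" and "eeee" to partition 1, and "bb" and "dd" to partition 2.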
    

    PreferredLocations

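    Every partition is computed by one task, so a partition's preferred locations become the task's preferred locations. This is data-locality-aware scheduling, which follows the principle that moving computation is more efficient than moving data.
    Below is HadoopRDD's implementation of getPreferredLocations. It returns the hosts reported by the underlying Hadoop InputSplit; when no location info is available, it falls back to the split's locations with "localhost" entries filtered out.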

     override def getPreferredLocations(split: Partition): Seq[String] = {
            val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
            val locs = hsplit match {
              case lsplit: InputSplitWithLocationInfo =>
                HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
              case _ => None
            }
            locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
          }
    

