[spark] Task Execution Flow

Author: BIGUFO | Published 2017-10-26 09:59

    Preface

    The article TaskScheduler 任务提交与调度源码解析 (a source-level analysis of TaskScheduler task submission and scheduling) covered the logical assignment of tasks to executors: calling TaskSchedulerImpl's resourceOffers() method produces a sequence of sequences of task descriptions, Seq[Seq[TaskDescription]], i.e. a description of which task should run on which executor. At that point the assignment is purely logical; nothing has actually been executed on any executor yet. This article walks through the source code to show how a task is actually dispatched to an executor and run.
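    For reference, a rough sketch of that method's shape in the Spark 2.x source (the parameter type varies slightly across versions, e.g. Seq vs IndexedSeq):

    // TaskSchedulerImpl (Spark 2.x). Each WorkerOffer describes the free resources
    // on one executor; the result holds, per offer, the tasks assigned to it.
    def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]]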

    The Driver Sends the LaunchTask Event

    After resourceOffers has logically assigned the tasks, CoarseGrainedSchedulerBackend calls launchTasks with the resulting Seq[Seq[TaskDescription]]:

    private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
      for (task <- tasks.flatten) {
        // Serialize the TaskDescription
        val serializedTask = ser.serialize(task)
        if (serializedTask.limit >= maxRpcMessageSize) {
          scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
            try {
              var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
                "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
                "spark.rpc.message.maxSize or using broadcast variables for large values."
              msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
              taskSetMgr.abort(msg)
            } catch {
              case e: Exception => logError("Exception in error callback", e)
            }
          }
        } else {
          // Look up the executor's metadata (ExecutorData) by executorId
          val executorData = executorDataMap(task.executorId)
          // Deduct the cores this task will occupy from the executor's freeCores
          executorData.freeCores -= scheduler.CPUS_PER_TASK

          logInfo(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
            s"${executorData.executorHost}.")
          // Use executorData's executorEndpoint to send a LaunchTask event
          // containing the serialized task
          executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
        }
      }
    }
    

    The TaskDescription is serialized first, and its size is checked against the RPC message size limit maxRpcMessageSize (configured by spark.rpc.message.maxSize); an oversized task aborts its TaskSetManager. Otherwise the LaunchTask event, containing the serialized task, is sent through executorData's executorEndpoint, the RPC reference through which the driver talks to that executor, handing the task over for execution.
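    The limit used in that check is derived from configuration; condensed from the Spark 2.x RpcUtils (a bounds check is omitted here):

    // CoarseGrainedSchedulerBackend holds:
    //   private val maxRpcMessageSize = RpcUtils.maxMessageSizeBytes(conf)
    // RpcUtils.maxMessageSizeBytes, condensed: the setting is in MB, default 128.
    def maxMessageSizeBytes(conf: SparkConf): Int = {
      val maxSizeInMB = conf.getInt("spark.rpc.message.maxSize", 128)
      maxSizeInMB * 1024 * 1024
    }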

    The Executor Receives the LaunchTask Event

    The driver sends tasks through the backend CoarseGrainedSchedulerBackend; naturally, the executor side has a matching backend process for receiving them, CoarseGrainedExecutorBackend. This process has a one-to-one relationship with its executor and provides the communication channel between the executor and the driver. Let's look at how CoarseGrainedExecutorBackend handles the event it receives:

    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        // Deserialize the TaskDescription
        val taskDesc = ser.deserialize[TaskDescription](data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        // Call the executor's launchTask to run this task
        executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
          taskDesc.name, taskDesc.serializedTask)
      }
    

    The data is deserialized back into a TaskDescription, and the executor's launchTask is called to run the task. Following along:

    def launchTask(
        context: ExecutorBackend,
        taskId: Long,
        attemptNumber: Int,
        taskName: String,
        serializedTask: ByteBuffer): Unit = {
      // Create a TaskRunner for this task
      val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
        serializedTask)
      runningTasks.put(taskId, tr)
      // Submit the TaskRunner to the thread pool for execution
      threadPool.execute(tr)
    }
    

    A TaskRunner (which implements Runnable) is created and submitted to a thread pool for execution. The key part is TaskRunner's run method; the code is long, so only the essential logic is kept:

    override def run(): Unit = {
      ...
      try {
        // Deserialize the task's dependencies: taskFiles, taskJars, taskProps
        // and the task binary taskBytes
        val (taskFiles, taskJars, taskProps, taskBytes) =
          Task.deserializeWithDependencies(serializedTask)

        Executor.taskDeserializationProps.set(taskProps)
        // Download the files and jars the task depends on
        updateDependencies(taskFiles, taskJars)
        // Deserialize the task itself
        task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
        ...
        val value = try {
          // Call the task's run method: this is where the task really executes
          val res = task.run(
            taskAttemptId = taskId,
            attemptNumber = attemptNumber,
            metricsSystem = env.metricsSystem)
          threwException = false
          // Return the result
          res
        } finally {
          val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
          // Clean up all memory allocated through the task memory manager
          val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory()
          if (freedMemory > 0 && !threwException) {
            val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId"
            if (conf.getBoolean("spark.unsafe.exceptionOnMemoryLeak", false)) {
              throw new SparkException(errMsg)
            } else {
              logWarning(errMsg)
            }
          }
        ...

        val resultSer = env.serializer.newInstance()
        val beforeSerialization = System.currentTimeMillis()
        // Serialize the task result value
        val valueBytes = resultSer.serialize(value)
        val afterSerialization = System.currentTimeMillis()
        ...
        // Wrap the serialized result in a DirectTaskResult
        val directResult = new DirectTaskResult(valueBytes, accumUpdates)
        // Serialize the DirectTaskResult itself
        val serializedDirectResult = ser.serialize(directResult)
        val resultSize = serializedDirectResult.limit

        // directSend = sending directly back to the driver
        val serializedResult: ByteBuffer = {
          // If the result exceeds maxResultSize (configurable, default 1 GB), it is
          // dropped: the driver gets only an IndirectTaskResult stub, not the data
          if (maxResultSize > 0 && resultSize > maxResultSize) {
            ser.serialize(new IndirectTaskResult[Any](TaskResultBlockId(taskId), resultSize))
          // If the result is too large to travel in an RPC message,
          // write it to the BlockManager instead
          } else if (resultSize > maxDirectResultSize) {
            val blockId = TaskResultBlockId(taskId)
            env.blockManager.putBytes(
              blockId,
              new ChunkedByteBuffer(serializedDirectResult.duplicate()),
              StorageLevel.MEMORY_AND_DISK_SER)
            logInfo(
              s"Finished $taskName (TID $taskId). $resultSize bytes result sent via BlockManager)")
            ser.serialize(new IndirectTaskResult[Any](blockId, resultSize))
          // The result is small enough to send as a message; return it directly
          } else {
            logInfo(s"Finished $taskName (TID $taskId). $resultSize bytes result sent to driver")
            serializedDirectResult
          }
        }
        // Report the status update to the driver
        execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

      } catch {
        ...
        // Report the failure to the driver
        execBackend.statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
        ...
      } finally {
        // Whether the task succeeded or failed, remove it from runningTasks
        runningTasks.remove(taskId)
      }
    }
    
    • Task.deserializeWithDependencies deserializes the dependency information: taskFiles, taskJars, taskProps, and the task binary taskBytes
    • Download the files and jars the task depends on
    • Deserialize the task itself
    • Call the task's run method, which actually executes the task, and collect its result
    • Clean up the allocated memory
    • Serialize the task result, wrap it in a DirectTaskResult, serialize that again, and return it to the driver in one of three ways depending on its size (the relevant settings are sketched after this list):
      • If the result exceeds maxResultSize (configurable, default 1 GB), it is effectively dropped; the driver cannot recover the actual result from the returned object
      • If the result exceeds the maximum size an RPC message can carry (it cannot be delivered as a message), it is written to the BlockManager
      • If the result is small enough to travel as a message, it is returned directly
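    A sketch of the settings behind these three branches (the values shown are the Spark 2.x defaults; maxDirectResultSize is additionally capped by the RPC message size):

    val conf = new SparkConf()
      // Cap on the total size of results the driver will accept; 0 disables the check
      .set("spark.driver.maxResultSize", "1g")
      // Results larger than this are shipped via the BlockManager, not the RPC message
      .set("spark.task.maxDirectResultSize", "1m")
      // Maximum size of a single RPC message, in MB
      .set("spark.rpc.message.maxSize", "128")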

    Finally, CoarseGrainedExecutorBackend's statusUpdate method returns the result to the driver: it uses its RpcEndpointRef to the driver to send a StatusUpdate message containing serializedResult.
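    That method is short; roughly, from the Spark 2.x source:

    override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
      val msg = StatusUpdate(executorId, taskId, state, data)
      driver match {
        // Forward the serialized result (or its stub) to the driver endpoint
        case Some(driverRef) => driverRef.send(msg)
        case None => logWarning(s"Drop $msg because has not yet connected to driver")
      }
    }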

    Now let's see what the task's run method actually does:

    final def run(
        taskAttemptId: Long,
        attemptNumber: Int,
        metricsSystem: MetricsSystem): T = {
      SparkEnv.get.blockManager.registerTask(taskAttemptId)
      // Create the context instance for this task run
      context = new TaskContextImpl(
        stageId,
        partitionId,
        taskAttemptId,
        attemptNumber,
        taskMemoryManager,
        localProperties,
        metricsSystem,
        metrics)
      TaskContext.setTaskContext(context)
      taskThread = Thread.currentThread()
      if (_killed) {
        kill(interruptThread = false)
      }
      try {
        runTask(context)
      } catch {
        ...
      } finally {
        ... // Mark the task as finished and release memory
      }
    }
    

    Moving on to runTask: Task has two concrete implementations, ResultTask (the task of a ResultStage; there is one per partition of the job's final RDD) and ShuffleMapTask (the task of a ShuffleMapStage; there is one per partition of that stage's last RDD), and each implements runTask differently. First, ResultTask:

    override def runTask(context: TaskContext): U = {
      val deserializeStartTime = System.currentTimeMillis()
      val ser = SparkEnv.get.closureSerializer.newInstance()
      // Deserialize the rdd and func from the broadcast task binary
      val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
      _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
      // Apply func to the iterator over the rdd's given partition and return the result
      func(context, rdd.iterator(partition, context))
    }
    
    • Deserialize the rdd and func via the broadcast variable; the data comes from taskBinary
    • Apply func to the iterator over the rdd's given partition and return the result

    The func here depends on the specific action that triggered the job; the records of the partition are traversed through the iterator obtained from the rdd's iterator method.
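    For example, RDD.count() submits a job whose per-partition func simply counts the elements of the iterator; condensed from the Spark 2.x RDD source:

    // Each task applies Utils.getIteratorSize to its partition's iterator;
    // the driver then sums the per-partition counts.
    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum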

    Now for ShuffleMapTask. A ShuffleMapTask writes its output straight to disk via shuffle write, preparing the data that the downstream stage will consume through shuffle read:

    override def runTask(context: TaskContext): MapStatus = {
      // Deserialize the RDD using the broadcast variable.
      val deserializeStartTime = System.currentTimeMillis()
      val ser = SparkEnv.get.closureSerializer.newInstance()
      // Deserialize the rdd and the ShuffleDependency from the broadcast task binary
      val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
      _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

      var writer: ShuffleWriter[Any, Any] = null
      try {
        // Get the shuffleManager
        val manager = SparkEnv.get.shuffleManager
        // Obtain a shuffle writer from the shuffleManager via getWriter()
        writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
        // Traverse the partition's records via the rdd's iterator and hand
        // them to the writer's write method
        writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
        writer.stop(success = true).get
      } catch {
        case e: Exception =>
          try {
            if (writer != null) {
              writer.stop(success = false)
            }
          } catch {
            case e: Exception =>
              log.debug("Could not stop writer", e)
          }
          throw e
      }
    }
    
    • Deserialize the rdd and the ShuffleDependency from the broadcast variable; the data comes from taskBinary
    • Obtain a writer from the ShuffleManager and call its write method to write one partition of the rdd to disk
    • The rdd's iterator method traverses all the data of the corresponding partition (see the sketch below)
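    For reference, iterator first checks whether the RDD is cached and otherwise computes the partition (or reads it back from a checkpoint); condensed from the Spark 2.x RDD source:

    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        // Persisted RDD: try the block manager first, computing on a cache miss
        getOrCompute(split, context)
      } else {
        // Not persisted: compute the partition, or read it from a checkpoint
        computeOrReadCheckpoint(split, context)
      }
    }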

    How the driver handles the result once it receives it will be covered in a follow-up article...
