Spark - Source Code Analysis (Part 3)
Generating Tasks
In the previous post we saw that after Spark submits a Stage, the submitStage() method in DAGScheduler recursively walks up the Stage's dependencies to find the topmost parent Stage that still needs to run; once found, that topmost Stage is passed to the submitMissingTasks() method, which is defined as follows:
...
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  ...
  runningStages += stage
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  ...
  taskBinary = sc.broadcast(taskBinaryBytes)
  ...
Only part of the method is shown above. After the parent Stage we just described is passed in, it does several things; step by step:
- First, the Stage is added to the set of running stages. Then, based on its concrete type (as mentioned in the previous post, Stage has two subclasses, ResultStage and ShuffleMapStage), outputCommitCoordinator.stageStart() is called with the Stage's maximum partition ID. Despite the name, this call does not start any Stage thread.
- Second, a Task ID is derived for each partition of the RDD, and the preferred locations that decide where each Task should run are computed; these mappings are stored in the taskIdToLocations map.
- Third, a serializer is obtained and, depending on the Stage type, the Stage's RDD together with its shuffle dependency (or its result function) is serialized into taskBinaryBytes, which is broadcast as taskBinary in the excerpt above.
- Fourth, Tasks of the matching type are generated. There are two Task types, ResultTask and ShuffleMapTask, and the RDD's partition count is the key factor: one Task is created per partition, each one referencing the broadcast taskBinary (a simplified sketch of this step follows the list).
- Fifth, the generated Tasks are submitted through taskScheduler.submitTasks().
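To make the fourth step concrete, here is a minimal, Spark-free sketch of the idea: the Stage type decides the Task type, and the number of partitions decides how many Tasks are created. All of the Sketch* types below are simplified stand-ins invented for illustration; they are not Spark's real Stage or Task classes.

// Illustrative stand-in types only; these are not Spark's real Stage/Task classes.
sealed trait SketchTask { def partitionId: Int }
case class SketchShuffleMapTask(stageId: Int, partitionId: Int) extends SketchTask
case class SketchResultTask(stageId: Int, partitionId: Int) extends SketchTask

sealed trait SketchStage { def id: Int; def partitionsToCompute: Seq[Int] }
case class SketchShuffleMapStage(id: Int, partitionsToCompute: Seq[Int]) extends SketchStage
case class SketchResultStage(id: Int, partitionsToCompute: Seq[Int]) extends SketchStage

// The Stage type decides the Task type; the number of partitions decides how many Tasks.
def makeTasks(stage: SketchStage): Seq[SketchTask] = stage match {
  case s: SketchShuffleMapStage => s.partitionsToCompute.map(p => SketchShuffleMapTask(s.id, p))
  case s: SketchResultStage     => s.partitionsToCompute.map(p => SketchResultTask(s.id, p))
}

// Example: a ShuffleMapStage-like stage with 3 partitions yields 3 ShuffleMapTask-like tasks.
// makeTasks(SketchShuffleMapStage(id = 0, partitionsToCompute = 0 until 3))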
Submitting Tasks
TaskSchedulerImpl is a subclass of TaskScheduler and overrides its submitTasks method, so the taskScheduler.submitTasks() call above actually ends up in TaskSchedulerImpl's implementation, which looks like this:
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
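A small detail in submitTasks worth calling out is the starvation timer: a java.util.Timer task that keeps warning until the first task has actually launched, and then cancels itself. The same pattern in isolation looks roughly like this; the flag and the interval below are illustrative stand-ins, not Spark's actual fields or configuration.

import java.util.{Timer, TimerTask}
import java.util.concurrent.atomic.AtomicBoolean

// Illustrative sketch only, not Spark's code: warn periodically until a flag flips, then cancel.
object StarvationTimerSketch {
  val hasLaunchedTask = new AtomicBoolean(false)   // stand-in for TaskSchedulerImpl's flag
  val starvationTimeoutMs = 15000L                 // illustrative interval

  def start(): Unit = {
    val timer = new Timer("task-starvation-timer", true)  // daemon timer thread
    timer.scheduleAtFixedRate(new TimerTask {
      override def run(): Unit = {
        if (!hasLaunchedTask.get()) {
          println("Initial job has not accepted any resources; check that workers are registered")
        } else {
          this.cancel()  // stop warning once the first task has launched
        }
      }
    }, starvationTimeoutMs, starvationTimeoutMs)
  }
}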
The last thing this method does is call into the backend object, whose type is SchedulerBackend. SchedulerBackend is a trait (roughly the equivalent of a Java interface) with several implementations; the concrete one is chosen when the SparkContext is created, based on the --master parameter supplied by the user. If YARN cluster mode is specified, the concrete implementation is YarnClusterSchedulerBackend, which in turn is a subclass of CoarseGrainedSchedulerBackend. The reviveOffers() method is defined in CoarseGrainedSchedulerBackend, and YarnClusterSchedulerBackend does not override it, so what is actually invoked here is CoarseGrainedSchedulerBackend's reviveOffers() method:
override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}
The ReviveOffers used here is a case object that serves purely as a message: it is serialized and transported by the RPC framework to the Spark driver, and the driver side has a corresponding receive method to handle incoming messages:
override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          makeOffers(executorId)
        case None =>
          // Ignoring the update since we don't know about the executor.
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }

  case ReviveOffers =>
    makeOffers()
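The messaging style here is plain Scala pattern matching: each message is a case class or case object, and the handler is a PartialFunction over the message type. A minimal, Spark-free sketch of the same pattern follows; all names below are invented for illustration and are not Spark's real message classes.

// Illustrative only: the case-class-as-message / PartialFunction-as-handler pattern.
sealed trait SchedulerMessage
case object ReviveOffersMsg extends SchedulerMessage
case class StatusUpdateMsg(executorId: String, taskId: Long, finished: Boolean) extends SchedulerMessage

object MiniEndpoint {
  // Analogous to the driver endpoint's receive: each case handles one message type.
  val receive: PartialFunction[SchedulerMessage, Unit] = {
    case ReviveOffersMsg =>
      println("would call makeOffers() here")
    case StatusUpdateMsg(execId, taskId, finished) =>
      println(s"task $taskId on $execId finished=$finished; would free cores and re-offer")
  }
}

// Usage: MiniEndpoint.receive(ReviveOffersMsg)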
After receiving ReviveOffers, the driver calls the makeOffers() method:
// Make fake resource offers on all executors
private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map { case (id, executorData) =>
      new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
    }.toIndexedSeq
    scheduler.resourceOffers(workOffers)
  }
  if (!taskDescs.isEmpty) {
    launchTasks(taskDescs)
  }
}
It first collects all the Executors that are still alive. activeExecutors holds ExecutorData objects, and each ExecutorData records the host it runs on, its RPC address, how many cores it still has free, and so on. The resource information of every such Executor is then extracted and wrapped into a WorkerOffer. As always, nothing is clearer than the code itself:
private[cluster] class ExecutorData(
    val executorEndpoint: RpcEndpointRef,
    val executorAddress: RpcAddress,
    override val executorHost: String,
    var freeCores: Int,
    override val totalCores: Int,
    override val logUrlMap: Map[String, String]
) extends ExecutorInfo(executorHost, totalCores, logUrlMap)
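To make the ExecutorData-to-WorkerOffer mapping concrete, here is a simplified sketch with stand-in case classes; the real classes carry more fields, such as the executor's RpcEndpointRef, and the Mini* names are invented for illustration.

// Simplified stand-ins for ExecutorData and WorkerOffer; illustrative only.
case class MiniExecutorData(executorHost: String, var freeCores: Int)
case class MiniWorkerOffer(executorId: String, host: String, cores: Int)

// Mirror of makeOffers(): keep only alive executors, then describe each one's free resources.
def makeOffers(executorDataMap: Map[String, MiniExecutorData],
               executorIsAlive: String => Boolean): IndexedSeq[MiniWorkerOffer] = {
  executorDataMap
    .filter { case (id, _) => executorIsAlive(id) }                                   // drop executors being killed
    .map { case (id, data) => MiniWorkerOffer(id, data.executorHost, data.freeCores) }
    .toIndexedSeq
}

// Usage: makeOffers(Map("exec-1" -> MiniExecutorData("host-a", 4)), _ => true)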
Assigning Resources to Tasks
With that resource information in hand, resources can be assigned to all the Tasks, which is exactly what the resourceOffers(workOffers) call above does:
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave as alive and remember its hostname
  // Also track if new executor is added
  var newExecAvail = false
  for (o <- offers) {
    if (!hostToExecutors.contains(o.host)) {
      hostToExecutors(o.host) = new HashSet[String]()
    }
    if (!executorIdToRunningTaskIds.contains(o.executorId)) {
      hostToExecutors(o.host) += o.executorId
      executorAdded(o.executorId, o.host)
      executorIdToHost(o.executorId) = o.host
      executorIdToRunningTaskIds(o.executorId) = HashSet[Long]()
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Before making any offers, remove any nodes from the blacklist whose blacklist has expired. Do
  // this here to avoid a separate thread and added synchronization overhead, and also because
  // updating the blacklist is only relevant when task offers are being made.
  blacklistTrackerOpt.foreach(_.applyBlacklistTimeout())

  val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
    offers.filter { offer =>
      !blacklistTracker.isNodeBlacklisted(offer.host) &&
        !blacklistTracker.isExecutorBlacklisted(offer.executorId)
    }
  }.getOrElse(offers)

  val shuffledOffers = shuffleOffers(filteredOffers)
  // Build a list of tasks to assign to each worker.
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded()
    }
  }

  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
  for (taskSet <- sortedTaskSets) {
    var launchedAnyTask = false
    var launchedTaskAtCurrentMaxLocality = false
    for (currentMaxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
          taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
        launchedAnyTask |= launchedTaskAtCurrentMaxLocality
      } while (launchedTaskAtCurrentMaxLocality)
    }
    if (!launchedAnyTask) {
      taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
    }
  }

  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}
Roughly speaking, this method does the following:
- filters out any blacklisted Executors from the full Executor list
- randomly shuffles the order of the Executors
- assigns resources to each Task, walking the TaskSets in scheduling order and the locality levels from most to least local (a simplified sketch of this loop follows the list)
- returns the TaskDescription lists that will later be dispatched to the corresponding Executors
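The core of the assignment loop, stripped of locality levels, blacklist handling and the FIFO/fair pool, can be sketched with simplified stand-in types like the following; none of these classes or helpers are Spark's real ones.

import scala.collection.mutable.ArrayBuffer

// Simplified stand-ins; illustrative only (no locality levels, no blacklist handling).
case class MiniOffer(executorId: String, host: String, cores: Int)
case class MiniTaskDesc(taskId: Long, executorId: String)

// Walk the offers round-robin and place pending tasks wherever enough free CPUs
// remain -- the essence of the resourceOfferSingleTaskSet loop above.
def assign(offers: IndexedSeq[MiniOffer],
           pendingTaskIds: Seq[Long],
           cpusPerTask: Int = 1): IndexedSeq[Seq[MiniTaskDesc]] = {
  val availableCpus = offers.map(_.cores).toArray
  val assigned = offers.map(_ => new ArrayBuffer[MiniTaskDesc])
  var remaining = pendingTaskIds
  var progressed = true
  while (remaining.nonEmpty && progressed) {
    progressed = false
    for (i <- offers.indices if remaining.nonEmpty && availableCpus(i) >= cpusPerTask) {
      assigned(i) += MiniTaskDesc(remaining.head, offers(i).executorId)
      availableCpus(i) -= cpusPerTask
      remaining = remaining.tail
      progressed = true
    }
  }
  assigned.map(_.toSeq)
}

// Usage: assign(IndexedSeq(MiniOffer("e1", "h1", 2), MiniOffer("e2", "h2", 1)), Seq(0L, 1L, 2L))
// places two tasks on e1 and one on e2.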
Once resources have been assigned, control returns to the makeOffers() method, and the next step is to launch all of the Tasks, which is the launchTasks(taskDescs) call:
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = TaskDescription.encode(task)
    if (serializedTask.limit >= maxRpcMessageSize) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    } else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
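One detail of launchTasks worth isolating is the size guard: a task that does not fit into a single RPC message causes its whole TaskSet to be aborted rather than being sent. A Spark-free sketch of that pattern is below; the dispatch helper and its parameters are made up for illustration.

import java.nio.ByteBuffer

// Illustrative sketch of the guard in launchTasks: a serialized task must fit
// into a single RPC message, otherwise the TaskSet is aborted instead of sent.
def dispatch(serializedTask: ByteBuffer,
             maxRpcMessageSize: Int)(send: ByteBuffer => Unit, abort: String => Unit): Unit = {
  if (serializedTask.limit() >= maxRpcMessageSize) {
    // Mirrors the real error's advice: raise spark.rpc.message.maxSize or use broadcast variables.
    abort(s"Serialized task was ${serializedTask.limit()} bytes, exceeds max allowed $maxRpcMessageSize bytes")
  } else {
    send(serializedTask)  // in Spark this is executorEndpoint.send(LaunchTask(...))
  }
}

// Usage: dispatch(ByteBuffer.allocate(16), maxRpcMessageSize = 1024)(_ => println("sent"), println)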
Back in launchTasks, the loop takes each Task in turn, serializes it, wraps the bytes in a LaunchTask message and sends it to a particular Executor. Which Executor that is was already determined during the resource-assignment step we just walked through (scroll back up if you have forgotten). The LaunchTask message is carried by the RPC framework to the Executor side, where it is handled in CoarseGrainedExecutorBackend:
override def receive: PartialFunction[Any, Unit] = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    try {
      executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
    } catch {
      case NonFatal(e) =>
        exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
    }

  case RegisterExecutorFailed(message) =>
    exitExecutor(1, "Slave registration failed: " + message)

  case LaunchTask(data) =>
    if (executor == null) {
      exitExecutor(1, "Received LaunchTask command but executor was null")
    } else {
      val taskDesc = TaskDescription.decode(data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskDesc)
    }

  case KillTask(taskId, _, interruptThread) =>
    if (executor == null) {
      exitExecutor(1, "Received KillTask command but executor was null")
    } else {
      executor.killTask(taskId, interruptThread)
    }
  ...
The received Task is decoded (deserialized), and then the launchTask method of the Executor class is called:
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}
As you can see, it creates a TaskRunner object, records it in runningTasks, and hands it to the thread pool for execution. TaskRunner is simply an implementation of Java's Runnable interface, and each Task holds a reference to the executor backend context, i.e. the context object above. Since TaskRunner implements Runnable, it overrides the run() method, and next we would step into run() to look at the execution logic inside TaskRunner.
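As a small preview of that, the execution model at this point is ordinary Java concurrency: one Runnable per task, tracked in a concurrent map and handed to a thread pool. A simplified, Spark-free sketch of that shape follows; the Mini* names are invented, and the real TaskRunner does far more (deserialization, metrics, status updates back to the driver).

import java.util.concurrent.{ConcurrentHashMap, Executors}

// Illustrative only: one Runnable per task, registered in a map and run on a thread pool,
// mirroring Executor.launchTask / TaskRunner in shape, not in actual behavior.
object MiniExecutor {
  private val threadPool = Executors.newCachedThreadPool()
  private val runningTasks = new ConcurrentHashMap[Long, MiniTaskRunner]()

  class MiniTaskRunner(val taskId: Long) extends Runnable {
    override def run(): Unit = {
      try {
        println(s"running task $taskId")   // real work: deserialize the task and call task.run()
      } finally {
        runningTasks.remove(taskId)        // drop the bookkeeping entry when the task ends
      }
    }
  }

  def launchTask(taskId: Long): Unit = {
    val tr = new MiniTaskRunner(taskId)
    runningTasks.put(taskId, tr)
    threadPool.execute(tr)
  }
}

// Usage: MiniExecutor.launchTask(0L)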
That is as far as the analysis goes today; we will continue in the next post.