Flink 1.8 Heartbeat Service

Author: todd5167 | Published: 2019-10-06 14:00

    Heartbeat service

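    Flink monitors the liveness of all its components through a single heartbeat service. Like Flink's other services, this code is decoupled from its callers and reused in many places. This article looks at how Flink packages heartbeat management; it does not cover what happens after a timeout or how the payloads carried by heartbeats are processed. Let us answer the following questions first and then look at the code.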

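    • How is a heartbeat timeout detected?
      Once the heartbeat service is running, Flink schedules a timeout task for each monitored component; the task only runs after the configured heartbeat timeout has elapsed. Whenever a heartbeat message from that component arrives, the pending task is cancelled and scheduled again, which resets the timeout.
    • When do the two sides start exchanging heartbeats?
      Heartbeat checking is bidirectional: one side actively sends heartbeat requests and the other side responds. The two call each other over RPC and each call resets the timeout task held by the callee. Take JobManager (JM) and TaskManager (TM) as an example. When the JM starts, it begins a periodic schedule that sends heartbeat requests to every TM registered with it by calling the TM's requestHeartbeat method over RPC. On the TM this resets the timeout task for the JM, meaning the JM is healthy. After its requestHeartbeat method has been called, the TM calls the JM's receiveHeartbeat over RPC, which resets the JM's timeout task for that TM, meaning the TM is healthy.
    • How is a heartbeat timeout handled?
      The heartbeat service relies on a HeartbeatListener. If no heartbeat arrives within the timeout, the timeout task fires and calls the listener's notifyHeartbeatTimeout method, which performs the follow-up handling such as reconnecting.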
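
    The cancel-and-reschedule idea from the first answer can be sketched with the JDK's ScheduledExecutorService. This is a minimal illustration, not Flink's code: the class ResettableTimeout and its method names are made up for this example, and the 200 ms timeout is chosen only to keep the demo short.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper, not a Flink class: a timeout that every heartbeat re-arms. */
final class ResettableTimeout {
    private final ScheduledExecutorService executor;
    private final long timeoutMs;
    private final Runnable onTimeout;
    private ScheduledFuture<?> pending;

    ResettableTimeout(ScheduledExecutorService executor, long timeoutMs, Runnable onTimeout) {
        this.executor = executor;
        this.timeoutMs = timeoutMs;
        this.onTimeout = onTimeout;
        reset(); // arm the first timeout immediately
    }

    /** Called for every received heartbeat: cancel the pending task, then schedule a new one. */
    synchronized void reset() {
        if (pending != null) {
            pending.cancel(false);
        }
        pending = executor.schedule(onTimeout, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean timedOut = new AtomicBoolean(false);
        ResettableTimeout timeout = new ResettableTimeout(executor, 200, () -> timedOut.set(true));

        // Heartbeats arrive every 50 ms, well inside the 200 ms timeout.
        for (int i = 0; i < 5; i++) {
            Thread.sleep(50);
            timeout.reset();
        }
        System.out.println("timed out while heartbeats arrive: " + timedOut.get());

        // Heartbeats stop; the pending task is no longer cancelled and eventually runs.
        Thread.sleep(400);
        System.out.println("timed out after heartbeats stop: " + timedOut.get());
        executor.shutdownNow();
    }
}
```

    As long as reset() keeps being called before the delay elapses, the callback never runs; the first gap longer than the timeout lets it fire.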

    The main interfaces and classes used by the heartbeat service are described below:

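    HeartbeatTarget: the interface through which heartbeat requests are sent and heartbeat responses are received. Both the heartbeat sender and the heartbeat receiver implement it, and both kinds of message can carry a payload.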

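    public interface HeartbeatTarget<I> {
        // Delivers a heartbeat (the response) from heartbeatOrigin to this target
        void receiveHeartbeat(ResourceID heartbeatOrigin, I heartbeatPayload);
        // Asks this target for a heartbeat; requestOrigin identifies the requester
        void requestHeartbeat(ResourceID requestOrigin, I heartbeatPayload);
    }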
    

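    HeartbeatManager: starts and stops the monitoring of HeartbeatTargets and reports heartbeat timeouts for them. Targets are handed over through monitorTarget, which can be seen as the input of the whole service: it tells the heartbeat service which targets to manage.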

    public interface HeartbeatManager<I, O> extends HeartbeatTarget<I> {
        // Start monitoring a heartbeat target; when it times out, the
        // HeartbeatListener associated with this HeartbeatManager is notified
        void monitorTarget(ResourceID resourceID, HeartbeatTarget<O> heartbeatTarget);
        // Stop monitoring a heartbeat target; the ResourceID identifies the target
        void unmonitorTarget(ResourceID resourceID);
        // Stop this heartbeat manager
        void stop();
        // Return the time of the last heartbeat, or -1 if the target has been removed
        long getLastHeartbeatFrom(ResourceID resourceId);
    }
    

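    HeartbeatListener: the interface most closely tied to HeartbeatManager; it can be seen as the output of the service. It has three jobs:

    • It is notified of heartbeat timeouts
    • It receives the payload carried by incoming heartbeats
    • It supplies the payload to attach to outgoing heartbeats
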
    public interface HeartbeatListener<I, O> {
        // Called when the heartbeat of the given resource times out
        void notifyHeartbeatTimeout(ResourceID resourceID);
        // Called whenever a heartbeat carrying a payload is received
        void reportPayload(ResourceID resourceID, I payload);
        // Retrieves the payload for the next heartbeat message
        O retrievePayload(ResourceID resourceID);
    }
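
    To make the three callbacks concrete, here is a small sketch of a listener implementation. It is only an illustration: SimpleHeartbeatListener is a stand-in that mirrors the interface above with String ids instead of Flink's ResourceID, and RecordingListener, the Integer payload and the "status-for-" strings are all invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Demo holder; every type in this block is a hypothetical stand-in, not Flink code. */
final class ListenerSketch {

    /** Mirrors the three callbacks of HeartbeatListener, with String ids instead of ResourceID. */
    interface SimpleHeartbeatListener<I, O> {
        void notifyHeartbeatTimeout(String resourceId);
        void reportPayload(String resourceId, I payload);
        O retrievePayload(String resourceId);
    }

    /** Receives Integer payloads from peers and answers with a status string. */
    static final class RecordingListener implements SimpleHeartbeatListener<Integer, String> {
        final List<String> timedOut = new ArrayList<>();
        final Map<String, Integer> lastPayload = new HashMap<>();

        @Override
        public void notifyHeartbeatTimeout(String resourceId) {
            // A real listener would disconnect from or re-register the peer here.
            timedOut.add(resourceId);
        }

        @Override
        public void reportPayload(String resourceId, Integer payload) {
            // Remember the most recent payload reported by each peer.
            lastPayload.put(resourceId, payload);
        }

        @Override
        public String retrievePayload(String resourceId) {
            // Payload that would be attached to the next heartbeat sent to this peer.
            return "status-for-" + resourceId;
        }
    }

    public static void main(String[] args) {
        RecordingListener listener = new RecordingListener();
        listener.reportPayload("tm-1", 4);
        listener.notifyHeartbeatTimeout("tm-2");
        System.out.println("payloads: " + listener.lastPayload);
        System.out.println("timed out: " + listener.timedOut);
        System.out.println("outgoing: " + listener.retrievePayload("tm-1"));
    }
}
```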
    
    

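    Relevant code

    • Where the heartbeat service is created
      A number of services are initialized when the cluster starts; the heartbeat service is created in ClusterEntrypoint#initializeServices.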
    heartbeatServices = createHeartbeatServices(configuration);
    
    protected HeartbeatServices createHeartbeatServices(Configuration configuration) {
      return HeartbeatServices.fromConfiguration(configuration);
    }
    
    • The heartbeat interval heartbeat.interval and the heartbeat timeout heartbeat.timeout are read from the configuration and used to create HeartbeatServices
    public static HeartbeatServices fromConfiguration(Configuration configuration) {
        // Heartbeat interval, 10 s by default
        long heartbeatInterval = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_INTERVAL);
        // Heartbeat timeout, 50 s by default
        long heartbeatTimeout = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT);
    
        return new HeartbeatServices(heartbeatInterval, heartbeatTimeout);
    }
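
    The same lookup-with-defaults pattern can be tried without Flink on the classpath. The sketch below reads the two keys named above from java.util.Properties and falls back to the 10 s and 50 s defaults mentioned in the comments. HeartbeatSettings is a made-up class, and the two sanity checks (positive interval, timeout not shorter than the interval) are assumptions of this sketch rather than something taken from the code shown here.

```java
import java.util.Properties;

/** Hypothetical stand-in for the configuration lookup above; not Flink code. */
final class HeartbeatSettings {
    static final long DEFAULT_INTERVAL_MS = 10_000L; // 10 s, as stated in the text
    static final long DEFAULT_TIMEOUT_MS = 50_000L;  // 50 s, as stated in the text

    final long intervalMs;
    final long timeoutMs;

    private HeartbeatSettings(long intervalMs, long timeoutMs) {
        this.intervalMs = intervalMs;
        this.timeoutMs = timeoutMs;
    }

    static HeartbeatSettings fromProperties(Properties props) {
        long interval = readLong(props, "heartbeat.interval", DEFAULT_INTERVAL_MS);
        long timeout = readLong(props, "heartbeat.timeout", DEFAULT_TIMEOUT_MS);
        // Sanity checks assumed for this sketch.
        if (interval <= 0) {
            throw new IllegalArgumentException("heartbeat.interval must be positive");
        }
        if (timeout < interval) {
            throw new IllegalArgumentException("heartbeat.timeout must not be shorter than heartbeat.interval");
        }
        return new HeartbeatSettings(interval, timeout);
    }

    /** Returns the parsed value of the key, or the default when the key is absent. */
    private static long readLong(Properties props, String key, long defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("heartbeat.timeout", "30000"); // override only the timeout
        HeartbeatSettings settings = fromProperties(props);
        System.out.println("interval=" + settings.intervalMs + " ms, timeout=" + settings.timeoutMs + " ms");
    }
}
```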
    
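    • The core methods createHeartbeatManager and createHeartbeatManagerSender

      The two classes behind these methods, HeartbeatManagerImpl and HeartbeatManagerSenderImpl, are the heart of the whole service.

      HeartbeatManagerImpl is created by the side that receives heartbeat requests (for example the TM) and handles the requests sent by the initiating side (the JM). It has two important fields, heartbeatListener and heartbeatTargets. heartbeatTargets is a Map: the key is the ID of a monitored component (for example a TM's ID in the JM's manager) and the value is the HeartbeatMonitor created for that component, which triggers the heartbeat timeout. Keys and monitors correspond one to one, and when a heartbeat times out, heartbeatListener.notifyHeartbeatTimeout is invoked. Note that the receiving side is passive: it never sends requests on its own, and its timeout for the peer is reset only when its requestHeartbeat method is called. (The first timeout is already armed in the HeartbeatMonitor constructor, which calls resetHeartbeatTimeout.)

    // The caller passes in a heartbeatTarget, and a HeartbeatMonitor is created for it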
    public void monitorTarget(ResourceID resourceID, HeartbeatTarget<O> heartbeatTarget) {
       if (!stopped) {
           if (heartbeatTargets.containsKey(resourceID)) {
               log.debug("The target with resource ID {} is already been monitored.", resourceID);
           } else {
               HeartbeatManagerImpl.HeartbeatMonitor<O> heartbeatMonitor = new HeartbeatManagerImpl.HeartbeatMonitor<>(
                   resourceID,
                   heartbeatTarget,
                   mainThreadExecutor,
                   heartbeatListener,
                   heartbeatTimeoutIntervalMs);
    
               heartbeatTargets.put(
                   resourceID,
                   heartbeatMonitor);
    
               // check if we have stopped in the meantime (concurrent stop operation)
               if (stopped) {
                   heartbeatMonitor.cancel();
    
                   heartbeatTargets.remove(resourceID);
               }
           }
       }
    }
    

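    HeartbeatMonitor manages a single heartbeat target. If no heartbeat signal arrives within the timeout, it declares a heartbeat timeout and notifies the HeartbeatListener; every received heartbeat resets the timer.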

    static class HeartbeatMonitor<O> implements Runnable {
    
        /** Resource ID of the monitored heartbeat target. */
        private final ResourceID resourceID;
    
        /** Associated heartbeat target. */
        private final HeartbeatTarget<O> heartbeatTarget;
    
        private final ScheduledExecutor scheduledExecutor;
    
        /** Listener which is notified about heartbeat timeouts. */
        private final HeartbeatListener<?, ?> heartbeatListener;
    
        /** Maximum heartbeat timeout interval. */
        private final long heartbeatTimeoutIntervalMs;
    
        private volatile ScheduledFuture<?> futureTimeout;
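        // Monitor state, held in an AtomicReference so transitions can use compare-and-set
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        // Time at which the most recent heartbeat was received
        private volatile long lastHeartbeat;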
    
        HeartbeatMonitor(
            ResourceID resourceID,
            HeartbeatTarget<O> heartbeatTarget,
            ScheduledExecutor scheduledExecutor,
            HeartbeatListener<?, O> heartbeatListener,
            long heartbeatTimeoutIntervalMs) {
            // ID of the monitored resource
            this.resourceID = Preconditions.checkNotNull(resourceID);
            // Heartbeat target
            this.heartbeatTarget = Preconditions.checkNotNull(heartbeatTarget);
            this.scheduledExecutor = Preconditions.checkNotNull(scheduledExecutor);
            // Heartbeat listener
            this.heartbeatListener = Preconditions.checkNotNull(heartbeatListener);
    
            Preconditions.checkArgument(heartbeatTimeoutIntervalMs > 0L, "The heartbeat timeout interval has to be larger than 0.");
            this.heartbeatTimeoutIntervalMs = heartbeatTimeoutIntervalMs;
    
            lastHeartbeat = 0L;
    
            resetHeartbeatTimeout(heartbeatTimeoutIntervalMs);
        }
    
        HeartbeatTarget<O> getHeartbeatTarget() {
            return heartbeatTarget;
        }
    
        ResourceID getHeartbeatTargetId() {
            return resourceID;
        }
    
        public long getLastHeartbeat() {
            return lastHeartbeat;
        }
        // Report a heartbeat
        void reportHeartbeat() {
            // Record the time of the most recent heartbeat
            lastHeartbeat = System.currentTimeMillis();
            // A heartbeat arrived, so re-arm the timeout task
            resetHeartbeatTimeout(heartbeatTimeoutIntervalMs);
        }
        // Reset the timeout
        void resetHeartbeatTimeout(long heartbeatTimeout) {
            if (state.get() == State.RUNNING) {
                // Cancel the pending timeout task first, then schedule a new one
                cancelTimeout();
                // Schedule the timeout task
                futureTimeout = scheduledExecutor.schedule(this, heartbeatTimeout, TimeUnit.MILLISECONDS);
    
                // Double check for concurrent accesses (e.g. a firing of the scheduled future)
                if (state.get() != State.RUNNING) {
                    cancelTimeout();
                }
            }
        }
    
        void cancel() {
            // we can only cancel if we are in state running
            if (state.compareAndSet(State.RUNNING, State.CANCELED)) {
                cancelTimeout();
            }
        }
    
        private void cancelTimeout() {
            if (futureTimeout != null) {
                futureTimeout.cancel(true);
            }
        }
    
        public boolean isCanceled() {
            return state.get() == State.CANCELED;
        }
        // Heartbeat timed out: invoke the listener's notifyHeartbeatTimeout
        @Override
        public void run() {
            // The heartbeat has timed out if we're in state running
            if (state.compareAndSet(State.RUNNING, State.TIMEOUT)) {
                heartbeatListener.notifyHeartbeatTimeout(resourceID);
            }
        }
    
        private enum State {
            RUNNING,
            TIMEOUT,
            CANCELED
        }
    
    }
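
    The monitor's state handling is worth isolating: because both cancel() and run() move the state away from RUNNING with compareAndSet, only one of them can win, so the listener is notified at most once and never after a cancel. The sketch below reproduces only that state machine; MonitorStateSketch is a made-up class and the scheduling part is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical reduction of the monitor's state machine; scheduling is omitted. */
final class MonitorStateSketch implements Runnable {
    enum State { RUNNING, TIMEOUT, CANCELED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final Runnable onTimeout;

    MonitorStateSketch(Runnable onTimeout) {
        this.onTimeout = onTimeout;
    }

    /** Returns true only if this call moved the monitor from RUNNING to CANCELED. */
    boolean cancel() {
        return state.compareAndSet(State.RUNNING, State.CANCELED);
    }

    /** Stands in for the timeout task: notifies only if the monitor is still RUNNING. */
    @Override
    public void run() {
        if (state.compareAndSet(State.RUNNING, State.TIMEOUT)) {
            onTimeout.run();
        }
    }

    State currentState() {
        return state.get();
    }

    public static void main(String[] args) {
        AtomicInteger notifications = new AtomicInteger();

        MonitorStateSketch timedOut = new MonitorStateSketch(notifications::incrementAndGet);
        timedOut.run();
        timedOut.run(); // second firing loses the compare-and-set and is ignored
        System.out.println(timedOut.currentState() + ", notifications=" + notifications.get());

        MonitorStateSketch canceled = new MonitorStateSketch(notifications::incrementAndGet);
        canceled.cancel();
        canceled.run(); // a late timeout after cancel() is ignored as well
        System.out.println(canceled.currentState() + ", notifications=" + notifications.get());
    }
}
```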
    

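    HeartbeatManagerSenderImpl is a subclass of HeartbeatManagerImpl and is created by the side that drives the heartbeats (for example the JM). As soon as it is created it starts a periodic schedule; on every run it iterates over the heartbeat targets it manages and calls heartbeatTarget.requestHeartbeat on each. This is the active side.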

    this.heartbeatPeriod = heartbeatPeriod;
    mainThreadExecutor.schedule(this, 0L, TimeUnit.MILLISECONDS);
    
    public void run() {
        if (!stopped) {
            log.debug("Trigger heartbeat request.");
            for (HeartbeatMonitor<O> heartbeatMonitor : getHeartbeatTargets()) {
                requestHeartbeat(heartbeatMonitor);
            }
            // Re-schedule itself to get periodic execution
            getMainThreadExecutor().schedule(this, heartbeatPeriod, TimeUnit.MILLISECONDS);
        }
    }
    // Actively send a heartbeat request
    private void requestHeartbeat(HeartbeatMonitor<O> heartbeatMonitor) {
        O payload = getHeartbeatListener().retrievePayload(heartbeatMonitor.getHeartbeatTargetId());
        final HeartbeatTarget<O> heartbeatTarget = heartbeatMonitor.getHeartbeatTarget();
        heartbeatTarget.requestHeartbeat(getOwnResourceID(), payload);
    }
    
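    • How the TM uses HeartbeatManagerImpl
    1. After the TM starts, it connects to the JM. Once the connection is established, it creates a HeartbeatTarget for the JM and overrides receiveHeartbeat. At this point HeartbeatManagerImpl has already created the corresponding monitor; from then on its timeout is reset only when the JM calls requestHeartbeat.
    TaskExecutor#establishJobManagerConnection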
    
    private void establishJobManagerConnection(JobID jobId, final JobMasterGateway jobMasterGateway, JMTMRegistrationSuccess registrationSuccess) {
    
        ResourceID jobManagerResourceID = registrationSuccess.getResourceID();
        // monitor the job manager as heartbeat target
        jobManagerHeartbeatManager.monitorTarget(jobManagerResourceID, new HeartbeatTarget<AccumulatorReport>() {
            // The TM only answers heartbeat requests: the response is forwarded to the JM over RPC
            @Override
            public void receiveHeartbeat(ResourceID resourceID, AccumulatorReport payload) {
                jobMasterGateway.heartbeatFromTaskManager(resourceID, payload);
            }
    
            @Override
            public void requestHeartbeat(ResourceID resourceID, AccumulatorReport payload) {
                // request heartbeat will never be called on the task manager side
            }
        });
    }
    
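    2. Inside receiveHeartbeat, the TM calls the JM's heartbeatFromTaskManager method directly over RPC, which ends up in HeartbeatManagerImpl#receiveHeartbeat. There reportHeartbeat resets the timeout the JM keeps for this TM, indicating that the TM is running normally.
    // JobMaster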
    public void heartbeatFromTaskManager(final ResourceID resourceID, AccumulatorReport accumulatorReport) {
        taskManagerHeartbeatManager.receiveHeartbeat(resourceID, accumulatorReport);
    }
    
    // Creation of taskManagerHeartbeatManager
    taskManagerHeartbeatManager = heartbeatServices.createHeartbeatManagerSender(
        resourceId,
        new TaskManagerHeartbeatListener(),
        getMainThreadExecutor(),
        log);
    
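    // HeartbeatManagerImpl#receiveHeartbeat: the JM receives a heartbeat
    public void receiveHeartbeat(ResourceID heartbeatOrigin, I heartbeatPayload) {
        if (!stopped) {
            log.debug("Received heartbeat from {}.", heartbeatOrigin);
            // Record the heartbeat and reset the timeout for its origin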
            reportHeartbeat(heartbeatOrigin);
        
            if (heartbeatPayload != null) {
                heartbeatListener.reportPayload(heartbeatOrigin, heartbeatPayload);
            }
        }
    }
    
    
    • How the JM uses HeartbeatManagerSenderImpl
    1. After a TM registers, it is added to the set of heartbeat targets; the next period then triggers the TM's requestHeartbeat.
    public CompletableFuture<RegistrationResponse> registerTaskManager(
            final String taskManagerRpcAddress,
            final TaskManagerLocation taskManagerLocation,
            final Time timeout) {

        final ResourceID taskManagerId = taskManagerLocation.getResourceID();

        if (registeredTaskManagers.containsKey(taskManagerId)) {
            final RegistrationResponse response = new JMTMRegistrationSuccess(resourceId);
            return CompletableFuture.completedFuture(response);
        } else {
            return getRpcService()
                .connect(taskManagerRpcAddress, TaskExecutorGateway.class)
                .handleAsync(
                    (TaskExecutorGateway taskExecutorGateway, Throwable throwable) -> {
                        if (throwable != null) {
                            return new RegistrationResponse.Decline(throwable.getMessage());
                        }

                        slotPool.registerTaskManager(taskManagerId);
                        registeredTaskManagers.put(taskManagerId, Tuple2.of(taskManagerLocation, taskExecutorGateway));

                        // monitor the task manager as heartbeat target
                        taskManagerHeartbeatManager.monitorTarget(taskManagerId, new HeartbeatTarget<AllocatedSlotReport>() {
                            @Override
                            public void receiveHeartbeat(ResourceID resourceID, AllocatedSlotReport payload) {
                                // the task manager will not request heartbeat, so this method will never be called currently
                            }

                            // The JM asks the TM for a heartbeat
                            @Override
                            public void requestHeartbeat(ResourceID resourceID, AllocatedSlotReport allocatedSlotReport) {
                                taskExecutorGateway.heartbeatFromJobManager(resourceID, allocatedSlotReport);
                            }
                        });

                        return new JMTMRegistrationSuccess(resourceId);
                    },
                    getMainThreadExecutor());
        }
    }
    
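    2. requestHeartbeat makes an RPC call to TaskExecutor#heartbeatFromJobManager, which ends up in HeartbeatManagerImpl#requestHeartbeat. That call starts or resets the timeout task, indicating that the JM is healthy. The same method then calls the JM's receiveHeartbeat over RPC.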
    public void requestHeartbeat(final ResourceID requestOrigin, I heartbeatPayload) {
        if (!stopped) {
            log.debug("Received heartbeat request from {}.", requestOrigin);
            // Reset the timeout task and look up the heartbeatTarget; here the target is the JM
            final HeartbeatTarget<O> heartbeatTarget = reportHeartbeat(requestOrigin);

            if (heartbeatTarget != null) {
                if (heartbeatPayload != null) {
                    heartbeatListener.reportPayload(requestOrigin, heartbeatPayload);
                }
                // Call the JM's receiveHeartbeat over RPC
                heartbeatTarget.receiveHeartbeat(getOwnResourceID(), heartbeatListener.retrievePayload(requestOrigin));
            }
        }
    }
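
    Since the hard part is keeping track of who calls whom, it can help to replay one round trip in a single process. The sketch below replaces the RPC gateways with direct method calls and logs each hop. All names (Target, Sender, Receiver, the "jm"/"tm" ids and the payload strings) are invented for this illustration, and timeouts are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** In-process sketch of one heartbeat round trip; none of these types are Flink classes. */
final class HeartbeatRoundTripSketch {

    /** Same two methods as HeartbeatTarget, with String ids and payloads for brevity. */
    interface Target {
        void receiveHeartbeat(String origin, String payload);
        void requestHeartbeat(String origin, String payload);
    }

    /** Passive side (the TM's role): answers every request with a response. */
    static final class Receiver implements Target {
        private final String id;
        private final List<String> events;
        private Target peer; // stands in for the RPC gateway to the sender

        Receiver(String id, List<String> events) {
            this.id = id;
            this.events = events;
        }

        void connect(Target peer) {
            this.peer = peer;
        }

        @Override
        public void requestHeartbeat(String origin, String payload) {
            // A real manager would reset its timeout for `origin` at this point.
            events.add(id + " <- request from " + origin + " [" + payload + "]");
            peer.receiveHeartbeat(id, "report-of-" + id);
        }

        @Override
        public void receiveHeartbeat(String origin, String payload) {
            // Never called on the passive side.
        }
    }

    /** Active side (the JM's role): starts each round and consumes the response. */
    static final class Sender implements Target {
        private final String id;
        private final List<String> events;

        Sender(String id, List<String> events) {
            this.id = id;
            this.events = events;
        }

        void triggerRound(Target target) {
            target.requestHeartbeat(id, "slots-of-" + id);
        }

        @Override
        public void receiveHeartbeat(String origin, String payload) {
            // A real manager would reset its timeout for `origin` at this point.
            events.add(id + " <- response from " + origin + " [" + payload + "]");
        }

        @Override
        public void requestHeartbeat(String origin, String payload) {
            // Never called on the active side.
        }
    }

    /** Wires one sender to one receiver, runs a single round and returns the event log. */
    static List<String> runOneRound() {
        List<String> events = new ArrayList<>();
        Sender jm = new Sender("jm", events);
        Receiver tm = new Receiver("tm", events);
        tm.connect(jm);
        jm.triggerRound(tm);
        return events;
    }

    public static void main(String[] args) {
        runOneRound().forEach(System.out::println);
    }
}
```

    The log shows the request reaching the passive side first and the response reaching the active side second, which is the same order as the JM and TM calls described above.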
    

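    Not many classes are involved; the tricky part is working out which side is the caller of each method across the RPC boundary.

    Implementation in our project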

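    The heartbeat check in our own project passes messages through ZooKeeper. While the heartbeat service of a Slave is running, it periodically writes the node's heartbeat, encoded as a number, to ZooKeeper. The Master pulls the heartbeat values from ZooKeeper on its own schedule. On the first read it caches the value in local memory; on later reads it compares the new value with the cached one. If they differ, the worker node is healthy and the new value replaces the cached one. If they are equal, the node did not write a heartbeat to ZooKeeper during the last check period, and its count of abnormal heartbeats is incremented. Once that count reaches the configured threshold, the Master declares the Slave down, disables it and migrates its tasks.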
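
    The Master-side comparison described above can be sketched without ZooKeeper by passing the value read for a node directly into a checker. This is only a rough model of the description: CounterHeartbeatChecker, its method names and the choice to reset the miss counter whenever the value changes are assumptions of this sketch, not the project's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical Master-side check; the value would normally be read from ZooKeeper. */
final class CounterHeartbeatChecker {
    private final int maxMisses;
    private final Map<String, Long> cachedValue = new HashMap<>();
    private final Map<String, Integer> misses = new HashMap<>();

    CounterHeartbeatChecker(int maxMisses) {
        this.maxMisses = maxMisses;
    }

    /**
     * Feeds the heartbeat value read in the current check period.
     * Returns true once the node has missed maxMisses consecutive periods.
     */
    boolean isDown(String nodeId, long currentValue) {
        Long previous = cachedValue.put(nodeId, currentValue);
        if (previous == null || previous != currentValue) {
            // First observation, or the value changed: the node is alive.
            misses.put(nodeId, 0);
            return false;
        }
        // Unchanged value: the node did not write a heartbeat in the last period.
        int count = misses.merge(nodeId, 1, Integer::sum);
        return count >= maxMisses;
    }

    public static void main(String[] args) {
        CounterHeartbeatChecker checker = new CounterHeartbeatChecker(2);
        long[] valuesSeen = {1, 2, 2, 2, 3};
        for (long value : valuesSeen) {
            System.out.println("value=" + value + " down=" + checker.isDown("slave-1", value));
        }
    }
}
```

    Compared with the timeout tasks used by Flink, detection here is quantized by the Master's polling period: a node is declared down only after the configured number of unchanged reads.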

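    Flink instead has the two sides call each other over RPC, each call resetting the timeout held by the callee. Compared with our approach, Flink packages heartbeat management as a standalone service, which decouples it and makes it easy to extend; the code is indeed reused in many places. The cost is a dependency on RPC communication between the components.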
