Eureka Lease Renewal: Periodic Eviction

Author: 0爱上1 | Published: 2019-05-01 21:37

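    Preface

    This article looks at how expired instances are periodically evicted, from the Eureka Server's point of view.

    A client instance that shuts down normally calls unregister() before the application exits, sending an explicit deregistration request to the Eureka Server.

    A Eureka client instance that goes down abnormally (out of memory, a killed process, a crashed host, and so on) never gets the chance to call unregister() before it stops.

    Eureka therefore has the server actively remove client instances that can no longer serve requests. This mechanism is called expired-lease eviction, or eviction for short.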

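    Eviction

    Every 60s by default, a background task on the Eureka Server removes clients from which no heartbeat renewal has arrived within 90s(?) (the default).

    Keep that question mark in mind: is an instance really evicted once 90s pass without a heartbeat?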

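    Key parameters

    • Server side

    This parameter sets the interval at which the server runs the eviction task. If it is not configured, the task runs every 60s; set it to use a custom interval.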

      eureka:
        server:
          eviction-interval-timer-in-ms: 60000 # default (60s)
    
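    • Client side

    This parameter is set on the client and sent to the server in the POST request made at registration. If the client does not configure it, the default of 90s applies.

    According to the official documentation:

    If the value is too large, traffic may keep being routed to an instance even though it can no longer serve requests.
    If the value is too small, a network blip can cause the server to miss heartbeats within the configured time and evict an instance that is actually healthy, which makes the service look as if it had gone offline.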

      eureka:
        instance:
          lease-expiration-duration-in-seconds: 90 # default
    

    Sequence diagram

    [Figure: Eureka eviction task]

    A rough outline of how the eviction task gets started:

    • Version

    spring-cloud-netflix-eureka-server-2.1.1.RELEASE.jar

    1. The META-INF/spring.factories file

       org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
       org.springframework.cloud.netflix.eureka.server.EurekaServerAutoConfiguration
      
    2. Spring loads this auto-configuration class, which @Imports the EurekaServerInitializerConfiguration configuration class

    @Configuration
    @Import(EurekaServerInitializerConfiguration.class)
    @ConditionalOnBean(EurekaServerMarkerConfiguration.Marker.class)
    @EnableConfigurationProperties({ EurekaDashboardProperties.class,
        InstanceRegistryProperties.class })
    @PropertySource("classpath:/eureka/server.properties")
    public class EurekaServerAutoConfiguration extends WebMvcConfigurerAdapter {
    ...
    }
    
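    3. EurekaServerInitializerConfiguration implements SmartLifecycle, so once the Spring container has finished loading and initializing its beans, its start() method is invoked together with those of the other Lifecycle beans that are started automatically.

    4. The remaining logic, which initializes and schedules the EvictionTask, is shown in the sequence diagram.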

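    Source code

    • UML class diagram
    [Figure: EvictionTask]
    • TimerTask + Timer

    TimerTask is an abstract class, available since JDK 1.3, for a task that a timer can schedule to run once or repeatedly. It works together with the Timer class, which does the scheduling. Eureka's EvictionTask relies on the two to run in the background.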
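    As a rough illustration of the Timer + TimerTask pattern, the sketch below schedules a task on a daemon Timer with the same initial delay and period, which is how the eviction task is scheduled. TimerSketch, firesRepeatedly and the thread name are hypothetical names used only for this example; they are not part of Eureka.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class TimerSketch {

    /**
     * Schedules a task with initial delay == period == intervalMs and reports
     * whether it ran at least `runs` times before timeoutMs elapsed.
     */
    static boolean firesRepeatedly(long intervalMs, int runs, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(runs);
        // Second argument true: the timer thread is a daemon and will not keep the JVM alive
        Timer timer = new Timer("Sketch-EvictionTimer", true);
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                latch.countDown(); // stand-in for the real periodic work
            }
        };
        timer.schedule(task, intervalMs, intervalMs);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            task.cancel();
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("fired 3 times: " + firesRepeatedly(50, 3, 2000));
    }
}
```

    Timer.schedule uses fixed-delay execution, so a late run pushes back the following ones. That is presumably one reason the eviction task measures how late it actually ran.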

    • EvictionTask

    EvictionTask is an inner class of AbstractInstanceRegistry and extends the abstract class java.util.TimerTask.

    public abstract class AbstractInstanceRegistry implements InstanceRegistry {

        // Eviction timer; runs on a daemon (background) thread
        private Timer evictionTimer = new Timer("Eureka-EvictionTimer", true);

        // Start the eviction timer
        protected void postInit() {
            renewsLastMin.start();
            if (evictionTaskRef.get() != null) {
                evictionTaskRef.get().cancel();
            }
            evictionTaskRef.set(new EvictionTask());
            // Both the initial delay and the period are evictionIntervalTimerInMs
            evictionTimer.schedule(evictionTaskRef.get(),
                    serverConfig.getEvictionIntervalTimerInMs(),
                    serverConfig.getEvictionIntervalTimerInMs());
        }
    
        // Eviction task inner class; extends TimerTask and overrides run()
        class EvictionTask extends TimerTask {

            // Time of the previous execution, in nanoseconds (from System.nanoTime())
            private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);
    
            @Override
            public void run() {
                try {
                    // Get the compensation time
                    long compensationTimeMs = getCompensationTimeMs();
                    logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);

                    // Run the eviction
                    evict(compensationTimeMs);
                } catch (Throwable e) {
                    logger.error("Could not run the evict task", e);
                }
            }
    
            /**
            * compute a compensation time defined as the actual time this task was executed since the prev iteration,
            * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
            * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
            * according to the configured cycle.
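            * In short: compute the time elapsed between this run and the previous one. If it does not exceed the
            * configured interval (60s by default), return 0; otherwise return the excess as the compensation time.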
            */
            long getCompensationTimeMs() {
                // Current time in nanoseconds
                long currNanos = getCurrentTimeNano();

                // getAndSet returns the previous execution time in nanoseconds (0 on the very first run)
                // and stores the current time in lastExecutionNanosRef
                long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);

                if (lastNanos == 0l) {
                    // First run of the eviction task: nothing to compensate
                    return 0l;
                }

                // Time elapsed between this run and the previous one
                long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);

                // How far the elapsed time exceeds the configured eviction interval (60s by default)
                long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();

                // Return 0 if the interval was not exceeded; otherwise return the excess
                return compensationTime <= 0l ? 0l : compensationTime;
            }
    
            // Current time in nanoseconds
            long getCurrentTimeNano() {  // for testing
                return System.nanoTime();
            }
    
        }
    
        // The method that actually performs the eviction
        public void evict(long additionalLeaseMs) {
    
            logger.debug("Running the evict task");
    
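            // 1. Check whether lease expiration is enabled. Eviction proceeds only if it is (true); otherwise return at once.
            // isLeaseExpirationEnabled() is implemented by PeerAwareInstanceRegistryImpl and returns true in two cases:
            // 1: Self-preservation mode is disabled on the server, so lease expiration is always enabled.
            // 2: Self-preservation mode is enabled but has not been triggered, i.e.
            //    numberOfRenewsPerMinThreshold > 0 and renewals in the last minute > numberOfRenewsPerMinThreshold.
            //
            // How self-preservation is triggered:
            // the expected minimum renewals per minute, i.e. the self-preservation threshold (numberOfRenewsPerMinThreshold) =
            // number of registered instances (expectedNumberOfClientsSendingRenews, incremented by 1 on each client registration) *
            // renewals per instance per minute (60.0 / the expected client renewal interval in seconds) *
            // renewal percent threshold (default 0.85).
            // When the actual renewals per minute <= numberOfRenewsPerMinThreshold, self-preservation kicks in
            // and expired instances are no longer evicted.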
            if (!isLeaseExpirationEnabled()) {
                logger.debug("DS: lease expiration is currently disabled.");
                return;
            }
    
            // We collect first all expired items, to evict them in random order. For large eviction sets,
            // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
            // the impact should be evenly distributed across all applications.
    
            // 2. Collect the expired leases
            List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();

            // 2.1 Iterate over every lease in the registry
            for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
                Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
                if (leaseMap != null) {
                    for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                        Lease<InstanceInfo> lease = leaseEntry.getValue();
                        // 2.2 Check whether the lease has expired
                        if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                            // 2.3 Add the expired lease to the list
                            expiredLeases.add(lease);
                        }
                    }
                }
            }
    
            // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
            // triggering self-preservation. Without that we would wipe out full registry.
    
            // 3. Current registry size
            int registrySize = (int) getLocalRegistrySize();

            // 4. Registry size threshold: registry size * renewal percent threshold (default 0.85)
            int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());

            // 5. Eviction limit: current registry size - registry size threshold
            int evictionLimit = registrySize - registrySizeThreshold;

            // 6. Number of leases to evict: the smaller of the expired-lease count and the eviction limit
            int toEvict = Math.min(expiredLeases.size(), evictionLimit);
            if (toEvict > 0) {
                logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
                // 6.1 Create a random number generator seeded with the current time
                Random random = new Random(System.currentTimeMillis());
                for (int i = 0; i < toEvict; i++) {
                    // Pick a random item (Knuth shuffle algorithm)
                    int next = i + random.nextInt(expiredLeases.size() - i);
                    Collections.swap(expiredLeases, i, next);
                    Lease<InstanceInfo> lease = expiredLeases.get(i);

                    // 6.2 Get the appName and instanceId of the instance holding the expired lease
                    String appName = lease.getHolder().getAppName();
                    String id = lease.getHolder().getId();

                    // 6.3 Increment the expired-instance counter
                    EXPIRED.increment();
                    logger.warn("DS: Registry: expired lease for {}/{}", appName, id);

                    // 6.4 Call internalCancel to remove the registration, just as if the client had deregistered itself
                    internalCancel(appName, id, false);
                }
            }
        }
    }
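    The compensation-time calculation in getCompensationTimeMs() can be tried in isolation. The sketch below is a simplified stand-in, not Eureka's class: CompensationSketch is a hypothetical name, and the current time is passed in as a parameter instead of being read from System.nanoTime(), purely so the behaviour is repeatable.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class CompensationSketch {

    // Nanosecond timestamp of the previous run; 0 means "has not run yet"
    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0L);
    private final long evictionIntervalMs;

    CompensationSketch(long evictionIntervalMs) {
        this.evictionIntervalMs = evictionIntervalMs;
    }

    /** Returns how many milliseconds later than the configured interval this run started. */
    long compensationTimeMs(long currNanos) {
        long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        return Math.max(0L, elapsedMs - evictionIntervalMs);
    }

    public static void main(String[] args) {
        CompensationSketch sketch = new CompensationSketch(60_000L);
        long t = TimeUnit.MILLISECONDS.toNanos(1);
        System.out.println(sketch.compensationTimeMs(t)); // first run
        t += TimeUnit.SECONDS.toNanos(60);
        System.out.println(sketch.compensationTimeMs(t)); // on time
        t += TimeUnit.SECONDS.toNanos(75);
        System.out.println(sketch.compensationTimeMs(t)); // 15s late
    }
}
```

    A run that starts 75s after the previous one, with a 60s interval, yields 15000 ms of compensation. evict() passes that value on to isExpired(), so a lease is not treated as expired merely because the task itself ran late.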
    
    • Summary of the flow inside the eviction task
    [Figure: flow chart of the eviction task's internal execution]
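    The cap on evictions per run and the random selection can be sketched on their own. The code below follows the arithmetic of evict() but is only an illustration: EvictionLimitSketch, evictionLimit and pickToEvict are hypothetical names, and plain list elements stand in for leases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class EvictionLimitSketch {

    /** Maximum number of leases one run may evict: registrySize - (int) (registrySize * threshold). */
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    /** Picks min(expired.size(), evictionLimit) random elements using a partial Knuth shuffle. */
    static <T> List<T> pickToEvict(List<T> expired, int registrySize,
                                   double renewalPercentThreshold, Random random) {
        List<T> candidates = new ArrayList<>(expired); // copy so the caller's list is untouched
        int toEvict = Math.min(candidates.size(), evictionLimit(registrySize, renewalPercentThreshold));
        for (int i = 0; i < toEvict; i++) {
            // Swap a random element from the not-yet-chosen tail into position i
            int next = i + random.nextInt(candidates.size() - i);
            Collections.swap(candidates, i, next);
        }
        return new ArrayList<>(candidates.subList(0, toEvict));
    }

    public static void main(String[] args) {
        List<Integer> expired = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            expired.add(i);
        }
        List<Integer> chosen = pickToEvict(expired, 100, 0.85, new Random());
        System.out.println(chosen.size() + " of " + expired.size() + " expired leases evicted this run");
    }
}
```

    With 100 registered instances and the default factor of 0.85, the threshold is 85 and the limit is 15. If 40 leases have expired, a single run therefore evicts at most 15 of them, chosen at random, and the others remain candidates for later runs.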

    How the registry decides whether a lease has expired

    • The Lease class
    /**
    * Lease class: describes the time-based availability of a T (the registered InstanceInfo)
    */
    public class Lease<T> {
    
        // Enum describing lease actions (register, cancel, renew)
        enum Action {
            Register, Cancel, Renew
        };

        // Default lease duration: 90 seconds
        public static final int DEFAULT_DURATION_IN_SECS = 90;

        // The instance information held by the lease
        private T holder;

        // Eviction time
        private long evictionTimestamp;

        // Instance registration time
        private long registrationTimestamp;

        // Service up time
        private long serviceUpTimestamp;

        // Make it volatile so that the expiration task would see this quicker
        // Time of the last heartbeat renewal; volatile guarantees visibility across threads
        private volatile long lastUpdateTimestamp;

        // Lease duration in milliseconds
        private long duration;
    
        public Lease(T r, int durationInSecs) {
            holder = r;
            registrationTimestamp = System.currentTimeMillis();
            lastUpdateTimestamp = registrationTimestamp;
            duration = (durationInSecs * 1000);
    
        }
        
        /**
        * Cancels the lease by updating the eviction time:
        * evictionTimestamp is set to the current time if it has not been set yet.
        */
        public void cancel() {
            if (evictionTimestamp <= 0) {
                evictionTimestamp = System.currentTimeMillis();
            }
        }
        
        /**
        * Renews the lease: sets lastUpdateTimestamp to the current timestamp + the lease duration in milliseconds.
        *
        * Renew the lease, use renewal duration if it was specified by the
        * associated {@link T} during registration, otherwise default duration is
        * {@link #DEFAULT_DURATION_IN_SECS}.
        */
        public void renew() {
            lastUpdateTimestamp = System.currentTimeMillis() + duration;
    
        }
    
        /**
        * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
        */
        public boolean isExpired() {
            return isExpired(0l);
        }
    
        /**
        * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
        *
        * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
        * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
        * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
        * not be fixed.
        *
        * Note: because of the compensation time, that amount has to be added when checking for expiry.
        *
        *
        * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
        */
        public boolean isExpired(long additionalLeaseMs) {
            return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
        }
    }    
    
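    1. When a client sends a heartbeat renewal, Lease.renew() is invoked and sets lastUpdateTimestamp to the current timestamp + the lease duration.

    2. A lease counts as expired when either of these holds:

    the eviction time (evictionTimestamp) is greater than 0, meaning Lease.cancel() has been called; or
    the current timestamp is greater than the last update time + the lease duration + the compensation time.

    The real eviction timeout is not the default 90s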

    /**
     * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
     *
     * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
     * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
     * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
     * not be fixed.
     *
     * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
     */
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }
    

    the expiry will actually be 2 * duration. This is a minor bug and should only affect instances that ungracefully shutdown.
    Due to possible wide ranging impact to existing usage, this will not be fixed

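    The effective expiry period is actually 2 * duration.

    The method's Javadoc calls this a minor bug that only affects instances that shut down ungracefully (clients that did not send a cancel request before the application stopped). Because a fix could have a wide-ranging impact on existing usage, the maintainers state that it will not be fixed.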
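    The 2 * duration effect is easy to reproduce with a stripped-down lease. The sketch below keeps only lastUpdateTimestamp and duration and takes the current time as a parameter so the result is repeatable. LeaseSketch is a hypothetical class written for this example, not Eureka's Lease.

```java
class LeaseSketch {

    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseSketch(long registrationTimeMs, int durationInSecs) {
        this.durationMs = durationInSecs * 1000L;
        // As in the constructor quoted earlier: registration time, without adding the duration
        this.lastUpdateTimestamp = registrationTimeMs;
    }

    /** Same arithmetic as renew(): stores now + duration rather than just now. */
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    /** Same comparison as isExpired(additionalLeaseMs), leaving out the evictionTimestamp check. */
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch(0L, 90);
        lease.renew(1_000L); // last heartbeat at t = 1s
        System.out.println(lease.isExpired(1_000L + 90_000L, 0L));      // 90s after the heartbeat
        System.out.println(lease.isExpired(1_000L + 180_000L + 1L, 0L)); // just over 180s after it
    }
}
```

    Under these assumptions a lease that has been renewed at least once only expires more than 180s after its last heartbeat. A lease that was registered but never renewed still expires after 90s, because the constructor does not add the duration.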

    Eviction action

    • internalCancel(appName, id, false)
    /**
     * {@link #cancel(String, String, boolean)} method is overridden by {@link PeerAwareInstanceRegistry}, so each
     * cancel request is replicated to the peers. This is however not desired for expires which would be counted
     * in the remote peers as valid cancellations, so self preservation mode would not kick-in.
     */
    protected boolean internalCancel(String appName, String id, boolean isReplication) {
        try {
            // 1. Acquire the read lock
            read.lock();

            // 2. Increment the cancel counter
            CANCEL.increment(isReplication);

            // 3. Look up the per-application map for appName
            Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
            Lease<InstanceInfo> leaseToCancel = null;
            if (gMap != null) {

                // 3.1 Remove this instance's lease from the map and keep the removed lease
                leaseToCancel = gMap.remove(id);
            }

            // 4. Under synchronization, record the instance in recentCanceledQueue
            synchronized (recentCanceledQueue) {
                recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
            }
    
            InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
            if (instanceStatus != null) {
                logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
            }
            if (leaseToCancel == null) {
                CANCEL_NOT_FOUND.increment(isReplication);
                logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
                return false;
            } else {
    
                // 5. Call cancel() on the lease, which sets its evictionTimestamp to the current timestamp
                leaseToCancel.cancel();

                // 6. Get the instance information held by the lease
                InstanceInfo instanceInfo = leaseToCancel.getHolder();
                String vip = null;
                String svip = null;
                if (instanceInfo != null) {
                    instanceInfo.setActionType(ActionType.DELETED);
                    recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
                    instanceInfo.setLastUpdatedTimestamp();
                    vip = instanceInfo.getVIPAddress();
                    svip = instanceInfo.getSecureVipAddress();
                }
                // 7. Invalidate the Guava cache entries for this instance
                invalidateCache(appName, vip, svip);
                logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
                return true;
            }
        } finally {
            // 8. Release the read lock
            read.unlock();
        }
    }
    

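    In brief:

    1. Look up the instance's per-application map in the registry map and remove the instance from it.

    2. Call cancel() on the removed lease, which sets evictionTimestamp to the current timestamp and so records when the instance was evicted.

    3. Invalidate the responseCache entries for the instance, so other clients no longer see the expired instance when they fetch the registry.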
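    The three steps can be mimicked with a nested map and a stand-in cache. Everything in this sketch (CancelSketch, its Lease class, the responseCache map keyed by application name) is a hypothetical simplification for illustration. The real method also updates counters, the recent-change queues and the overridden-status map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CancelSketch {

    static final class Lease {
        final String instanceId;
        long evictionTimestamp;

        Lease(String instanceId) {
            this.instanceId = instanceId;
        }

        void cancel(long nowMs) {
            if (evictionTimestamp <= 0) {
                evictionTimestamp = nowMs;
            }
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // appName -> (instanceId -> lease), the same two-level shape as the registry above
    final Map<String, Map<String, Lease>> registry = new ConcurrentHashMap<>();
    // Stand-in for the response cache: appName -> cached payload
    final Map<String, String> responseCache = new ConcurrentHashMap<>();

    void register(String appName, String id) {
        registry.computeIfAbsent(appName, k -> new ConcurrentHashMap<>()).put(id, new Lease(id));
        responseCache.put(appName, "cached payload for " + appName);
    }

    boolean internalCancel(String appName, String id, long nowMs) {
        lock.readLock().lock();
        try {
            // Step 1: remove the lease from the per-application map
            Map<String, Lease> leases = registry.get(appName);
            Lease removed = (leases == null) ? null : leases.remove(id);
            if (removed == null) {
                return false; // nothing registered under this id
            }
            // Step 2: record when the lease was cancelled
            removed.cancel(nowMs);
            // Step 3: drop the cached response so later fetches are rebuilt without the instance
            responseCache.remove(appName);
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

    Calling internalCancel a second time for the same instance returns false, which corresponds to the "cancel failed because Lease is not registered" branch in the original method.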


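    Closing thoughts

    To finish, here are a few questions for myself and for readers. If you can answer them, you understand how eviction works in the Eureka Server.

    1: What is eviction, and why is an eviction task needed?

    2: How often does the eviction task run by default, and which parameter customizes it?

    3: What is self-preservation, why is it needed, and under what conditions does the server trigger it?

    4: Is the real eviction timeout for an instance 90s by default? Why or why not?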
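    For question 3, the threshold arithmetic described in the comments of evict() can be checked with a few lines. This is a sketch under the assumption that the formula is exactly as those comments state. SelfPreservationSketch and its method signatures are hypothetical and only mirror the names used in the comments.

```java
class SelfPreservationSketch {

    /** Expected minimum renewals per minute: clients * (60.0 / renewal interval in seconds) * factor. */
    static int renewsPerMinThreshold(int expectedClients, int renewalIntervalSeconds,
                                     double renewalPercentThreshold) {
        return (int) (expectedClients * (60.0 / renewalIntervalSeconds) * renewalPercentThreshold);
    }

    /** Eviction is allowed when self-preservation is off, or on but not triggered. */
    static boolean isLeaseExpirationEnabled(boolean selfPreservationEnabled,
                                            long renewsInLastMin, int threshold) {
        if (!selfPreservationEnabled) {
            return true;
        }
        return threshold > 0 && renewsInLastMin > threshold;
    }

    public static void main(String[] args) {
        // 10 clients, one heartbeat every 30s, factor 0.85
        int threshold = renewsPerMinThreshold(10, 30, 0.85);
        System.out.println("threshold = " + threshold);
        System.out.println("evict with 18 renewals/min: " + isLeaseExpirationEnabled(true, 18, threshold));
        System.out.println("evict with 17 renewals/min: " + isLeaseExpirationEnabled(true, 17, threshold));
    }
}
```

    With 10 clients renewing every 30s, 20 renewals per minute are expected and the threshold is 17. Once the server sees 17 or fewer renewals in a minute, eviction stops until renewals recover.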
