A Global Outage Caused by a Poorly Configured Config Center Health Check

Author: Shaman | Published: 2019-03-29 20:44


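Some time ago we ran into an incident. The git data source that the config center service depends on became unreachable, and the health check timeout configured in the K8s deployment was short (had the timeout been set to 10s, this failure would not have been triggered), so the config center's health check failed. By default the gateway has a hard dependency on the config center, so the gateway's own health check endpoint failed as well. From the load balancer's point of view the gateway was therefore unavailable too, and the whole service went down.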
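The timeout in question is the probe's `timeoutSeconds` field, which Kubernetes defaults to 1 second. The fragment below is only a sketch of what a more tolerant probe might look like. The Deployment name, image, port, and the `/actuator/health` path are assumptions for illustration and are not taken from the incident; only the 10s value comes from the description above.

```yaml
# Hypothetical excerpt of the config center's Deployment.
# Names, image, port and probe path are assumptions, not the real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: config-center
spec:
  replicas: 2
  selector:
    matchLabels:
      app: config-center
  template:
    metadata:
      labels:
        app: config-center
    spec:
      containers:
        - name: config-center
          image: example/config-center:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health   # assumed health endpoint
              port: 8080
            periodSeconds: 10          # how often the probe runs
            timeoutSeconds: 10         # Kubernetes default is 1s
            failureThreshold: 3        # consecutive failures before the pod is marked unready
```

A longer timeout only gives a slow health endpoint more room to answer; it does not remove the dependency chain itself.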

To make the service highly available, we will make the following two improvements:

Remove the gateway's hard dependency on the config center
Remove the config center's hard dependency on the git service
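As a rough sketch of what the two items could look like in configuration, assuming the gateway is a Spring Cloud Config client and the config center is a Spring Cloud Config Server with a git backend: the `uri` values below are hypothetical, and the server-side property names should be verified against the reference documentation for the version in use before relying on them.

```yaml
# bootstrap.yml of the gateway (sketch; the uri is a hypothetical address)
spring:
  cloud:
    config:
      uri: http://config-center:8080
      fail-fast: false   # do not abort startup when the config center is unreachable
health:
  config:
    enabled: false       # keep the config center out of the gateway's health result
```

```yaml
# application.yml of the config center (sketch; verify property names for your version)
spring:
  cloud:
    config:
      server:
        git:
          uri: https://git.example.com/team/config-repo.git   # hypothetical repository
          timeout: 5     # seconds to wait when connecting to the git remote
        health:
          enabled: false # assumed switch for the server's repository health indicator
```

With both indicators disabled, a git outage or a config center outage no longer fails the health endpoint that the probe and the load balancer look at. The trade-off is that these failures then have to be caught by separate monitoring.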

disable config client health indicator

https://github.com/spring-cloud/spring-cloud-config/issues/435

The Config Client supplies a Spring Boot Health Indicator that attempts to load configuration from Config Server. The health indicator can be disabled by setting health.config.enabled=false. The response is also cached for performance reasons. The default cache time to live is 5 minutes. To change that value set the health.config.time-to-live property (in milliseconds).

management.health.hystrix.enabled: false

health.config.enabled: false

Of the two properties above, the first follows the Spring Boot configuration convention (https://docs.spring.io/spring-boot/docs/current/reference/html/common-application-properties.html); the second is the Spring Cloud one.

Reflection