Service Discovery for System Monitoring with Consul + Prometheus
This article introduces Consul and its integration with Prometheus for unified configuration management.
1. Consul
Consul is a tool for service discovery and registration; it is distributed and highly scalable.
What is service discovery? Compare the two diagrams below:

In Figure 1.1, a single client endpoint needs to call services A through N. The client must know the network location of every service; as services multiply, the configuration becomes unwieldy, operations staff struggle to keep up, and mistakes become hard to avoid.

In Figure 1.2, a service-discovery module has been added. It periodically polls the services to see whether they are reachable (this is the health check). When the client wants to call services A through N, it first asks the discovery module for their network locations and then invokes them. The client no longer needs to record any service addresses at all, which decouples clients from servers.
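The discovery module in Figure 1.2 can be sketched as a toy in-memory registry (the function names and registry structure here are invented for illustration; real Consul adds distributed consensus, gossip, persistence, and much more):

```python
import socket

# A toy registry: service name -> (host, port).
# The name and port mirror the article's later example.
registry = {
    "shiji": ("127.0.0.1", 8080),
}

def healthy(host, port, timeout=1.0):
    """TCP health check: can we open a connection to host:port in time?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def discover(name):
    """A client asks the registry instead of hard-coding addresses."""
    host, port = registry[name]
    if not healthy(host, port):
        raise RuntimeError(f"service {name!r} failed its health check")
    return host, port
```

A client calling `discover("shiji")` gets a currently-healthy address, which is exactly the decoupling the diagram illustrates.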
Basic Usage of Consul
Installation
Official download page: https://www.consul.io/downloads.html
My machine runs Windows, so I installed the service locally. After downloading the archive, extract it anywhere you like; it contains a single executable. Add its directory to the PATH environment variable, then open CMD and type consul. If you see help output like the following, the installation and configuration succeeded.
C:\Users\Felix>consul
Usage: consul [--version] [--help] <command> [<args>]
Available commands are:
agent Runs a Consul agent
catalog Interact with the catalog
event Fire a new event
exec Executes a command on Consul nodes
force-leave Forces a member of the cluster to enter the "left" state
info Provides debugging information for operators.
join Tell Consul agent to join cluster
keygen Generates a new encryption key
keyring Manages gossip layer encryption keys
kv Interact with the key-value store
leave Gracefully leaves the Consul cluster and shuts down
lock Execute a command holding a lock
maint Controls node or service maintenance mode
members Lists the members of a Consul cluster
monitor Stream logs from a Consul agent
operator Provides cluster-level tools for Consul operators
reload Triggers the agent to reload configuration files
rtt Estimates network round trip time between nodes
snapshot Saves, restores and inspects snapshots of Consul server state
validate Validate config files/directories
version Prints the Consul version
watch Watch for changes in Consul
Running the Agent
After installing Consul, you must run an agent. The agent can run in server or client mode. Every datacenter must have at least one server, and 3 or 5 servers per cluster are recommended: with a single server, data loss is unavoidable when it fails.
All other agents run in client mode. A client is a very lightweight process that registers services, runs health checks, and forwards queries to the servers. An agent must run on every host in the cluster.
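The "3 or 5 servers" advice follows from the consensus (Raft) quorum rule: a cluster of N servers needs a strict majority, floor(N/2)+1, to agree, so it can lose N minus quorum servers and keep operating. A quick sketch of the arithmetic (not from the original post):

```python
def quorum(n: int) -> int:
    # Raft requires a strict majority of the n servers to agree.
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    # How many servers may fail while a majority still survives.
    return n - quorum(n)

# 4 servers tolerate no more failures than 3, and a single server
# tolerates none -- hence the recommendation of 3 or 5 servers.
for n in (1, 2, 3, 4, 5):
    print(f"{n} server(s): quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

A single server has a quorum of 1 and tolerates zero failures, matching the data-loss warning above.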
Starting the Agent
To keep things simple, we will start the Consul agent in development mode. This mode brings up a single-node Consul quickly and easily. It must not be used in production, because it does not persist any state.
C:\Users\Felix>consul agent -dev
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.6'
Node ID: '5af803ed-e3dd-2d36-ec64-c7c515445e5b'
Node name: 'felix'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2018/04/03 15:34:31 [DEBUG] Using random ID "5af803ed-e3dd-2d36-ec64-c7c515445e5b" as node ID
2018/04/03 15:34:31 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:5af803ed-e3dd-2d36-ec64-c7c515445e5b Address:127.0.0.1:8300}]
2018/04/03 15:34:31 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2018/04/03 15:34:31 [INFO] serf: EventMemberJoin: felix.dc1 127.0.0.1
2018/04/03 15:34:31 [INFO] serf: EventMemberJoin: felix 127.0.0.1
2018/04/03 15:34:31 [INFO] consul: Adding LAN server felix (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2018/04/03 15:34:31 [INFO] consul: Handled member-join event for server "felix.dc1" in area "wan"
2018/04/03 15:34:31 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2018/04/03 15:34:31 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2018/04/03 15:34:31 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2018/04/03 15:34:31 [INFO] agent: started state syncer
2018/04/03 15:34:31 [WARN] raft: Heartbeat timeout from "" reached, starting election
2018/04/03 15:34:31 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2018/04/03 15:34:31 [DEBUG] raft: Votes needed: 1
2018/04/03 15:34:31 [DEBUG] raft: Vote granted from 5af803ed-e3dd-2d36-ec64-c7c515445e5b in term 2. Tally: 1
2018/04/03 15:34:31 [INFO] raft: Election won. Tally: 1
2018/04/03 15:34:31 [INFO] raft: Node at 127.0.0.1:8300 [Leader] entering Leader state
2018/04/03 15:34:31 [INFO] consul: cluster leadership acquired
2018/04/03 15:34:31 [INFO] consul: New leader elected: felix
2018/04/03 15:34:31 [DEBUG] consul: Skipping self join check for "felix" since the cluster is too small
2018/04/03 15:34:31 [INFO] consul: member 'felix' joined, marking health alive
2018/04/03 15:34:31 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
2018/04/03 15:34:31 [INFO] agent: Synced node info
2018/04/03 15:34:31 [DEBUG] agent: Node info in sync
As you can see, the Consul agent starts up and streams log data. The logs show that our agent is running in server mode and has claimed cluster leadership. In addition, the local member has been marked as a healthy member of the cluster.
Stopping the Agent
You can shut the agent down gracefully with Ctrl-C. After interrupting it, you can watch it leave the cluster and shut down.
On exit, Consul notifies the other cluster members that this node has left. If you kill the process instead, the other members detect that the node has failed. When a member leaves, its services and checks are removed from the catalog. When a member fails, its health is simply marked critical, but it is not removed from the catalog. Consul automatically tries to reconnect to failed nodes, allowing them to recover from certain network conditions; left nodes are not contacted again.
Moreover, if the agent is running as a server, a graceful leave is important to avoid potential availability outages affecting the consensus protocol.
Defining a Service
We create a directory and drop a configuration file into it; mine looks like this:
{
  "service": {
    "name": "shiji",
    "tags": ["master"],
    "address": "127.0.0.1",
    "port": 8080,
    "enableTagOverride": false,
    "check": {
      "id": "shiji",
      "name": "shiji on port 8080",
      "tcp": "localhost:8080",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}
After that, we start the agent pointing at the config directory: consul agent -dev -config-dir=F:\Dwork\tools\consul\consul.d
C:\Users\Felix>consul agent -dev -config-dir=F:\Dwork\tools\consul\consul.d
==> Starting Consul agent...
==> Consul agent running!
Version: 'v1.0.6'
Node ID: '64c6149a-389a-2f97-fe81-b80813c1bc86'
Node name: 'felix'
Datacenter: 'dc1' (Segment: '<all>')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2018/04/03 15:56:24 [DEBUG] Using random ID "64c6149a-389a-2f97-fe81-b80813c1bc86" as node ID
2018/04/03 15:56:24 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:64c6149a-389a-2f97-fe81-b80813c1bc86 Address:127.0.0.1:8300}]
2018/04/03 15:56:24 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
2018/04/03 15:56:24 [INFO] serf: EventMemberJoin: felix.dc1 127.0.0.1
2018/04/03 15:56:24 [INFO] serf: EventMemberJoin: felix 127.0.0.1
2018/04/03 15:56:24 [INFO] consul: Adding LAN server felix (Addr: tcp/127.0.0.1:8300) (DC: dc1)
2018/04/03 15:56:24 [INFO] consul: Handled member-join event for server "felix.dc1" in area "wan"
2018/04/03 15:56:24 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2018/04/03 15:56:24 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2018/04/03 15:56:24 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2018/04/03 15:56:24 [INFO] agent: started state syncer
2018/04/03 15:56:24 [WARN] raft: Heartbeat timeout from "" reached, starting election
2018/04/03 15:56:24 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2018/04/03 15:56:24 [DEBUG] raft: Votes needed: 1
2018/04/03 15:56:24 [DEBUG] raft: Vote granted from 64c6149a-389a-2f97-fe81-b80813c1bc86 in term 2. Tally: 1
2018/04/03 15:56:24 [INFO] raft: Election won. Tally: 1
2018/04/03 15:56:24 [INFO] raft: Node at 127.0.0.1:8300 [Leader] entering Leader state
2018/04/03 15:56:24 [INFO] consul: cluster leadership acquired
2018/04/03 15:56:24 [INFO] consul: New leader elected: felix
2018/04/03 15:56:24 [DEBUG] consul: Skipping self join check for "felix" since the cluster is too small
2018/04/03 15:56:24 [INFO] consul: member 'felix' joined, marking health alive
2018/04/03 15:56:25 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
2018/04/03 15:56:25 [INFO] agent: Synced service "shiji"
2018/04/03 15:56:25 [DEBUG] agent: Check "shiji" in sync
2018/04/03 15:56:25 [DEBUG] agent: Node info in sync
2018/04/03 15:56:25 [DEBUG] agent: Check "shiji" is passing
2018/04/03 15:56:26 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
2018/04/03 15:56:26 [DEBUG] agent: Service "shiji" in sync
2018/04/03 15:56:26 [INFO] agent: Synced check "shiji"
2018/04/03 15:56:26 [DEBUG] agent: Node info in sync
2018/04/03 15:56:26 [DEBUG] agent: Service "shiji" in sync
2018/04/03 15:56:26 [DEBUG] agent: Check "shiji" in sync
2018/04/03 15:56:26 [DEBUG] agent: Node info in sync
2018/04/03 15:56:35 [DEBUG] agent: Check "shiji" is passing
2018/04/03 15:56:46 [WARN] agent: Check "shiji" socket connection failed: dial tcp [::1]:8080: i/o timeout
2018/04/03 15:56:46 [DEBUG] agent: Service "shiji" in sync
2018/04/03 15:56:46 [INFO] agent: Synced check "shiji"
2018/04/03 15:56:46 [DEBUG] agent: Node info in sync
As the logs show, Consul runs the health check every 10 seconds, as configured. I stopped the service on port 8080 between rounds of checks, so the next check returned a timeout.
Likewise, open localhost:8500 in a browser (the port Consul occupies by default; no login required) to see the web UI, where some configuration can also be done.

As for registering and removing services: besides editing the config files and then running consul reload, this can also be done via DNS, the HTTP API, or client libraries. Of course, clustering, richer health checks, and the KV store all deserve deeper study. Here is Consul's official architecture diagram:

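Registration over the HTTP API, for example, boils down to a PUT against the local agent. A sketch in Python (the /v1/agent/service/register and /v1/agent/service/deregister/<id> endpoints are Consul's; the helper functions are mine), mirroring the file-based definition used earlier:

```python
import json
from urllib.request import Request, urlopen

CONSUL = "http://127.0.0.1:8500"  # default local agent HTTP address

def registration_payload(name, address, port):
    # JSON body for PUT /v1/agent/service/register, mirroring the
    # file-based "shiji" definition, including its TCP health check.
    return {
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            "TCP": f"{address}:{port}",
            "Interval": "10s",
            "Timeout": "1s",
        },
    }

def register(payload, consul=CONSUL):
    req = Request(
        f"{consul}/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urlopen(req).status  # 200 on success

def deregister(service_id, consul=CONSUL):
    req = Request(f"{consul}/v1/agent/service/deregister/{service_id}", method="PUT")
    return urlopen(req).status
```

Against a running agent, register(registration_payload("shiji", "127.0.0.1", 8080)) would add the service without touching any config file.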
2. Consul and Prometheus
First, a diagram found online:

GPE is that author's shorthand for Grafana + Prometheus + Exporter; here I mainly cover integrating Prometheus with Consul.
Prometheus can handle system monitoring, alerting, and so on, but with a large number of services we would have to maintain a huge list of scrape targets, and whenever business needs change or something unexpected happens we may have to remote into the Prometheus machine, edit its configuration, and restart the service, which is tedious and time-consuming. Instead, we can use Consul for unified configuration management, letting Consul handle service discovery, health checking, and adding or removing services dynamically.
Because a new node.js service is easy to deploy, I chose node.js for testing: first deploy a service and install the Prometheus client library, so that it can return data in the format Prometheus expects.
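The post's test service is node.js with the Prometheus client library, but the essential contract is just an HTTP endpoint returning plain text in the Prometheus exposition format. A minimal stand-in using only the Python standard library (the metric name app_uptime_seconds is invented for illustration):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus exposition format: HELP/TYPE comments, then samples.
        body = (
            "# HELP app_uptime_seconds Seconds since the process started.\n"
            "# TYPE app_uptime_seconds gauge\n"
            f"app_uptime_seconds {time.time() - START:.3f}\n"
        )
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port=0):
    """Start the /metrics endpoint in a background thread."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Whatever language the service is written in, Prometheus only cares that scraping /metrics yields text like the body above.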
Consul config file:
{
  "service": {
    "name": "shiji",
    "tags": ["master"],
    "address": "127.0.0.1",
    "port": 8080,
    "enableTagOverride": false,
    "check": {
      "id": "shiji",
      "name": "shiji on port 8080",
      "tcp": "localhost:8080",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}
Prometheus scrape configuration:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'consul-prometheus'
    consul_sd_configs:
      # Consul address
      - server: '127.0.0.1:8500'
        services: []
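With `services: []`, Prometheus discovers every service registered in Consul, including Consul's own built-in service. If that is too broad, one common refinement (not part of the original setup) is to keep only targets carrying a given tag, such as the `master` tag from the service definition above, via relabeling on the `__meta_consul_tags` meta label:

```yaml
- job_name: 'consul-prometheus'
  consul_sd_configs:
    - server: '127.0.0.1:8500'
      services: []
  relabel_configs:
    # __meta_consul_tags holds the service tags joined by commas,
    # e.g. ",master,". Keep only targets tagged "master".
    - source_labels: ['__meta_consul_tags']
      regex: '.*,master,.*'
      action: keep
```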
As the config file shows, Prometheus is pointed at Consul and will find the registered services automatically. As shown:

The services configured in Consul appear under Targets in Prometheus. The first is a nonexistent service I casually added via the HTTP API; the second is the node.js service mentioned earlier; the third is the port Consul itself occupies. That port is used by server nodes: clients call the servers over RPC on it, and servers call one another on it as well. It is advisable to reserve this port on Consul clients too, in case a client is ever promoted to a server.
References:
https://www.consul.io/intro/
https://consul.docs.apiary.io/
https://www.jianshu.com/p/f8746b81d65d
https://blog.52itstyle.com/archives/2071/
I wanted to write up and share some notes. Reposting is welcome; please credit the source.
Jianshu - 板凳儿儿
https://www.jianshu.com/p/242c25332374