[Practice] 2. Prometheus Commands and Configuration Explained

Author: 笔名辉哥 | Published: 2021-03-27 15:20

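1. Summary

Prometheus is configured in two ways:
(1) Command-line flags, for settings that cannot change while the process runs. These are mainly runtime parameters, such as where data is stored.
(2) A configuration file, for application settings such as scraping and Alertmanager integration.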

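There are also two ways to apply a changed configuration without restarting the process:
(1) Send the process a SIGHUP signal.
(2) Send an HTTP POST request to /-/reload, which requires the --web.enable-lifecycle flag: curl -X POST http://192.168.66.112:9091/-/reload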

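The configuration file is written in YAML. Some notes:
Files ending in .yml or .yaml are both YAML files.
YAML converts to and from JSON easily.
Python, Go, Java, and PHP all have YAML libraries, so the same file can be parsed from any of these languages, and the format expresses nested structure compactly.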

2. Command line

Run prometheus -h to list the available command-line flags.

-h, --help                     Show context-sensitive help (also try --help-long and --help-man).
    --version                  Show application version.
    --config.file="prometheus.yml"
                               Prometheus configuration file path.
    --web.listen-address="0.0.0.0:9090"
                               Address to listen on for UI, API, and telemetry.
    --web.read-timeout=5m      Maximum duration before timing out read of the request, and closing idle
                               connections.
    --web.max-connections=512  Maximum number of simultaneous connections.
    --web.route-prefix=<path>  Prefix for the internal routes of web endpoints. Defaults to path of
                               --web.external-url.
    --web.user-assets=<path>   Path to static asset directory, available at /user.
    --web.enable-lifecycle     Enable shutdown and reload via HTTP request.
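
As a rough illustration of how these flags are passed in practice, the sketch below starts Prometheus under Docker Compose. The Compose file itself, the prom/prometheus image tag, and the mounted host path are assumptions made for this example and do not come from the help output above; only the flags do.

```yaml
# docker-compose.yml (hypothetical file; adjust paths to your host)
services:
  prometheus:
    image: prom/prometheus:latest        # assumed image; pin a version in practice
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.listen-address=0.0.0.0:9090
      - --web.enable-lifecycle           # enables reload and shutdown over HTTP
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro   # assumed host path
```

Because flags are fixed for the lifetime of the process, changing any of them means recreating the container, whereas edits to prometheus.yml can be applied with a reload.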

3. Configuration file

The configuration file is in YAML. Its top-level keys are listed below; the # comments explain each one.

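# Global settings (a per-job setting overrides the global value)
global:

# Alerting settings. This is where the Alertmanager instances are defined.
alerting:

# Alerting rule files. Prometheus loads them and evaluates the custom alerting rules; notification channels and routing are handled by Alertmanager.
rule_files:

# Scrape settings: the data sources, grouped by job_name, each with its targets. Targets come from static configuration or service discovery.
scrape_configs:

# Remote write settings
remote_write:

# Remote read settings
remote_read: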

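Common value formats in the configuration file
<boolean>: a boolean, either true or false
<scheme>: the protocol, either http or https

The original configuration file as shipped: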

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
 
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
 
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
 
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.
  - job_name: 'prometheus'
 
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
 
    static_configs:
    - targets: ['localhost:9090']

3.1 The global field

scrape_interval

The global default interval between scrapes.

[ scrape_interval: <duration> | default = 1m ]

scrape_timeout

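The global default timeout for a single scrape. If scrapes fail with "context deadline exceeded", set this field under the specific job that needs a longer timeout.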

[ scrape_timeout: <duration> | default = 10s ]

evaluation_interval

The global default interval at which rules (mainly alerting rules) are evaluated.

[ evaluation_interval: <duration> | default = 1m ]

external_labels

Labels that this server attaches to its data when it communicates with external systems.

[ <labelname>: <labelvalue> ... ]
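
Putting the four fields together, a global block might look like the sketch below. The interval values and both label values are hypothetical, chosen only to show the shape.

```yaml
global:
  scrape_interval: 30s       # scrape every target every 30 seconds
  scrape_timeout: 10s        # must not be longer than scrape_interval
  evaluation_interval: 30s   # evaluate rules every 30 seconds
  external_labels:
    cluster: demo-cluster    # hypothetical label values
    replica: prometheus-0
```

Any job that sets its own scrape_interval or scrape_timeout overrides these values for that job only.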

3.2 The alerting field

This field configures the integration with Alertmanager.
Example:

alerting:
  alert_relabel_configs: # rules that rewrite alert labels before alerts are sent
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc1
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093'] # single instance
        #- targets: ['172.31.10.167:19093','172.31.10.167:29093','172.31.10.167:39093'] # cluster
# Scraping Alertmanager's own metrics is a scrape job, so it belongs under scrape_configs
scrape_configs:
  - job_name: 'Alertmanager'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:19093']

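In the configuration above, alert_relabel_configs applies relabeling to alerts before they are sent to Alertmanager. It has the same configuration format and actions as target relabeling, and it is applied after external labels have been attached. It is mainly useful in clustered setups.

One use is to make sure that an HA pair of Prometheus servers with different external labels sends identical alerts.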
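
A minimal sketch of that HA use case follows, assuming each server in the pair is distinguished only by a replica external label. The label names and values are hypothetical; labeldrop is one of the standard relabel actions.

```yaml
global:
  external_labels:
    cluster: demo-cluster   # identical on both servers (hypothetical value)
    replica: prometheus-0   # would be prometheus-1 on the other server
alerting:
  alert_relabel_configs:
    - action: labeldrop     # remove the label that differs between the pair
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
```

With the replica label removed, both servers emit alerts with identical label sets, so Alertmanager can deduplicate them.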

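Alertmanager instances can be configured statically through static_configs or discovered dynamically through one of the supported service discovery mechanisms. The configuration above is a static single instance.

In addition, relabel_configs lets you select Alertmanagers from the discovered entities and make advanced modifications to the API path used, which is exposed through the __alerts_path__ label.

After completing this configuration, restart Prometheus to load it, or apply it with a hot reload. Then open http://192.168.1.220:19090/alerts in a browser to see the three alert states: inactive, pending, and firing. No alerts are listed yet because no alerting rules have been configured.

This section defines the Alertmanager instance that is integrated with Prometheus for alerting. Alertmanager's own setup, configuration reference, notification channels, and routing rules will be covered separately.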

3.2.1 alert_relabel_configs

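This setting uses the same format as relabel_configs under scrape_configs. It filters and rewrites alerts before they are sent to Alertmanager.

Note
relabel_configs lets you choose which targets to scrape and which labels those targets carry. If you want to scrape one kind of server but not another, use relabel_configs.

By contrast, metric_relabel_configs is applied after the scrape but before the data is written to storage. Use it to filter out metrics you do not want, or to rewrite metrics that come from the scrape itself (for example, from the /metrics page).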

3.2.2 alertmanagers

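This section configures the Alertmanager instances and the connection parameters Prometheus uses to reach them. Alertmanagers can be configured statically or found through service discovery. Prometheus pushes alerts to Alertmanager.

An Alertmanager entry is configured much like a scrape target. The available fields are: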

[ timeout: <duration> | default = 10s ]
[ path_prefix: <path> | default = / ]
[ scheme: <scheme> | default = http ]
basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
  [ <tls_config> ]
[ proxy_url: <string> ]
azure_sd_configs:
  [ - <azure_sd_config> ... ]
consul_sd_configs:
  [ - <consul_sd_config> ... ]
dns_sd_configs:
  [ - <dns_sd_config> ... ]
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]
file_sd_configs:
  [ - <file_sd_config> ... ]
gce_sd_configs:
  [ - <gce_sd_config> ... ]
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]
triton_sd_configs:
  [ - <triton_sd_config> ... ]
static_configs:
  [ - <static_config> ... ]
relabel_configs:
  [ - <relabel_config> ... ]
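
To show how several of these fields combine, here is a sketch of an Alertmanager entry reached over HTTPS and discovered from files. Every path, the path prefix, and the env label are assumptions made for illustration.

```yaml
alerting:
  alertmanagers:
    - scheme: https
      timeout: 10s
      path_prefix: /alertmanager                     # hypothetical prefix behind a reverse proxy
      tls_config:
        ca_file: /etc/prometheus/certs/ca.crt        # hypothetical path
      file_sd_configs:
        - files:
            - /etc/prometheus/alertmanagers/*.json   # hypothetical path
      relabel_configs:
        - source_labels: [env]                       # assumes the discovery files set an "env" label
          regex: production
          action: keep                               # use only production Alertmanagers
```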

3.3 rule_files

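This section lists the files that hold alerting rules, which define the metric conditions that raise an alert (similar to a trigger). Prometheus loads the rule files at startup and on every configuration reload, then evaluates the rules at the global evaluation_interval. After editing a rule file, trigger a reload for the change to take effect. Notification channels and routing are handled by Alertmanager.
Example: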

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

Example "first_rules.yml":

groups:
 - name: test-rules
   rules:
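   - alert: InstanceDown # alert name
     expr: up == 0 # alert condition, written as a PromQL expression
     for: 10s # how long the condition must hold before the alert fires
     labels: # labels attached to the alert
      severity: error
     annotations: # annotations that describe the alert in detail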
      summary: "{{$labels.instance}}: has been down"
      description: "{{$labels.instance}}: job {{$labels.job}} has been down "

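Prometheus supports two types of rules, which it evaluates at regular intervals once configured: recording rules and alerting rules. Both are evaluated at the same frequency, set by evaluation_interval in the global configuration.

Rule groups (rule_group)

Both recording rules and alerting rules must belong to a group.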

groups:
  
  - name: example
    # rules in this group
    rules:
      [ - <rule> ... ]
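
As a concrete instance of the schema above, the sketch below puts one recording rule and one alerting rule in the same group. The rule names and thresholds are hypothetical. It also uses the optional per-group interval field, which overrides the global evaluation_interval for that group.

```yaml
groups:
  - name: example
    interval: 30s                 # optional; defaults to global evaluation_interval
    rules:
      - record: job:up:avg        # hypothetical recording rule name
        expr: avg by (job) (up)
      - alert: JobMostlyDown      # hypothetical alert name
        expr: job:up:avg < 0.5
        for: 5m
        labels:
          severity: warning
```

Rules inside a group run in order, so the alerting rule can rely on the series recorded just before it.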

alerting rules

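To use rules, create a file containing the rule statements and have Prometheus load it through the rule_files field of its configuration, as described above. Alerting and recording rules share the same syntax, except that a recording rule names its output series with the record: <string> field.

Example configuration: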

- alert: ServiceDown
  expr: avg_over_time(up[5m]) * 100 < 50
  annotations:
      description: The service {{ $labels.job }} instance {{ $labels.instance }} is
        not responding for more than 50% of the time for 5 minutes.
      summary: The service {{ $labels.job }} is not responding
- alert: RedisDown
  expr: avg_over_time(redis_up[5m]) * 100 < 50
  annotations:
      description: The Redis service {{ $labels.job }} instance {{ $labels.instance
        }} is not responding for more than 50% of the time for 5 minutes.
      summary: The Redis service {{ $labels.job }} is not responding
- alert: PostgresDown
  expr: avg_over_time(pg_up[5m]) * 100 < 50
  annotations:
      description: The Postgres service {{ $labels.job }} instance {{ $labels.instance
        }} is not responding for more than 50% of the time for 5 minutes.
      summary: The Postgres service {{ $labels.job }} is not responding

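Defining recording rules

A recording rule precomputes an expression that is expensive or frequently used and saves the result as a new set of time series. Queries then read the precomputed result, which is faster than evaluating the original expression each time and reduces the PromQL load. This is especially useful for dashboards, which re-run the same expressions on every refresh.

Apart from the record: <string> field, the configuration is essentially the same as for alerting rules. One group can hold multiple rules, of either type. The rules in a group are evaluated in order at a regular interval, which is the evaluation_interval from the global configuration.

Example configuration: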

groups:
- name: http_requests_total
  rules:
  - record: job:http_requests_total:rate10m
    expr: sum by (job)(rate(http_requests_total[10m]))
    labels:
      team: operations
  - record: job:http_requests_total:rate30m
    expr: sum by (job)(rate(http_requests_total[30m]))
    labels:
      team: operations

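With the rules above, Prometheus periodically evaluates the PromQL expression in expr in the background: it applies rate to the http_requests_total metric over a 10m window, aggregates the result by job with the sum operator, and saves it to the new time series job:http_requests_total:rate10m. The labels field adds custom labels to the stored samples. Note that a label that already exists on the result of the expression is overwritten.

Using templates

Templates let an alert display the labels and values of the time series that triggered it. They can be used in the annotations and labels of alerting rules. Templates use the standard Go template syntax and expose variables that hold the series' labels and value. This makes alerts more readable, and a template can also run further PromQL queries to add extra content to the alert. The Alertmanager web UI displays alerts according to their label values.

{{ $labels.<labelname> }} returns the value of the given label on the current alert instance.

{{ $value }} returns the sample value computed by the alert's PromQL expression.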

groups:
- name: operations
  rules:
# monitor node memory usage
  - alert: node-memory-usage
    expr: (1 - (node_memory_MemAvailable_bytes{env="operations",job!='atlassian'} / (node_memory_MemTotal_bytes{env="operations"})))* 100 > 90
    for: 1m
    labels:
      status: Warning
      team: operations
    annotations:
      description: "Environment: {{ $labels.env }} Instance: {{ $labels.instance }} memory usage above {{ $value }} ! ! !"
      summary:  "node os memory usage status"

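After adjusting the rules, apply them with curl -XPOST http://localhost:9090/-/reload or by restarting Prometheus.

To simulate a failure, lower the threshold to 50. Once the condition has held for 1 minute, the alert in the UI changes to Firing, and the rendered summary and description templates appear under Annotations.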
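
Because annotations are Go templates, the raw {{ $value }} can be formatted before it is shown. The sketch below rounds it with printf, a standard Go template function; the group name and the annotation wording are hypothetical.

```yaml
groups:
  - name: operations-templates        # hypothetical group name
    rules:
      - alert: node-memory-usage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 1m
        labels:
          team: operations
        annotations:
          summary: "Memory usage high on {{ $labels.instance }}"
          # printf rounds the float to one decimal place
          description: 'Memory usage is {{ printf "%.1f" $value }}% (threshold 90%).'
```

The description is single-quoted in YAML so that the double quotes around the printf format string need no escaping.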

Checking rules

# Build an image containing promtool, then use it
FROM golang:1.10

RUN GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go get -u github.com/prometheus/prometheus/cmd/promtool

FROM alpine:latest  

COPY --from=0 /go/bin/promtool /bin
ENTRYPOINT ["/bin/promtool"]  

# Build
docker build -t promtool:0.1 .
# Run
docker run --rm -v /root/test/prom:/opt promtool:0.1 check rules /opt/rule.yml
# Output
Checking /opt/rule.yml
  SUCCESS: 1 rules found

3.4 The scrape_configs field

This section configures what to scrape: the targets, grouped into jobs and instances.

job_name

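The name of the job, which is one unit of scraping. Every job automatically inherits these defaults:

scrape_interval: taken from the global configuration
scrape_timeout: taken from the global configuration
metrics_path: defaults to '/metrics'
scheme: defaults to 'http'

Each of these can be overridden in an individual job: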

[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
[ metrics_path: <path> | default = /metrics ]

honor_labels

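Scraped data carries its own labels, and the configuration attaches labels on the server side as well, so the two can conflict.

With true, the labels in the scraped data win.
With false, the conflicting labels in the scraped data are renamed to exported_<original-label>, and the labels from the configuration are then added.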

[ honor_labels: <boolean> | default = false ]
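
A common place to set honor_labels: true is a job that scrapes a Pushgateway, where the job and instance labels pushed by clients are the ones worth keeping. The sketch below assumes such a setup; the target address is hypothetical.

```yaml
scrape_configs:
  - job_name: pushgateway
    honor_labels: true        # keep the job/instance labels that clients pushed
    static_configs:
      - targets: ['pushgateway.example.internal:9091']   # hypothetical address
```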

scheme

Sets the protocol used for scraping.

[ scheme: <scheme> | default = http ]

params

Optional URL query parameters.

[ <string>: [<string>, ...] ]
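
As an illustration, the sketch below passes two query parameters on every scrape, in the style used for probing through blackbox_exporter. The exporter, its address, the module name, and the probed URL are all assumptions for this example.

```yaml
scrape_configs:
  - job_name: blackbox-http              # hypothetical job
    metrics_path: /probe
    params:
      module: [http_2xx]                 # sent as the "module" query parameter
      target: ['https://example.com']    # sent as the "target" query parameter
    static_configs:
      - targets: ['127.0.0.1:9115']      # assumed blackbox_exporter address
```

Each parameter takes a list because a URL query may repeat the same key with several values.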

Scrape authentication

Credentials sent with every scrape request.

basic_auth

password and password_file are mutually exclusive; set only one.

basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

bearer_token

bearer_token and bearer_token_file are mutually exclusive; set only one.

[ bearer_token: <secret> ]
[ bearer_token_file: /path/to/bearer/token/file ]

tls_config

Certificate settings for scraping over TLS (HTTPS).

tls_config:
  [ ca_file: <filename> ]
  [ cert_file: <filename> ]
  [ key_file: <filename> ]
  [ server_name: <string> ]
  # disable certificate verification
  [ insecure_skip_verify: <boolean> ]
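
The authentication and TLS fields are usually combined in one job. The sketch below scrapes an HTTPS endpoint with basic auth; the job name, host, user name, and all file paths are hypothetical.

```yaml
scrape_configs:
  - job_name: secure-app                                    # hypothetical job
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/app-password   # hypothetical path
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt                 # hypothetical path
      server_name: app.example.internal
      insecure_skip_verify: false                           # keep verification on
    static_configs:
      - targets: ['app.example.internal:8443']
```

Using password_file rather than password keeps the secret out of the main configuration file.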

proxy_url

Scrape the data through a proxy.

[ proxy_url: <string> ]

Service discovery

Prometheus supports many service discovery mechanisms; their detailed configuration is not covered here.

# sd is short for service discovery
azure_sd_configs: 
consul_sd_configs:
dns_sd_configs:
ec2_sd_configs:
openstack_sd_configs:
file_sd_configs:
gce_sd_configs:
kubernetes_sd_configs:
marathon_sd_configs:
nerve_sd_configs:
serverset_sd_configs:
triton_sd_configs:

For more, see the official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
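
Although the individual mechanisms are not covered here, file-based discovery is simple enough to sketch. The job name and path below are hypothetical.

```yaml
scrape_configs:
  - job_name: file-sd-nodes                   # hypothetical job
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml     # hypothetical path
        refresh_interval: 5m                  # re-read the files at least this often
```

Each matched file holds a list of entries in the same targets/labels shape as static_configs, so targets can be added or removed by editing those files without touching prometheus.yml.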

static_configs

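Service discovery supplies scrape targets dynamically; this field supplies them statically. A static configuration is a plain list of targets, and labels can be added directly in the same entry.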

- targets:
    [ - '<host>' ]
  labels:
    [ <labelname>: <labelvalue> ... ]

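All collected data carries labels. With service discovery, additional labels come from the discovery source. For example, Consul provides the following:

__meta_consul_address: the Consul address
__meta_consul_dc: the datacenter the service belongs to in Consul
__meta_consul_metadata_: the service's metadata
__meta_consul_node: the Consul node the service runs on
__meta_consul_service_address: the address at which the service is reached
__meta_consul_service_id: the service ID
__meta_consul_service_port: the service port
__meta_consul_service: the service name
__meta_consul_tags: the tags attached to the service

These labels are the basis for filtering and aggregating data.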

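Filtering data

Scraped data can be overwhelming, especially from targets added through service discovery, so filtering matters. Scraping a target means collecting a series of metrics from it, and Prometheus filters by operating on labels. This is configured in relabel_configs and metric_relabel_configs, both of which take a list of relabel_config entries with the following fields: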

[ source_labels: '[' <labelname> [, ...] ']' ]

[ separator: <string> | default = ; ]

[ target_label: <labelname> ]

[ regex: <regex> | default = (.*) ]

[ modulus: <uint64> ]

[ replacement: <string> | default = $1 ]

# besides the default action, there are keep, drop, hashmod, labelmap, labeldrop, and labelkeep
[ action: <relabel_action> | default = replace ]

Target relabeling example

relabel_configs:
  - source_labels: [job]
    regex:         (.*)some-[regex]
    action:        drop
  - source_labels: [__address__]
    modulus:       8
    target_label:  __tmp_hash
    action:        hashmod
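
The hashmod step on its own only writes a bucket number into __tmp_hash. To actually shard targets across several Prometheus servers, it is typically followed by a keep step, as in this sketch. The shard number 3 is a hypothetical choice; each server would use a different value from 0 to 7.

```yaml
relabel_configs:
  - source_labels: [__address__]
    modulus: 8
    target_label: __tmp_hash
    action: hashmod              # bucket = hash(__address__) mod 8
  - source_labels: [__tmp_hash]
    regex: '3'                   # this server's shard (hypothetical)
    action: keep                 # drop targets that fall in other buckets
```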

Metric relabeling example for a target

- job_name: cadvisor
  ...
  metric_relabel_configs:
  - source_labels: [id]
    regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
    action: drop
  - source_labels: [container_label_JenkinsId]
    regex: '.+'
    action: drop
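
Metric relabeling can also select by metric name, which is available as the __name__ label, or remove labels outright. A short sketch with hypothetical patterns:

```yaml
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_gc_.*'               # hypothetical pattern: drop these series entirely
    action: drop
  - regex: 'container_label_.*'     # hypothetical pattern: remove matching labels
    action: labeldrop
```

Note the difference: drop discards whole series whose source label values match, while labeldrop matches against label names and removes only those labels.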

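Usage example
As shown above, Consul service discovery attaches the label __meta_consul_dc. For convenience, we want it to appear as dc.

The following configuration does this; the action used is replace.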

scrape_configs:
  - job_name: consul_sd
    relabel_configs:
    - source_labels:  ["__meta_consul_dc"]
      regex: "(.*)"
      replacement: $1
      action: replace
      target_label: "dc"

# or, relying on the defaults for regex, replacement, and action
- source_labels:  ["__meta_consul_dc"]
  target_label: "dc"

Filtering scrape targets

relabel_configs:
- source_labels: ["__meta_consul_tags"]
  regex: ".*,development,.*"
  action: keep

sample_limit

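Guards Prometheus against overload by capping the number of samples accepted per scrape, counted after relabeling. If a scrape returns more samples than the limit, the whole scrape is treated as failed. 0 means no limit.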

[ sample_limit: <int> | default = 0 ]
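
A sketch of a job with a sample cap follows; the job name, target, and the limit of 10000 are hypothetical.

```yaml
scrape_configs:
  - job_name: noisy-app                  # hypothetical job
    sample_limit: 10000                  # hypothetical cap per scrape
    static_configs:
      - targets: ['noisy-app.example.internal:8080']
```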

3.8 Remote read and write

Prometheus can read from and write to remote storage, configured through the remote_read and remote_write fields.

remote_read

# URL of the remote read endpoint
url: <string>

# only query the remote end for selectors that contain these label matchers
required_matchers:
  [ <labelname>: <labelvalue> ... ]

[ remote_timeout: <duration> | default = 1m ]

# whether to also query the remote end for time ranges that local storage should fully cover
[ read_recent: <boolean> | default = false ]

basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
  [ <tls_config> ]
[ proxy_url: <string> ]
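
A minimal remote_read entry might look like the sketch below. The endpoint URL and the matcher label are hypothetical.

```yaml
remote_read:
  - url: https://remote-storage.example.internal/api/v1/read   # hypothetical endpoint
    remote_timeout: 1m
    read_recent: false
    required_matchers:
      cluster: demo-cluster          # hypothetical label
```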

remote_write

url: <string>

[ remote_timeout: <duration> | default = 30s ]

# relabel (filter) samples before they are written
write_relabel_configs:
  [ - <relabel_config> ... ]

basic_auth:
  [ username: <string> ]
  [ password: <string> ]
  [ password_file: <string> ]

[ bearer_token: <string> ]

[ bearer_token_file: /path/to/bearer/token/file ]

tls_config:
  [ <tls_config> ]

[ proxy_url: <string> ]

# Fine-grained tuning of remote write; only the official comments are listed here for now
queue_config:
  # Number of samples to buffer per shard before we start dropping them.
  [ capacity: <int> | default = 10000 ]
  # Maximum number of shards, i.e. amount of concurrency.
  [ max_shards: <int> | default = 1000 ]
  # Maximum number of samples per send.
  [ max_samples_per_send: <int> | default = 100]
  # Maximum time a sample will wait in buffer.
  [ batch_send_deadline: <duration> | default = 5s ]
  # Maximum number of times to retry a batch on recoverable errors.
  [ max_retries: <int> | default = 3 ]
  # Initial retry delay. Gets doubled for every retry.
  [ min_backoff: <duration> | default = 30ms ]
  # Maximum retry delay.
  [ max_backoff: <duration> | default = 100ms ]
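
Combining the fields above, the sketch below forwards only a subset of series to remote storage. The endpoint URL, the regex, and the queue values are assumptions for illustration; the regex assumes recording rules are named with a job: prefix, as in the earlier examples.

```yaml
remote_write:
  - url: https://remote-storage.example.internal/api/v1/write   # hypothetical endpoint
    remote_timeout: 30s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'job:.*'              # keep only series produced by recording rules
        action: keep
    queue_config:
      capacity: 10000
      max_shards: 50                 # hypothetical; lower than the default to limit concurrency
      max_samples_per_send: 100
```

Filtering with write_relabel_configs reduces what is sent without affecting what is stored locally.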

4. References

(1) Prometheus Configuration Explained
https://www.dazhuanlan.com/2019/12/12/5df11ada207ce/
(2) The Four Modules of the Prometheus Configuration File prometheus.yml Explained
http://www.21yunwei.com/archives/7321
(3) Official documentation
https://prometheus.io/docs/prometheus/latest/configuration/configuration/
(4) Prometheus, the Monitoring Powerhouse: Rules
https://zhuanlan.zhihu.com/p/179295676
(5) Prometheus, the Monitoring Powerhouse: Alertmanager (1)
https://zhuanlan.zhihu.com/p/179292686
(6) Prometheus, the Monitoring Powerhouse: Alertmanager (2)
https://zhuanlan.zhihu.com/p/179294441