AlertmanagerConfig references:
- Example: https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/alerting/alertmanager-config-example.yaml
- API reference: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerconfig
(1) PrometheusRule
The alerting rules that are configured by default can be listed with the following command:
```
# kubectl get prometheusrule -n monitoring
NAME                              AGE
alertmanager-main-rules           19d
kube-prometheus-rules             19d
kube-state-metrics-rules          19d
kubernetes-monitoring-rules       19d
node-exporter-rules               19d
prometheus-k8s-prometheus-rules   19d
prometheus-operator-rules         19d
```
Add `-oyaml` to see the full definition of a particular rule set:
```yaml
# kubectl get prometheusrule -n monitoring node-exporter-rules -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
...
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
```
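The second condition in `expr` relies on `predict_linear`, which fits a least-squares line through the samples in the range window (`[6h]`) and extrapolates it `24*60*60` seconds past the evaluation time; a negative result means the filesystem is on track to run out of space within a day. The Python sketch below reproduces only that arithmetic for illustration. It is not Prometheus source code, and the sample data is made up.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], eval_time: float, t: float) -> float:
    """Fit value = intercept + slope * x by least squares and predict it at eval_time + t.

    samples: (unix_seconds, value) pairs taken from the range window, e.g. [6h].
    Assumes at least two samples with distinct timestamps.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Measure time relative to the evaluation time, so the intercept is the fitted value "now".
    xs = [ts - eval_time for ts, _ in samples]
    ys = [value for _, value in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # Extrapolate t seconds beyond the evaluation time.
    return intercept + slope * t


if __name__ == "__main__":
    now = 0.0
    # Hypothetical data: free space (GiB) sampled hourly over 6h, shrinking by 1 GiB per hour.
    samples = [(now - h * 3600, 10.0 + h) for h in range(6, -1, -1)]
    predicted = predict_linear(samples, now, 24 * 60 * 60)
    print(f"predicted free space in 24h: {predicted:.1f} GiB")
```

With free space falling from 16 GiB to 10 GiB over six hours, the fitted line predicts about -14 GiB a day from now, so the `< 0` condition would hold.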
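The key fields of each rule are:
- `alert`: the name of the alerting rule
- `annotations`: annotations attached to the alert, usually the human-readable alert message (`summary`, `description`, and so on)
- `expr`: the PromQL alert expression; every series it returns is an active alert
- `for`: how long the expression must keep returning a result before the alert fires and is sent out; until then the alert stays pending
- `labels`: labels added to the alert, used by Alertmanager to route it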
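The `for` field turns each rule into a small state machine. The first evaluation that returns a series marks its alert as pending. The alert only becomes firing, and is sent to Alertmanager, once the expression has kept returning that series for the whole `for` duration. If the series disappears in between, the timer starts over. The Python sketch below models one series under that simplified reading. It is an illustration, not Prometheus's implementation, and it ignores refinements such as resolved-notification handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"               # "inactive", "pending" or "firing"
    active_since: Optional[float] = None  # time the expression first returned this series


def evaluate(prev: AlertState, expr_true: bool, now: float, for_seconds: float) -> AlertState:
    """One rule evaluation for a single series, following the `for` semantics."""
    if not expr_true:
        # Series is no longer returned: drop back to inactive; the timer restarts next time.
        return AlertState()
    since = prev.active_since if prev.active_since is not None else now
    if now - since >= for_seconds:
        return AlertState("firing", since)
    return AlertState("pending", since)


if __name__ == "__main__":
    # Evaluate every 30s with `for: 1m` (60 seconds).
    state = AlertState()
    for now, hit in [(0, True), (30, True), (60, True), (90, False)]:
        state = evaluate(state, hit, now, 60)
        print(now, state.state)
```

The loop prints pending, pending, firing, inactive. The alert fires only at the third evaluation, once a full minute has passed since the condition first held.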
(2) Alerting on domain access latency
Suppose you want to monitor domain access latency and alert when it exceeds 1 second. You can create a PrometheusRule like the following:
```yaml
# cat blackbox.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: blackbox-exporter
    # keep these so the Prometheus resource's ruleSelector picks up this rule
    prometheus: k8s
    role: alert-rules
  name: blackbox
  namespace: monitoring
spec:
  groups:
  - name: blackbox-exporter
    rules:
    - alert: DomainAccessDelayExceeds1s
      annotations:
        description: Probe latency for domain {{ $labels.instance }} exceeds 1 second, current latency is {{ $value }}
        summary: Domain probe access latency exceeds 1 second
      expr: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance) > 1
      for: 1m
      labels:
        severity: warning
        type: blackbox
```
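blackbox_exporter reports `probe_http_duration_seconds` once per request phase (the `phase` label: resolve, connect, tls, processing, transfer), so `sum(...) by (instance)` adds the phases into one total per probed target before comparing it with 1. PromQL regex matchers are also fully anchored, so `job=~"blackbox"` behaves like `job="blackbox"`. The Python sketch below mimics that aggregation and filter on hand-written samples. The instance URLs are hypothetical, and this is an illustration rather than how Prometheus evaluates PromQL.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One sample of an instant vector: (label set, value).
Sample = Tuple[Dict[str, str], float]


def slow_instances(samples: List[Sample], threshold: float = 1.0) -> Dict[str, float]:
    """Mimic: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance) > threshold."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in samples:
        # Anchored regex: =~"blackbox" only matches the exact value "blackbox".
        if labels.get("job") != "blackbox":
            continue
        # Keep only the `instance` label, adding the per-phase durations together.
        totals[labels.get("instance", "")] += value
    # The comparison acts as a filter: only groups above the threshold remain and would alert.
    return {instance: total for instance, total in totals.items() if total > threshold}


if __name__ == "__main__":
    samples = [
        ({"job": "blackbox", "instance": "https://a.example.com", "phase": "connect"}, 0.4),
        ({"job": "blackbox", "instance": "https://a.example.com", "phase": "transfer"}, 0.9),
        ({"job": "blackbox", "instance": "https://b.example.com", "phase": "connect"}, 0.2),
    ]
    print(slow_instances(samples))
```

Only `https://a.example.com` is returned. Its phases add up to about 1.3 seconds, even though no single phase exceeds 1 second.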
Create the PrometheusRule and check it:
```
# kubectl create -f blackbox.yaml
prometheusrule.monitoring.coreos.com/blackbox created
# kubectl get -f blackbox.yaml
NAME       AGE
blackbox   65s
```
The rule then also appears in the Prometheus web UI:
![](https://img.haomeiwen.com/i20896689/cb3f8eb15eba5a03.png)
If the probe latency of any domain exceeds 1s, the alert fires, as shown below:
![](https://img.haomeiwen.com/i20896689/447de18e86c6d166.png)
Because no Alertmanager route matches the labels of the blackbox alert, it goes to the default receiver, which here is email:
![](https://img.haomeiwen.com/i20896689/1d3e721aee933f25.png)
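Next, you can send alerts to specific people according to your actual business needs by changing the routing. For example, the following route sends domain probe alerts to WeChat (partial configuration):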
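```yaml
# Child route, placed under route.routes in the Alertmanager configuration
- match:
    type: blackbox          # matches the `type: blackbox` label set by the rule
  receiver: "wechat-ops"    # must be defined under `receivers`
  repeat_interval: 10m      # resend a still-firing alert after 10 minutes
```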
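Alertmanager walks the routing tree from the top-level `route`. Child routes are checked in order, and the first one whose matchers all fit the alert's labels handles it, unless that route sets `continue: true`. If no child matches, the parent's own receiver is used, which is why the blackbox alert went to email before this route was added. The Python sketch below models only equality `match` and `continue`. It is not Alertmanager's code, and the default receiver name `default-email` is a hypothetical placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)  # equality matchers, like `match:`
    routes: List["Route"] = field(default_factory=list)  # child routes, like `routes:`
    continue_: bool = False  # mirrors `continue:` (a Python keyword, hence the underscore)

    def matches(self, labels: Dict[str, str]) -> bool:
        return all(labels.get(key) == value for key, value in self.match.items())


def receivers_for(route: Route, labels: Dict[str, str]) -> List[str]:
    """Receivers an alert with `labels` is delivered to, assuming `route` already matched."""
    result: List[str] = []
    for child in route.routes:
        if child.matches(labels):
            result.extend(receivers_for(child, labels))
            if not child.continue_:
                break  # first matching child wins
    # No child matched: this node's own receiver handles the alert.
    return result or [route.receiver]


if __name__ == "__main__":
    root = Route(
        receiver="default-email",
        routes=[Route(receiver="wechat-ops", match={"type": "blackbox"})],
    )
    print(receivers_for(root, {"alertname": "DomainAccessDelayExceeds1s", "type": "blackbox"}))
    print(receivers_for(root, {"alertname": "NodeFilesystemSpaceFillingUp", "severity": "warning"}))
```

The first alert carries `type: blackbox` and goes to `wechat-ops`. The node-exporter alert matches no child route and falls back to the default receiver.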
The alert is then received in WeChat:
![](https://img.haomeiwen.com/i20896689/63ef84bd9f466a5e.png)