I. Configure alert rules
1. Configure the path where the rule files are stored
$ vim prometheus-configmap.yaml
Add the following configuration:
rule_files:
- /etc/config/rules/*.rules
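To make the effect of this glob concrete, here is a minimal Python sketch of which paths it would select. It is only an approximation: Prometheus resolves the pattern itself, and `fnmatch` on the file name is used here as a stand-in. The two `.rules` names are the ConfigMap keys created later in this section; `notes.txt` and `prometheus.yml` are hypothetical paths added for contrast.

```python
import fnmatch
import posixpath

RULE_GLOB = "/etc/config/rules/*.rules"


def matches_rule_glob(path, pattern=RULE_GLOB):
    """Return True if `path` sits in the pattern's directory and its file name matches."""
    pattern_dir, pattern_name = posixpath.split(pattern)
    path_dir, path_name = posixpath.split(path)
    return path_dir == pattern_dir and fnmatch.fnmatchcase(path_name, pattern_name)


candidates = [
    "/etc/config/rules/general.rules",
    "/etc/config/rules/node.rules",
    "/etc/config/rules/notes.txt",  # hypothetical: wrong extension, not loaded
    "/etc/config/prometheus.yml",   # hypothetical: outside the rules directory
]
loaded = [path for path in candidates if matches_rule_glob(path)]
print(loaded)
```

Only files whose names end in `.rules` and that live directly in `/etc/config/rules` are selected, which is why the ConfigMap keys in step 3 carry that suffix.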
2. Apply prometheus-configmap.yaml again so that the change takes effect.
$ kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured
3. Write the alert rules
Here we write a few common alert rules for testing, in a file named prometheus-rules.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 2 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 1% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: memory usage is too high"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: CPU usage is too high"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value: {{ $value }})"
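The threshold expressions in node.rules are plain percentage arithmetic, which can be checked by hand. The sketch below mirrors the filesystem and memory expressions in Python; the byte counts are made-up sample values, not measurements from any real node.

```python
GIB = 1024 ** 3


def filesystem_usage_percent(free_bytes, size_bytes):
    """Mirror of: 100 - (node_filesystem_free_bytes / node_filesystem_size_bytes * 100)."""
    return 100 - (free_bytes / size_bytes * 100)


def memory_usage_percent(mem_free, cached, buffers, mem_total):
    """Mirror of: 100 - (MemFree + Cached + Buffers) / MemTotal * 100."""
    return 100 - (mem_free + cached + buffers) / mem_total * 100


# Made-up sample values: a 40 GiB disk with 30 GiB free, 16 GiB RAM with 4 GiB reclaimable.
fs_usage = filesystem_usage_percent(free_bytes=30 * GIB, size_bytes=40 * GIB)
mem_usage = memory_usage_percent(
    mem_free=1 * GIB, cached=2 * GIB, buffers=1 * GIB, mem_total=16 * GIB
)

fs_alert = fs_usage > 1     # threshold from NodeFilesystemUsage
mem_alert = mem_usage > 80  # threshold from NodeMemoryUsage
print(f"filesystem: {fs_usage:.1f}% used, alert condition: {fs_alert}")
print(f"memory: {mem_usage:.1f}% used, alert condition: {mem_alert}")
```

With these numbers the filesystem condition holds (25% is above 1%) while the memory condition does not (75% is below 80%), so only NodeFilesystemUsage would move toward firing.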
4. Apply prometheus-rules.yaml
$ kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules created
5. Mount the ConfigMap into the container's rules directory: edit prometheus-statefulset.yaml and add the prometheus-rules entries to volumeMounts and volumes.
$ vim prometheus-statefulset.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            # added: mount the rules ConfigMap into the rules directory
            - name: prometheus-rules
              mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        # added: volume backed by the prometheus-rules ConfigMap
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
Note: the configMap name here must match the name of the ConfigMap created from prometheus-rules.yaml (prometheus-rules).
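The naming requirement can be checked mechanically. The following Python sketch models the excerpt as plain dictionaries and looks for volumeMounts that reference an undefined volume, then lists the files the rules mount would expose (each ConfigMap key becomes one file under the mountPath). Treating prometheus-data as coming from a volumeClaimTemplate is an assumption, since the excerpt does not show where that volume is defined.

```python
import posixpath

volume_mounts = [
    {"name": "config-volume", "mountPath": "/etc/config"},
    {"name": "prometheus-data", "mountPath": "/data"},
    {"name": "prometheus-rules", "mountPath": "/etc/config/rules"},
]
volumes = [
    {"name": "config-volume", "configMap": {"name": "prometheus-config"}},
    {"name": "prometheus-rules", "configMap": {"name": "prometheus-rules"}},
]
# Assumption: prometheus-data is provided by a volumeClaimTemplate of the StatefulSet.
claim_templates = ["prometheus-data"]
# Keys of the prometheus-rules ConfigMap defined in step 3.
rules_configmap_keys = ["general.rules", "node.rules"]


def unresolved_mounts(mounts, defined_volumes, templates=()):
    """Return the names of mounts that no volume or claim template defines."""
    defined = {volume["name"] for volume in defined_volumes} | set(templates)
    return [mount["name"] for mount in mounts if mount["name"] not in defined]


def mounted_files(mount_path, configmap_keys):
    """Each ConfigMap key is projected as one file under the mount path."""
    return [posixpath.join(mount_path, key) for key in configmap_keys]


missing = unresolved_mounts(volume_mounts, volumes, claim_templates)
rule_files = mounted_files("/etc/config/rules", rules_configmap_keys)
print("unresolved mounts:", missing)
print("rule files:", rule_files)
```

An empty list of unresolved mounts means every mount has a matching volume, and the resulting file paths are the ones the rule_files glob from step 1 is meant to pick up.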
6. Re-apply prometheus-statefulset.yaml
$ kubectl apply -f prometheus-statefulset.yaml
$ kubectl get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-6b5bbd5bd4-g9mpd          2/2     Running   0          66m
coredns-55f46dd959-9kspv               1/1     Running   3          35d
coredns-55f46dd959-l5vww               1/1     Running   0          35d
grafana-0                              1/1     Running   0          2d
kube-state-metrics-6cf969f79b-29f2r    1/1     Running   0          5d23h
kubernetes-dashboard-ccd98cd4c-jzlbs   1/1     Running   0          34d
node-exporter-7x9zl                    1/1     Running   0          18h
node-exporter-ksslf                    1/1     Running   0          18h
prometheus-0                           2/2     Running   0          30m
7. Check the Rules page in the Prometheus UI to confirm that the rules have been loaded.
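The same check can be scripted against the Prometheus HTTP API, which serves loaded rules at /api/v1/rules. The sketch below is a hedged illustration: the sample response is hand-written in the shape of that endpoint rather than captured from a server, and the address in the commented-out call is a hypothetical one (for example, a local port-forward).

```python
import json
import urllib.request


def fetch_rules(base_url):
    """GET <base_url>/api/v1/rules and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=10) as response:
        return json.load(response)


def alerting_rules(rules_response):
    """Return (group, alert name, state) for every alerting rule in the response."""
    found = []
    for group in rules_response["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting":
                found.append((group["name"], rule["name"], rule.get("state")))
    return found


# Hand-written sample in the shape of the API response (not captured output).
sample_response = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "general.rules",
                "file": "/etc/config/rules/general.rules",
                "rules": [
                    {"type": "alerting", "name": "InstanceDown", "state": "inactive"},
                ],
            },
            {
                "name": "node.rules",
                "file": "/etc/config/rules/node.rules",
                "rules": [
                    {"type": "alerting", "name": "NodeFilesystemUsage", "state": "firing"},
                    {"type": "alerting", "name": "NodeMemoryUsage", "state": "inactive"},
                    {"type": "alerting", "name": "NodeCPUUsage", "state": "inactive"},
                ],
            },
        ]
    },
}

rules = alerting_rules(sample_response)
for group, name, state in rules:
    print(group, name, state)

# Against a live server (hypothetical address, e.g. through a port-forward):
# print(alerting_rules(fetch_rules("http://localhost:9090")))
```

If all four alert names from prometheus-rules.yaml appear in the output, both rule files were loaded.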
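II. Configure DingTalk alerts
1. Register a DingTalk account -> robot management -> Custom (connect a custom service via webhook) -> Add -> copy the webhook
Once the group robot is configured, record its Webhook address; it is needed later when configuring the DingTalk alert plugin. The format is: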
https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
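Before wiring the webhook into Alertmanager, it can be useful to confirm that the robot accepts messages at all. The Python sketch below builds a plain-text message request for the webhook address using only the standard library. The token is the placeholder from above and the message text is an arbitrary example; the send call is left commented out so nothing is posted by accident. If the robot has security settings enabled (such as required keywords), the message would also need to satisfy them.

```python
import json
import urllib.request


def build_dingtalk_request(webhook_url, content):
    """Build a POST request carrying a plain-text DingTalk robot message."""
    payload = {"msgtype": "text", "text": {"content": content}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_dingtalk_request(
    "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx",  # placeholder token
    "Prometheus alert test",  # arbitrary example text
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Uncomment to send once a real token is in place:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```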
2. Create the DingTalk alert plugin (dingtalk-webhook.yaml), replacing access_token=xxxxxx in the file with the robot token obtained in the previous step
$ vim dingtalk-webhook.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: dingtalk
  name: webhook-dingtalk
  namespace: monitoring
spec:
  replicas: 1
  # apps/v1 requires an explicit selector that matches the pod template labels
  selector:
    matchLabels:
      run: dingtalk
  template:
    metadata:
      labels:
        run: dingtalk
    spec:
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v0.3.0
        imagePullPolicy: IfNotPresent
        # After setting up the custom robot in the DingTalk group, replace the xxxxxx part below with the actual access_token
        args:
          - --ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx
        ports:
        - containerPort: 8060
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: dingtalk
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 8060
    protocol: TCP
    targetPort: 8060
  selector:
    run: dingtalk
  sessionAffinity: None
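The --ding.profile argument packs two things into one string: a profile name (webhook1) and the robot URL. The profile name later reappears in the path that Alertmanager posts to, /dingtalk/webhook1/send. The Python sketch below illustrates that relationship; it is only an illustration of the naming convention used in this article, not the plugin's actual parsing code.

```python
def parse_ding_profile(arg):
    """Split '--ding.profile=<name>=<url>' into (name, url)."""
    flag, _, value = arg.partition("=")
    if flag != "--ding.profile":
        raise ValueError(f"unexpected flag: {flag!r}")
    name, separator, url = value.partition("=")
    if not separator or not name or not url:
        raise ValueError("expected --ding.profile=<name>=<url>")
    return name, url


def profile_path(name):
    """Path on the webhook service that corresponds to a profile name."""
    return f"/dingtalk/{name}/send"


name, url = parse_ding_profile(
    "--ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx"
)
print(name)                # profile name
print(url)                 # robot URL, still containing its own '=' in the query string
print(profile_path(name))  # path used in the Alertmanager webhook url
```

Splitting only on the first two '=' characters matters because the robot URL itself contains one in access_token=....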
3. Apply dingtalk-webhook.yaml
$ kubectl apply -f dingtalk-webhook.yaml
4. Modify the Alertmanager configuration in alertmanager-configmap.yaml, apply the updated ConfigMap, and then test alert delivery
$ vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global: null
    route:
      group_interval: 5m
      group_wait: 10s
      receiver: dingtalk
      repeat_interval: 10m
    receivers:
    - name: default-receiver
    - name: dingtalk
      webhook_configs:
      - send_resolved: true
        url: http://webhook-dingtalk.monitoring.svc.cluster.local:8060/dingtalk/webhook1/send
Note: the url can use the Service address directly, in the form servicename.namespace.svc.cluster.local
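As a small illustration of how that address is assembled, the Python sketch below rebuilds the webhook url from its parts: Service name, namespace, port, and profile name. The function names are hypothetical helpers for this example, and cluster.local is assumed as the cluster domain, as in the url above.

```python
def service_dns_name(service, namespace, cluster_domain="cluster.local"):
    """In-cluster DNS name of a Service: <service>.<namespace>.svc.<cluster domain>."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def dingtalk_webhook_url(service, namespace, port, profile):
    """URL that Alertmanager posts to for a given webhook profile."""
    host = service_dns_name(service, namespace)
    return f"http://{host}:{port}/dingtalk/{profile}/send"


url = dingtalk_webhook_url(
    service="webhook-dingtalk",
    namespace="monitoring",
    port=8060,
    profile="webhook1",
)
print(url)
```

Because Alertmanager runs in kube-system while the webhook Service lives in monitoring, the namespace-qualified name is what lets the two find each other.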
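5. Test that DingTalk receives alerts
① Modify the rules in prometheus-rules.yaml so that an alert is triggered (for example, the 1% filesystem threshold used above).
② Check the alert state on the Prometheus Alerts page (PENDING or FIRING).
PENDING means the rule expression is true but the `for` duration has not yet elapsed, so no notification has been sent.
FIRING means the condition has held for the full `for` duration and the alert has been sent. (For details, check the logs of the webhook-dingtalk pod.)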
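The difference between the two states comes down to how long the rule expression has been true compared with the rule's `for` value. The Python sketch below is a simplified model of that transition: it ignores evaluation intervals and works on plain timestamps in seconds, so it should be read as an illustration rather than Prometheus's actual implementation.

```python
FOR_SECONDS = 2 * 60  # matches "for: 2m" in the rules above


def alert_state(condition_true_since, now, for_seconds=FOR_SECONDS):
    """Classify an alert given when its expression first became true.

    condition_true_since is a timestamp in seconds, or None if the
    expression is currently false.
    """
    if condition_true_since is None:
        return "inactive"
    if now - condition_true_since >= for_seconds:
        return "firing"
    return "pending"


print(alert_state(None, now=300))  # expression false
print(alert_state(240, now=300))   # true for 60 s, shorter than 2m
print(alert_state(100, now=300))   # true for 200 s, longer than 2m
```

An alert that stays pending and never fires usually means the condition stopped being true before the `for` window elapsed.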