For Prometheus monitoring, here is a simple example of sending an email alert when disk space runs low.
The prometheus.yml configuration file:
```yaml
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['172.1.5.220:9093']
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "node_down.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['172.1.5.220:9090']

  - job_name: 'cadvisor'
    static_configs:
    - targets: ['172.1.5.220:8080']

  - job_name: 'harbor-250'
    static_configs:
    - targets: ['192.168.8.250:4080']

  - job_name: 'java-demo'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['192.168.9.222:8080']

  - job_name: 'node'
    scrape_interval: 8s
    static_configs:
    - targets: ['172.1.5.220:9100', '192.168.9.223:9100', '192.168.8.250:4100']
```
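Before pointing Prometheus at the new file, it is worth validating it and reloading without a restart. A minimal sketch, assuming the file lives in the current directory and the Prometheus instance from the config above (172.1.5.220:9090); the reload endpoint only works if Prometheus was started with `--web.enable-lifecycle`:

```bash
# Validate prometheus.yml syntax and the rule files it references.
promtool check config prometheus.yml

# Ask the running server to reload its configuration
# (requires --web.enable-lifecycle; alternatively send SIGHUP to the prometheus process).
curl -X POST http://172.1.5.220:9090/-/reload
```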
node_down.yml is configured as follows. The HostOutOfDiskSpace rule fires when the available disk space drops below 60% of the filesystem's total size.
If you are not sure how to write rules yourself, you can refer to this collection of ready-made rules; that is where I took mine from:
Rule reference
```yaml
groups:
- name: node_down
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: test
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

- name: out_of_disk_space
  rules:
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/etc/hostname"} * 100) / node_filesystem_size_bytes{mountpoint="/etc/hostname"} < 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 60% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
```
alertmanager.yml is configured as follows. Because the company's mail system is Alibaba Cloud enterprise email, the smarthost is smtp.qiye.aliyun.com:465.
```yaml
global:
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'abc@xxx.com'
  smtp_auth_username: 'abc@xxx.com'
  smtp_auth_password: 't43123456'
  smtp_require_tls: false

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: live-monitoring

receivers:
- name: 'live-monitoring'
  email_configs:
  - to: '3424354443@qq.com'
```
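To confirm that the Alertmanager configuration parses and that mail actually goes out, you can validate the file and push a hand-crafted test alert. A minimal sketch, assuming the Alertmanager from the config above (172.1.5.220:9093); the TestEmail alert name is made up for illustration:

```bash
# Validate alertmanager.yml.
amtool check-config alertmanager.yml

# Post a throwaway alert straight to the Alertmanager API; if the SMTP
# settings are correct, it should arrive at the receiver's mailbox.
curl -X POST http://172.1.5.220:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestEmail", "severity": "warning"},
        "annotations": {"summary": "Test alert to verify email delivery"}}]'
```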
The smtp_smarthost here matters. At first I assumed the company's own domain, smtp.xxx.com:25, would be enough, but it produced the following errors:
```
level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com" context_err="context deadline exceeded"
level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com"
```
The error means the TLS certificate presented during STARTTLS is issued for Alibaba's mail hosts (*.mxhichina.com), not for the company domain, so certificate verification fails. After searching online, I found the correct smarthost is smtp.qiye.aliyun.com:465.
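If you run into a similar certificate error, you can inspect what the mail server actually presents on each port. A minimal sketch using openssl (port 465 speaks TLS from the start, while port 25 upgrades via STARTTLS):

```bash
# Show the certificate served on the SMTPS port (implicit TLS).
openssl s_client -connect smtp.qiye.aliyun.com:465 < /dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer

# Show the certificate offered via STARTTLS on port 25 (what triggered the error above).
openssl s_client -starttls smtp -connect smtp.xxx.com:25 < /dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer
```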
The result looks like this:
![](https://img.haomeiwen.com/i1150260/ee9e7794de152c24.png)
![](https://img.haomeiwen.com/i1150260/c38d88200d60e0f7.png)
![](https://img.haomeiwen.com/i1150260/2c6d77652279c690.png)