Description:
Prometheus handles monitoring, alerting, and scraping and storing metrics; Grafana handles visualization.
The HTTP endpoint that exposes a monitored component's metrics is called an exporter.
Metric name + labels: http_requests_total{method="POST",endpoint="/api/tracks"}
For the metric name http_requests_total, adding or removing a label creates a new time series; combinations of labels are used to query and aggregate the results.
Prometheus has four metric types:
1) Counter: a cumulative value that only increases and never decreases; it is reset when the process restarts.
2) Gauge: a regular value that can go up or down, e.g. temperature or memory usage; it is also reset when the process restarts.
3) Histogram: think of it as a bar chart; it is commonly used to track the size distribution of events, e.g. request latency or response size. Its special feature is that it groups the recorded observations into buckets and also provides a count and a sum of all observed values.
4) Summary: similar to a Histogram, it provides quantiles, splitting the tracked results by percentile. For example, quantile=0.95 means the value below which 95% of the sampled observations fall.
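For reference, this is roughly how the four types look in an exporter's /metrics output. The metric names below are real node_exporter/Prometheus metrics, but the label values and sample values are made up for illustration:
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.234567e+07
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.123e+09
# TYPE prometheus_http_request_duration_seconds histogram
prometheus_http_request_duration_seconds_bucket{handler="/metrics",le="0.1"} 42
prometheus_http_request_duration_seconds_bucket{handler="/metrics",le="+Inf"} 45
prometheus_http_request_duration_seconds_sum{handler="/metrics"} 1.27
prometheus_http_request_duration_seconds_count{handler="/metrics"} 45
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0.5"} 3.1e-05
go_gc_duration_seconds{quantile="1"} 0.00012
go_gc_duration_seconds_sum 0.0021
go_gc_duration_seconds_count 67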
Images on the monitoring host:
docker pull prom/node-exporter
docker pull prom/prometheus
docker pull grafana/grafana
Image to install on the monitored hosts (required on every monitored machine):
docker pull prom/node-exporter
docker run -d -p 9100:9100 \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  --name sakura \
  prom/node-exporter
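A quick check that the exporter is up (run on the monitored host; assumes curl is installed):
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head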
Command:
└─[$] <> su
Password:
sh-3.2# mkdir -p /opt/prometheus/
sh-3.2# chmod 777 /opt/prometheus/
sh-3.2# exit
exit
┌─[web@WebdeMBP] - [~] - [三 7 15, 23:27]
└─[$] <> cd /opt/prometheus/
vim prometheus.yml
global:
  scrape_interval: 60s     # how often Prometheus scrapes metrics from the various metrics endpoints
  evaluation_interval: 60s # how often alerting rules are evaluated

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: linux
    static_configs:
      - targets: ['10.88.88.4:9100']
        labels:
          instance: test_node
docker run -d \
  -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
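Once the Prometheus container is running, a quick check (from the monitoring host) that both jobs are being scraped:
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'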
docker run -d -p 3000:3000 --name grafana grafana/grafana
Grafana's initial login is admin/admin.

Polled data source: localhost:9090/metrics
In Grafana, click the + in the side navigation bar, click Import, and import a dashboard JSON or URL to add a Chinese-language dashboard.
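The data-source step can also be scripted through Grafana's HTTP API instead of the UI. A sketch, assuming the default admin/admin login and that Prometheus is reachable from the Grafana container at the monitoring host's IP (10.88.89.60 here is an assumption, taken from the Grafana domain setting used later):
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://10.88.89.60:9090","access":"proxy"}'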

Configure alerting rules (Alertmanager sends the notifications that these rules trigger):
- vim rules.yml
groups:
  - name: hostStatsAlert
    rules:
      - alert: hostCpuUsageAlert
        expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="linux",mode="idle"}[5m]))) * 100 > 85
        for: 1m
        labels:
          instance: test_node
          job: linux
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
      - alert: hostMemUsageAlert
        expr: (1 - (node_memory_MemAvailable_bytes{job="linux"} / node_memory_MemTotal_bytes{job="linux"})) * 100 > 85
        for: 1m
        labels:
          instance: test_node
          job: linux
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
- vim prometheus.yml (Prometheus's promtool can check the syntax of rules.yml; see the promtool example after this snippet)
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.199.204:9093"]
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
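promtool ships inside the prom/prometheus image, so the syntax checks can be run without installing anything on the host. A sketch, assuming both files live under /opt/prometheus:
docker run --rm -v /opt/prometheus:/cfg --entrypoint /bin/promtool \
  prom/prometheus check rules /cfg/rules.yml
docker run --rm -v /opt/prometheus:/cfg --entrypoint /bin/promtool \
  prom/prometheus check config /cfg/prometheus.yml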
docker run -d -p 9090:9090 --name=prometheus \
-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /opt/prometheus/rules.yml:/etc/prometheus/rules.yml \
prom/prometheus
After the container started, localhost:9090/rules failed to list the alerting rules.

Try docker logs -f prometheus to inspect the logs. If the errors look like the following, the problem is in rules.yml: check its formatting or the target IP configuration.
level=error ts=2020-07-16T09:14:31.981Z caller=manager.go:904 component="rule manager" msg="loading groups failed" err="/etc/prometheus/rules.yml: yaml: line 1: did not find expected key"
level=error ts=2020-07-16T09:14:31.981Z caller=main.go:818 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
localhost:9090/alerts shows the current state of active alerts.
To test alerting, drive CPU usage up manually with cat /dev/zero > /dev/null, or lower the threshold in the expr expression.
To avoid receiving a stream of similar notifications, related alerts can be grouped and alerted on together. The grouping mechanism merges detailed alerts into a single notification; for example, when a system outage triggers a large number of alerts at once, grouping combines them into one notification instead of flooding the receiver:
group_by: ['alertname', 'job']
When a new alert group is created, Alertmanager waits at least group_wait before sending the initial notification.
This effectively buffers the alerts sent from Prometheus to Alertmanager, grouping alerts that share the same labels rather than sending every one of them:
group_by: ['alertname', 'job']
group_wait: 45s # usually set to anywhere from 0s to a few minutes
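A minimal sketch of where these keys sit in the Alertmanager route block (the values are placeholders; the full alertmanager.yml used in this setup appears further below):
route:
  group_by: ['alertname', 'job']
  group_wait: 45s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h  # wait before re-sending a notification that was already sent
  receiver: 'email-receiver'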
without removes the listed labels from the result and keeps all the others; by does the opposite: only the listed labels are kept in the result vector and the rest are removed. With without and by, data can be aggregated along the label dimensions of the samples.
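For example, both queries below aggregate the per-core idle CPU rate from node_exporter: the by form keeps only the instance label, while the without form drops cpu and mode and keeps the remaining labels (such as job):
avg by (instance) (irate(node_cpu_seconds_total{job="linux",mode="idle"}[5m]))
avg without (cpu, mode) (irate(node_cpu_seconds_total{job="linux",mode="idle"}[5m]))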
Reference for commonly used alert expr expressions: https://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_liunx_65_prometheus_alertmanager_rule.html
Once a rule's condition is triggered, the alert follows for: 1m: within the first minute it stays in the pending state, and after the condition has held for more than one minute it switches to firing.
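The state transition can also be watched from the Prometheus expression browser via the built-in ALERTS series (a quick check; it only returns data once the rule above is pending or firing):
ALERTS{alertname="hostCpuUsageAlert"}
ALERTS{alertstate="firing"}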

Start Alertmanager (alerts that reach the firing state in Prometheus are sent to Alertmanager, which delivers them to third-party notification channels):
docker pull prom/alertmanager
vim alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xx@qq.com'
  smtp_auth_username: 'xx@qq.com'
  smtp_auth_password: 'nmfzotlhraeqbijh'
  smtp_hello: 'qq.com'
  smtp_require_tls: false

route:
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'email-receiver'

receivers:
  - name: email-receiver
    email_configs:
      - to: <xx@qq.com>
        send_resolved: true
      - to: <xx@qq.com>
        send_resolved: true
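The Alertmanager container can be started the same way as the other components. A sketch, assuming the file above is saved as /opt/prometheus/alertmanager.yml (on older prom/alertmanager image tags the in-container config path may be /etc/alertmanager/config.yml instead):
docker run -d -p 9093:9093 --name=alertmanager \
  -v /opt/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager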
The value of smtp_auth_password ('nmfzotlhraeqbijh') is the QQ Mail authorization code, not the account password.
Error reported before the authorization code was used: level=error ts=2020-07-17T07:48:55.036Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="email-receiver/email[1]: notify retry canceled after 8 attempts: *smtp.plainAuth auth: unencrypted connection; email-receiver/email[0]: notify retry canceled after 8 attempts: *smtp.plainAuth auth: unencrypted connection"
In QQ Mail, go to Settings -> Account, enable the POP3/IMAP/SMTP/Exchange/CardDAV/CalDAV service, and copy the generated authorization code.
docker restart alertmanager
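To confirm the configuration was loaded and that alerts actually reach Alertmanager, amtool ships inside the container and the v2 API can be queried. A quick check, assuming the container is named alertmanager and the config is mounted at the path used above:
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://localhost:9093/api/v2/alerts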
docker exec -ti grafana bash
vi /usr/share/grafana/conf/defaults.ini
# The public facing domain name used to access grafana from a browser
domain = 10.88.89.60
docker restart grafana for the domain change to take effect.
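Editing defaults.ini inside the container will not survive recreating the container. An alternative is to pass the same setting as an environment variable when starting Grafana (a sketch; GF_SERVER_DOMAIN maps to the domain key of the [server] section):
docker run -d -p 3000:3000 --name grafana \
  -e GF_SERVER_DOMAIN=10.88.89.60 \
  grafana/grafana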