
Prometheus Grafana

Author: 手捧樱花v | Published 2020-07-16 00:07

Description:
Prometheus handles monitoring, alerting, and scraping and storing metrics; Grafana handles visualization.
The HTTP endpoint that exposes a monitored component's metrics is called an exporter.
metric + labels: http_requests_total{method="POST",endpoint="/api/tracks"}
For the metric name http_requests_total, adding or removing a label creates a new time series; combinations of labels are used to query and aggregate results.
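As a hedged illustration of querying by label combinations (the metric and labels mirror the example above; actual series depend on your exporters):

# POST request rate over the last 5 minutes, aggregated per endpoint
sum by (endpoint) (rate(http_requests_total{method="POST"}[5m]))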
Prometheus has four metric types:
1) Counter: a cumulative value that only increases and never decreases; it resets when the process restarts.
2) Gauge: a regular value that can go up or down, e.g. temperature or memory usage; it resets when the process restarts.
3) Histogram: tracks the distribution of observed values, e.g. request latency or response size. Its distinguishing feature is that observations are grouped into buckets, and it also provides a count and a sum of all observed values.
4) Summary: like Histogram it provides count and sum, and additionally exposes quantiles that split the tracked results by percentage. For example, quantile 0.95 is the 95th percentile of the sampled values.
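For reference, this is roughly what the four types look like in the text exposition format served by an exporter; the metric names here are illustrative:

# Counter
http_requests_total{method="POST",endpoint="/api/tracks"} 1027
# Gauge
node_memory_MemAvailable_bytes 2.147483648e+09
# Histogram: cumulative le buckets plus _sum and _count
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="+Inf"} 320
http_request_duration_seconds_sum 47.3
http_request_duration_seconds_count 320
# Summary: precomputed quantiles plus _sum and _count
rpc_duration_seconds{quantile="0.95"} 0.42
rpc_duration_seconds_sum 89.1
rpc_duration_seconds_count 312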
Images on the monitoring host:

docker pull prom/node-exporter
docker pull prom/prometheus
docker pull grafana/grafana

Install the image on the monitored hosts (required on every monitored machine):

docker pull prom/node-exporter
docker run -d -p 9100:9100 \
  -v "/proc:/host/proc:ro" \
  -v "/sys:/host/sys:ro" \
  -v "/:/rootfs:ro" \
  --net="host" \
  --name sakura \
  prom/node-exporter
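To confirm the exporter is up (a quick check; the port follows the run command above):

curl -s http://localhost:9100/metrics | grep ^node_cpu_seconds_total | head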

Command:

└─[$] <> su
Password:
sh-3.2# mkdir -p /opt/prometheus/
sh-3.2# chmod 777 /opt/prometheus/
sh-3.2# exit
exit
┌─[web@WebdeMBP] - [~] - [三  7 15, 23:27]
└─[$] <> cd /opt/prometheus/
vim prometheus.yml
global:
  scrape_interval:     60s   # how often Prometheus scrapes metrics endpoints
  evaluation_interval: 60s   # how often alerting/recording rules are evaluated

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus

  - job_name: linux
    static_configs:
      - targets: ['10.88.88.4:9100']
        labels:
          instance: test_node
docker run -d \
  -p 9090:9090 \
  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
docker run -d -p 3000:3000 --name grafana grafana/grafana
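A quick way to verify both containers are up (assuming the default ports; Prometheus exposes /-/healthy and Grafana exposes /api/health):

curl -s http://localhost:9090/-/healthy    # Prometheus health endpoint
curl -s http://localhost:3000/api/health   # Grafana health endpoint (returns JSON)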

Grafana default login: admin/admin


[Screenshot: targets in the healthy state]
Configuring the Prometheus data source with localhost did not take effect; replace it with the host IP.
Import a dashboard
[Screenshot: first login to Grafana]

Polled data source: localhost:9090/metrics

Click the plus sign in the navigation bar, choose Import, and import the dashboard via its JSON or URL to add a Chinese-language dashboard, as shown:

1-node-exporter-for-prometheus-dashboard-cn20200628version
The job/instance dropdowns in the dashboard match what is configured in prometheus.yml.

Configure alerting rules (Alertmanager fires notifications based on these rules):

  1. vim rules.yml
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="linux",mode="idle"}[5m])))  * 100 > 85
    for: 1m
    labels:
      instance: test_node
      job: linux
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usgae high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (1-(node_memory_MemAvailable_bytes{job="linux"} / (node_memory_MemTotal_bytes{job="linux"})))* 100 > 85
    for: 1m
    labels:
      instance: test_node
      job: linux
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usgae high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
  2. vim prometheus.yml (use Prometheus's promtool to check the rules.yml syntax; see the sketch after the config below)
# Alertmanager configuration
alerting:
   alertmanagers:
     - static_configs:
         - targets: ["192.168.199.204:9093"]
               # - alertmanager:9093


# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "rules.yml"
docker run -d -p 9090:9090 --name=prometheus \
-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /opt/prometheus/rules.yml:/etc/prometheus/rules.yml \
prom/prometheus

If localhost:9090/rules does not list the alerting rules after the container starts:

[Screenshot: alerting rules page]

Try docker logs -f prometheus to inspect the logs. If you see errors like the ones below, the problem is in rules.yml; check its format, contents, and IP configuration.

level=error ts=2020-07-16T09:14:31.981Z caller=manager.go:904 component="rule manager" msg="loading groups failed" err="/etc/prometheus/rules.yml: yaml: line 1: did not find expected key"
level=error ts=2020-07-16T09:14:31.981Z caller=main.go:818 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"

localhost:9090/alerts shows the current state of active alerts.
To test alerting, raise CPU usage manually with cat /dev/zero > /dev/null, or lower the threshold in the expr expression.

To avoid sending a flood of similar notifications, related alerts can be grouped and delivered as a single notification. The grouping mechanism merges detailed alerts into one message; for example, when a system outage triggers a large number of alerts at once, grouping combines them into a single notification instead of delivering each one individually:
group_by: ['alertname', 'job']
After a new alert group is created, Alertmanager waits at least group_wait before sending the initial notification.
This effectively buffers the alerts coming from Prometheus, grouping them by identical labels rather than sending every one immediately:
group_by: ['alertname', 'job']
group_wait: 45s # typically 0s to a few minutes
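A minimal sketch of where these grouping options sit in alertmanager.yml (the receiver name matches the config shown later; the interval values are illustrative):

route:
  receiver: 'email-receiver'
  group_by: ['alertname', 'job']
  group_wait: 45s       # wait before the first notification of a new group
  group_interval: 5m    # wait before sending updated notifications for the group
  repeat_interval: 4h   # wait before re-sending a still-firing notification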
without removes the listed labels from the result and keeps the rest; by does the opposite: only the listed labels are kept in the result vector and all others are dropped. With without and by you can aggregate samples along whichever label dimensions you need.
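For example, with node_cpu_seconds_total (which carries cpu, mode, instance, and job labels), the two forms look like this:

# by: keep only the instance label in the result
avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
# without: drop cpu and mode, keep all other labels
avg without (cpu, mode) (irate(node_cpu_seconds_total{mode="idle"}[5m]))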

Common alert expr examples for reference: https://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_liunx_65_prometheus_alertmanager_rule.html

After a rule's condition is met, the alert stays in the pending state for the duration given by for: 1m; once the condition has held for longer than one minute, it transitions to firing.


[Screenshot: alert in the firing state]

Start Alertmanager (alerts firing in Prometheus are sent to Alertmanager, which handles delivery to third-party receivers):

docker pull prom/alertmanager
vim alertmanager.yml
global:
  resolve_timeout: 5m

  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'xx@qq.com'
  smtp_auth_username: 'xx@qq.com'
  smtp_auth_password: 'nmfzotlhraeqbijh'
  smtp_hello: 'qq.com'
  smtp_require_tls: false

route:
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10s
  receiver: 'email-receiver'

receivers:
  - name: email-receiver
    email_configs:
      - to: <xx@qq.com>
        send_resolved: true
      - to: <xx@qq.com>
        send_resolved: true
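The post later runs docker restart alertmanager, so the container is assumed to have been started once under that name; a minimal sketch (recent prom/alertmanager images read /etc/alertmanager/alertmanager.yml, older tags use /etc/alertmanager/config.yml, and the host path is an assumption):

docker run -d -p 9093:9093 --name alertmanager \
  -v /opt/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager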

smtp_auth_password: 'nmfzotlhraeqbijh' is the mailbox authorization code, not the account password.
Error before using the authorization code: level=error ts=2020-07-17T07:48:55.036Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="email-receiver/email[1]: notify retry canceled after 8 attempts: *smtp.plainAuth auth: unencrypted connection; email-receiver/email[0]: notify retry canceled after 8 attempts: *smtp.plainAuth auth: unencrypted connection"
In QQ Mail, go to Settings → Account, enable the POP3/IMAP/SMTP/Exchange/CardDAV/CalDAV service, then copy the authorization code.
docker restart alertmanager
docker exec -ti grafana bash
vi /usr/share/grafana/conf/defaults.ini

# The public facing domain name used to access grafana from a browser
domain = 10.88.89.60

Restart the Grafana container for the domain change to take effect.
