Learn Prometheus Alertmanager in One Article

Author: sknfie | Published: 2021-06-30 17:33

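    Overview

    Prometheus does not deliver alert notifications itself. It evaluates alerting rules and sends the resulting alerts to a separate component, Alertmanager. Alertmanager receives the alerts from Prometheus, processes them (grouping, deduplication, inhibition, routing) and then notifies the configured receivers.
    The alerting rules themselves are configured in Prometheus, not in Alertmanager.
    Prometheus supports two types of rules:

    • Recording rules
      Recording rules precompute an expression and store the result as a new time series. They are mainly used to keep alerting rules short and to make expressions reusable.
    • Alerting rules
      Alerting rules are the rules that actually decide whether an alert should fire. An alerting rule can use the series produced by recording rules.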

    I. Installation

    1) Install Alertmanager

    [root@localhost opt]# ll alertmanager-0.20.0.linux-amd64.tar.gz 
    -rw-r--r--. 1 root root 23928771 May 21 20:02 alertmanager-0.20.0.linux-amd64.tar.gz
    [root@localhost opt]# tar -zxvf alertmanager-0.20.0.linux-amd64.tar.gz
    [root@localhost opt]# cp -r alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager
    

    2) Register Alertmanager as a systemd service that starts at boot

    [root@localhost ~]# vi /usr/lib/systemd/system/alertmanager.service
    [Unit]
    Description=Prometheus Alertmanager Service daemon
    After=network.target
    
    [Service]
    User=root
    Group=root
    Type=simple
    ExecStart=/usr/local/alertmanager/alertmanager \
        --config.file=/usr/local/alertmanager/alertmanager.yml \
        --storage.path=/usr/local/alertmanager/data/ \
        --data.retention=120h \
        --web.external-url=http://192.168.1.10:9093 \
        --web.listen-address=:9093
    Restart=on-failure
    
    [Install]
    WantedBy=multi-user.target
    
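    # Alertmanager option reference
    # ExecStart=/usr/local/alertmanager/alertmanager  path of the alertmanager binary to run
    # --config.file=/usr/local/alertmanager/alertmanager.yml  path of the Alertmanager configuration file
    # --storage.path=/usr/local/alertmanager/data/  directory where Alertmanager stores its data
    # --data.retention=120h  how long to keep data; the default is 120h
    # --web.external-url  URL under which Alertmanager is reachable from outside; it is used to build the links back to the Alertmanager web UI that appear in notifications. Format: http://{ip or domain}:9093
    # --web.listen-address  address and port to listen on for the web UI and the API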
    
    [root@localhost ~]# systemctl daemon-reload
    [root@localhost ~]# systemctl enable alertmanager.service
    [root@localhost ~]# systemctl restart alertmanager.service
    [root@localhost ~]# systemctl status alertmanager.service
    

    3) Test access to the web UI

    Example address to open in a browser: http://192.168.2.136:9093/#/status
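
    The browser check can also be scripted. The following is a minimal sketch, assuming Alertmanager's `/-/healthy` endpoint is reachable at the example address used above; the helper names `http_status`, `health_url` and `is_healthy` are made up for this illustration.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def health_url(base_url):
    """Build the health-check URL from the Alertmanager base URL."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url, fetch=http_status):
    """True if the health endpoint answers with HTTP 200.

    fetch is injectable so the logic can be exercised without a network.
    """
    return fetch(health_url(base_url)) == 200


if __name__ == "__main__":
    # Example address from this article; replace it with your own host.
    print(is_healthy("http://192.168.2.136:9093"))
```

    A check like this is convenient right after `systemctl restart alertmanager.service`, because it fails cleanly when the service did not come up.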

    Installing with Docker

    1) Pull the Alertmanager image

    [root@localhost ~]# docker pull prom/alertmanager
    

    2) Verify that the image was downloaded

    [root@localhost ~]# docker images
    REPOSITORY                    TAG                 IMAGE ID            CREATED             SIZE
    docker.io/prom/alertmanager   latest              0881eb8f169f        5 months ago        52.1 MB
    

    3) Run the Alertmanager container

    [root@localhost ~]# docker run -d -p 9093:9093 -v /usr/local/alertmanager/simple.yml:/etc/alertmanager/config.yml --name alertmanager prom/alertmanager
    [root@localhost ~]# docker ps 
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
    121610a9f7ee        prom/alertmanager   "/bin/alertmanager..."   17 seconds ago      Up 16 seconds       0.0.0.0:9093->9093/tcp   alertmanager
    

    Alert configuration and monitoring

    1. Configuration

    Open prometheus.yml, uncomment the alerting section and change it as follows:

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 192.168.2.136:9093
    ......
      - job_name: 'AlertManager'
        static_configs:
        - targets: ['192.168.2.136:9093']
    

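    2. Monitoring

    The goal is to check whether node_exporter is up and to send an alert if it is not. First, register the rule files in prometheus.yml: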

    rule_files:
      - "/usr/local/prometheus/rules/*_rules.yml"
    

    Create the rule:

    [root@centos7_9-mod prometheus]# mkdir rules
    [root@centos7_9-mod prometheus]# cd rules
    [root@centos7_9-mod rules]# vi node_rules.yml
    groups:
    - name: test
      rules:
      - alert: prometheus
        expr: up{job="node_exporter"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "node down"
          description: "Node has been down for more than 3 minutes."
    

    Validate the rule file and restart Prometheus:

    [root@centos7_9-mod prometheus]# ./promtool check rules rules/node_rules.yml
    Checking rules/node_rules.yml
      SUCCESS: 1 rules found
    
    systemctl restart prometheus.service
    

    Simulate a failure by stopping node_exporter:

    systemctl stop node_exporter.service
    
    Alerts page
    Alert query
    After 3 minutes the alert turns red (firing)
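
    The delay comes from the `for: 3m` clause: the alert becomes pending as soon as the expression is true and only switches to firing once it has stayed true for the whole duration. The sketch below is a simplified model of that behaviour, not Prometheus code; the function name `alert_states` and the sample data are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(samples: Iterable[Tuple[float, int]],
                 for_seconds: float) -> List[Tuple[float, str]]:
    """Model the state of an `up == 0` alert at each rule evaluation.

    samples is a sequence of (timestamp_seconds, up_value) pairs,
    one pair per evaluation, in chronological order.
    """
    active_since: Optional[float] = None  # when the expression first became true
    states = []
    for ts, up in samples:
        if up == 0:
            if active_since is None:
                active_since = ts
            # Firing only after the condition has held for the full duration.
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            # The condition no longer holds, so the timer resets.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states


if __name__ == "__main__":
    # One evaluation per minute; node_exporter goes down at t=60s.
    samples = [(0, 1), (60, 0), (120, 0), (180, 0), (240, 0), (300, 1)]
    for ts, state in alert_states(samples, for_seconds=180):
        print(ts, state)
```

    In this model a short outage that recovers before the duration elapses never reaches firing, which is why `for` is commonly used to suppress flapping targets.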

    3. Use templates in annotations

    vi rules/node_rules.yml
    groups:
    - name: test
      rules:
      - alert: prometheus
        expr: up{job="node_exporter"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} down.up=={{ $value }}"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 3 minutes."
    
    [root@centos7_9-mod prometheus]# systemctl restart node_exporter.service 
    [root@centos7_9-mod prometheus]# systemctl restart prometheus.service 
    
    Back to normal

    Stop node_exporter again:

    [root@centos7_9-mod prometheus]# systemctl stop node_exporter.service
    

    Email alerts

    1) Edit the default Alertmanager configuration file

    [root@localhost alertmanager]# cat alertmanager.yml 
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.sknfie.com:465'  # SMTP server address (host:port)
      smtp_from: 'sknfie@163.com'    # sender address
      smtp_auth_username: 'sknfie@163.com'  # SMTP user name
      smtp_auth_password: 'rkmdpoviehcvddde'   # mailbox authorization code (password)
      smtp_require_tls: false
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: 'sknfie@163.com'
        headers: { Subject: "WARNING - alert email" }
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
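
    The `inhibit_rules` entry above mutes `warning` alerts while a `critical` alert with the same `alertname`, `dev` and `instance` labels is firing. It only takes effect if the alerts carry a label spelled exactly `severity` with exactly these values. The sketch below is a simplified model of that matching logic, not Alertmanager's implementation; `RULE`, `matches` and `is_inhibited` are names made up for this illustration.

```python
from typing import Dict, Iterable

Labels = Dict[str, str]

# Mirrors the inhibit_rules entry in alertmanager.yml above.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels: Labels, matcher: Labels) -> bool:
    """True if every label in matcher is present with the same value."""
    return all(labels.get(name) == value for name, value in matcher.items())


def is_inhibited(target: Labels, firing: Iterable[Labels], rule: dict = RULE) -> bool:
    """True if some firing source alert mutes the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A label missing on both sides counts as equal (both are empty).
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    critical = {"alertname": "node", "instance": "192.168.1.6:9100", "severity": "critical"}
    warning = {"alertname": "node", "instance": "192.168.1.6:9100", "severity": "warning"}
    print(is_inhibited(warning, [critical, warning]))  # True
```

    In this model a misspelled label name or value on the rule side simply never matches, so the alert is sent as usual and the inhibition silently does nothing.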
    

    2) Check the configuration file and restart the service

    [root@localhost alertmanager]# ./amtool check-config alertmanager.yml 
    Checking 'alertmanager.yml'  SUCCESS
    [root@localhost alertmanager]# systemctl restart alertmanager
    

    3) Edit the Prometheus configuration file

    [root@localhost prometheus]# cat prometheus.yml 
    global:
      scrape_interval:     15s
      evaluation_interval: 15s 
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 192.168.2.136:9093
    rule_files:
      - "/usr/local/prometheus/rules/*.yml"
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']
    
      - job_name: 'node'
        static_configs:
        - targets: ['192.168.1.6:9100']
    
      - job_name: 'Alertmanager'
        static_configs:
        - targets: ['192.168.1.10:9093']
    

    4) Create the alerting rule file

    [root@localhost prometheus]# cat rules/up_rules.yml 
    groups:
    - name: UP
      rules:
      - alert: node
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
          summary: "{{ $labels.instance }} down,up=={{ $value }}"
    

    5) Restart the Prometheus service

    [root@localhost prometheus]# systemctl restart prometheus
    

    6) Test

    Stop node_exporter:

    [root@localhost ~]# systemctl stop node_exporter
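
    To test the email route without stopping a real exporter, an alert can also be posted straight to Alertmanager. The following is a minimal sketch, assuming the Alertmanager v2 HTTP API (`POST /api/v2/alerts`) is reachable at the example address used earlier; the helper names `build_alert` and `post_alerts` and the label values are made up for this illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(alertname, instance, severity, summary, now=None):
    """Build one alert object in the shape accepted by /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return {
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),  # RFC 3339 timestamp
    }


def post_alerts(base_url, alerts, timeout=5.0):
    """POST a list of alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    alert = build_alert("node", "192.168.1.6:9100", "critical", "manual test alert")
    # Example address from this article; replace it with your own host.
    print(post_alerts("http://192.168.2.136:9093", [alert]))
```

    An alert posted this way passes through the same `route` and `receivers` configuration as alerts sent by Prometheus, so it exercises the SMTP settings without touching the monitored host.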
    


          Article link: https://www.haomeiwen.com/subject/yuxrultx.html