Monitoring alerts based on prometheus/alertmanager

Author: 天草二十六_简村人 | Published 2022-12-11 16:54

    I. Background

    Prometheus already scrapes plenty of metrics for our machines, applications and middleware, but we have been missing alerting; after all, nobody can stare at a Grafana dashboard around the clock.

    This article only covers the alerting implementation. It assumes you already have a basic understanding of Prometheus and some programming experience.

    Using Alertmanager for alerting is the official Prometheus recommendation. Very little code is involved here: all we need to do is configure one webhook callback endpoint. The core of the implementation is not in Alertmanager itself; hopefully that does not disappoint you.

    II. Goals

    • 1. Detect machine failures, abnormal metrics and similar problems promptly.
    • 2. Add alerting on top of the existing monitoring, in every environment. Alerting rules are released the same way application code is: from development to test, and then to production.

    III. Deployment diagram

    (Figure: deployment diagram)

    Prometheus can monitor many kinds of targets; besides the ones listed here, they also include containers, Prometheus itself, and more.

    IV. Alert implementation

    1. prometheus

    Startup command: nohup ./prometheus --web.enable-lifecycle --web.enable-admin-api --storage.tsdb.retention=60d &
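
    The --web.enable-lifecycle flag exposes Prometheus' HTTP lifecycle endpoints, so configuration and rule changes can be reloaded without restarting the process. A minimal check-and-reload sketch, assuming Prometheus listens on its default port 9090:

    # validate the configuration and the rule files it references (promtool ships with the Prometheus release)
    ./promtool check config prometheus.yml
    # ask the running instance to reload it (only works with --web.enable-lifecycle)
    curl -X POST http://localhost:9090/-/reload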

    prometheus.yml

    This is where the Alertmanager address, the metric rule files and the scrape targets are configured. Scrape targets can be defined in a custom JSON file (file_sd), discovered from a registry such as Consul, or listed statically as an array of targets.

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 127.0.0.1:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "/opt/prometheus-2.17.2.linux-amd64/rules/*.yml"
       #- "first_rules.yml"
       #- "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'file_sd'
        metrics_path: '/metrics'
        file_sd_configs:
          - files:
            - linux-targets.json
      - job_name: 'consul-prometheus'
        metrics_path: '/mgm/prometheus'
        consul_sd_configs:
        - server: '192.168.50.61:8500'
          services: []
      - job_name: 'cAdvisor'
        metrics_path: '/metrics'
        static_configs:
        - targets: ['192.168.10.150:8091','192.168.10.120:8091','192.168.5.66:8091']
      - job_name: cwp-to-video
        metrics_path: '/mgm/prometheus'
        static_configs:
        - targets: ['192.168.53.29:7109']
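
    The node-up rule further down matches on job="linux", which is not one of the job names above; with file_sd the target file itself can attach that label. A hypothetical linux-targets.json for the file_sd job might look like this (the addresses are placeholders for node_exporter endpoints):

    [
      {
        "targets": ["10.0.0.1:9100", "10.0.0.2:9100"],
        "labels": { "job": "linux" }
      }
    ]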
    

    Rule files (alerting rules)

    As the main configuration above shows, the rule files live under /opt/prometheus-2.17.2.linux-amd64/rules/*.yml.

    • node-up.yml
    groups:
    - name: node-rule
      rules:
      - alert: linux machine
        expr: up{job="linux"} == 0 # only watches linux machines going up/down, not services
        for: 120s
        labels:
          severity: warning
        annotations:
          summary: "Machine {{ $labels.instance }} is down"
          description: "Alert! Please check immediately."
          value: "{{ $value }}"
    
    • api.yml
    groups:
    - name: api-rule
      rules:
      - alert: "Slow APIs over 3 seconds"
        expr: sum(increase(http_server_requests_seconds_count{}[1m])) by (application) - sum(increase(http_server_requests_seconds_bucket{le="3.0"}[1m])) by (application) > 10
        for: 120s
        labels:
          severity: warning
          application: "{{$labels.application}}"
        annotations:
          summary: "Service {{$labels.application}} had more than 10 requests slower than 3 seconds"
          description: "Monitors the number of slow (3s+) requests per application"
          value: "{{ $value }}"

      - alert: "APIs returning 5xx errors"
        expr: sum(increase(http_server_requests_seconds_count{status=~"5.."}[1m])) by (application)  > 10
        for: 120s
        labels:
          severity: warning
          application: "{{$labels.application}}"
        annotations:
          summary: "Service {{$labels.application}} returned more than 10 responses with 5xx errors"
          description: "Monitors the number of 5xx errors per application"
          value: "{{ $value }}"
    
    • logback.yml
    groups:
    - name: logback-rule
      rules:
      - alert: "Error log alert"
        expr: sum by (application) (increase(logback_events_total{level="error"}[1m]))  > 10
        for: 15s
        labels:
          application: "{{$labels.application}}"
          severity: warning
        annotations:
          summary: "Service {{$labels.application}} logged more than 10 error entries in one minute"
          description: "Application alert value: {{ $value }}"
          value: "{{ $value }}"
    
    • disk.yml
    groups:
    - name: disk-rule
      rules:
      - alert: "Disk space alert"
        expr: 100 - (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 80
        for: 60s
        labels:
          severity: warning
        annotations:
          summary: "Instance {{$labels.instance}} disk usage is above 80%"
          description: "Dev-environment machine alert, value: {{ $value }}"
          value: "{{ $value }}"
    
    • cpu.yml
    groups:
    - name: cpu-rule
      rules:
      - alert: "CPU alert"
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 120s
        labels:
          severity: warning
          instance: "{{$labels.instance}}"
        annotations:
          summary: "Linux machine {{$labels.instance}} CPU usage is above 80%"
          description: "Dev-environment machine alert, value: {{ $value }}"
          value: "{{ $value }}"

      - alert: "linux load5 over 5"
        for: 120s
        expr: node_load5 > 5
        labels:
          severity: warning
          instance: "{{$labels.instance}}"
        annotations:
          description: "{{ $labels.instance }} load5 over 5, current value: {{ $value }}"
          summary: "linux load5 over 5"
          value: "{{ $value }}"
    
    • memory.yml
    groups:
    - name: memory-rule
      rules:
      - alert: "High memory usage"
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 80
        for: 120s
        labels:
          severity: warning
        annotations:
          summary: "Instance {{$labels.instance}} memory usage is above 80%"
          description: "Dev-environment machine alert: memory usage is too high."
          value: "{{ $value }}"

      - alert: "Low memory warning"
        expr: (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 < 10
        for: 120s
        labels:
          severity: warning
        annotations:
          summary: "Linux machine {{$labels.instance}} available memory has dropped below 10%"
          description: "Dev-environment machine alert: low memory."
          value: "{{ $value }}"
    
    • network.yml
    groups:
    - name: network-rule
      rules:
      - alert: "eth0 input traffic network over 10M"
        expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        for: 60s
        labels:
          severity: warning
          instance: "{{$labels.instance}}"
        annotations:
          summary: "eth0 input traffic network over 10M"
          description: "{{$labels.instance}} inbound traffic: {{ $value }} Mbit/s"
          value: "{{ $value }}"

      - alert: "eth0 output traffic network over 10M"
        expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        for: 60s
        labels:
          severity: warning
          instance: "{{$labels.instance}}"
        annotations:
          summary: "eth0 output traffic network over 10M"
          description: "{{$labels.instance}} outbound traffic: {{ $value }} Mbit/s"
          value: "{{ $value }}"
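
    Since one of the goals is to release rule files through dev, test and production just like application code, it helps to validate them before shipping. A small sketch, using the rules directory from the main configuration above:

    ./promtool check rules /opt/prometheus-2.17.2.linux-amd64/rules/*.yml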
    

    2. alertmanager

    Startup command

    nohup ./alertmanager  2>&1 | tee -a alertmanager.log &
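
    Before starting (or restarting) Alertmanager, the configuration below can be sanity-checked with amtool, which ships with the Alertmanager release. A minimal sketch, assuming alertmanager.yml sits in the working directory:

    ./amtool check-config alertmanager.yml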
    

    alertmanager.yml

    This is where alerts are grouped and labelled, and where the custom webhook callback is configured (the actual sending of notifications is implemented behind that endpoint).

    global:
      resolve_timeout: 5m
    
    route:
      group_wait: 30s # wait this long for further alerts of the same group, so alerts arriving within 30s are sent together
      group_interval: 5m # when a group that has already been notified changes, wait 5m before sending an updated (merged) notification
      repeat_interval: 24h # if an alert is still firing and unresolved after this interval, the notification is sent again
      group_by: ['alertname']  # how alerts are grouped
      receiver: 'webhook'
    
    receivers:
    - name: 'webhook'
      webhook_configs:
      - url: 'http://192.168.10.47/devops/api/prometheus/notify?env=dev'
    
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
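
    To verify the route and the webhook wiring end to end, a test alert can be pushed straight into Alertmanager's v2 API. A sketch with arbitrary test labels:

    curl -X POST http://127.0.0.1:9093/api/v2/alerts \
      -H 'Content-Type: application/json' \
      -d '[{"labels":{"alertname":"webhook-smoke-test","severity":"warning"},"annotations":{"summary":"test alert"}}]'

    After group_wait (30s here) has elapsed, the URL configured under webhook_configs should receive a callback in the format shown in the next section.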
    

    3. Sending the alerts

    • Received payload
    {
        "receiver":"webhook",
        "status":"resolved",
        "alerts":[
            {
                "status":"resolved",
                "labels":{
                    "alertname":"linux machine",
                    "instance":"192.168.8.18",
                    "job":"linux",
                    "severity":"warning"
                },
                "annotations":{
                    "description":"Alert! Please check immediately.",
                    "summary":"Machine 192.168.8.18 is down",
                    "value":"0"
                },
                "startsAt":"2022-05-27T16:57:03.205485317+08:00",
                "endsAt":"2022-12-12T15:05:03.205485317+08:00",
                "generatorURL":"http://CplusSev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1",
                "fingerprint":"f6e41781dace19ad"
            }
        ],
        "groupLabels":{
            "alertname":"linux machine"
        },
        "commonLabels":{
            "alertname":"linux machine",
            "job":"linux",
            "severity":"warning"
        },
        "commonAnnotations":{
            "description":"Alert! Please check immediately.",
            "value":"0"
        },
        "externalURL":"http://CplusSev0201:9093",
        "version":"4",
        "groupKey":"{}:{alertname=\"linux machine\"}",
        "truncatedAlerts":0
    }
    
    
    • Processing the payload

    This is far more convenient than implementing notification delivery inside Alertmanager itself. The core problem is knowing who is responsible for a given machine or application, which really belongs to the CMDB domain. The suggestion is therefore to use a tagging mechanism: tag machines, middleware and applications, map those tags to the labels on the alert messages, and the alert receiver is easy to find. This is one of the reasons we write our own callback endpoint.
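
    A minimal sketch of such a label-to-owner lookup; the class, the table contents and the fallback number are hypothetical, and in practice the data would be loaded from the CMDB/tag system rather than hard-coded:

    import java.util.Map;

    // Hypothetical sketch: resolve the alert receiver from the "application" label carried by the alert.
    public class AlertOwnerRegistry {

        // In a real setup this mapping would come from the CMDB / tag system.
        private static final Map<String, String> OWNER_PHONE_BY_APPLICATION = Map.of(
                "cwp-to-video", "138xxxx0001",
                "devops-service", "138xxxx0002");

        // Returns the owner's phone for an application label, falling back to a default on-call number.
        public static String resolveOwnerPhone(String application) {
            return OWNER_PHONE_BY_APPLICATION.getOrDefault(application, "138xxxx0000");
        }
    }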

    • Environments are distinguished by the env variable, passed as a request parameter to the callback endpoint.
    • The message body does not use a template, because Alertmanager sends the alerts in batches and already aggregates them for us.
    • Finally there is the sending strategy: you can notify an individual, a specific group, send an SMS, or simply log an error (and alert through Sentry). With a UI you would normally let administrators choose the alerting time window, since an outage in a dev or test environment has far lower priority than production.
            // Excerpt from the webhook callback handler. Assumed context (not shown in the original): a Spring MVC
            // controller method receiving the Alertmanager POST body as requestJson, with fastjson, hutool's DateUtil
            // and an SLF4J logger available.
            JSONObject jsonObject = JSON.parseObject(requestJson);

            String alerts = jsonObject.getString("alerts");
            if (StringUtils.isEmpty(alerts)) {
                if (log.isWarnEnabled()) {
                    log.warn("The alert list in the Prometheus callback is empty");
                }
                return ResponseEntity.noContent().build();
            }

            // env is one of dev/test/prod; defaults to prod
            String env = StringUtils.isEmpty(request.getParameter("env")) ? "prod" : request.getParameter("env");

            List<AlertDTO> alertDTOList = JSONObject.parseArray(alerts, AlertDTO.class);

            StringBuilder content = new StringBuilder("> A Prometheus alert has fired and needs prompt follow-up!!\n");

            for (AlertDTO alert : alertDTOList) {

                content.append("> =======start=========\n\n");

                content.append("> **Alert type:** ").append(alert.getLabels().getAlertname()).append("\n\n");
                content.append("> **Alert summary:** ").append(alert.getAnnotations().getSummary()).append("\n\n");
                content.append("> **Alert details:** ").append(alert.getAnnotations().getDescription()).append("\n\n");
                content.append("> **Trigger value:** ").append(alert.getAnnotations().getValue()).append("\n\n");

                content.append("> **Triggered at:** ").append(this.formatDateTime(alert.getStartsAt())).append("\n\n");
                content.append("> **Link:** ").append(this.replaceUrl(alert.getGeneratorURL()))
                        .append("[click to open](").append(this.replaceUrl(alert.getGeneratorURL())).append(")").append("\n\n");

                content.append("> =======end=========\n\n");

                content.append("\n\n\n\n\n");
            }

            // different environments use different sending strategies
            switch (env) {
                case "dev":
                case "test":
                    int iHour = DateUtil.thisHour(true);
                    if (iHour >= 8 && iHour <= 18) {
                        WxchatMessageUtil.sendByPhone(content.toString(), "150xxxx9916");
                    } else {
                        log.error("Alert triggered! Outside working hours, so only a log entry is written.");
                    }
                    break;
                case "prod":
                    WxchatMessageUtil.sendByRobot(content.toString(), "a82xx480-3b64-485a-8c25-b90c483308cc");
                    break;
                default:
                    break;
            }
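
    The AlertDTO type used above is not shown in the original article. A minimal sketch, assuming Lombok @Data for getters/setters, with field names mirroring the webhook payload shown earlier (only the fields the handler actually reads are included):

    import lombok.Data;

    // Minimal sketch of the DTO each element of the "alerts" array is bound to by fastjson.
    @Data
    public class AlertDTO {

        private String status;        // firing / resolved
        private Labels labels;
        private Annotations annotations;
        private String startsAt;
        private String endsAt;
        private String generatorURL;
        private String fingerprint;

        @Data
        public static class Labels {
            private String alertname;
            private String instance;
            private String job;
            private String severity;
            private String application;
        }

        @Data
        public static class Annotations {
            private String summary;
            private String description;
            private String value;
        }
    }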
    
    • Sending the notification

    What would otherwise be complex configuration inside Alertmanager is implemented in the devops-service instead, which can drive multiple delivery channels and is far more flexible. Strongly recommended.

    • How to actually deliver WeChat Work, SMS or DingTalk messages is outside the scope of this article and is not covered here.
    (Screenshot: the resulting alert message in WeChat Work)
