Prometheus Email Alerting

Author: 小李飞刀_lql | Published: 2022-01-13 09:18

    Alertmanager

    Configure the service

    [root@k8smaster ~]# vi /usr/lib/systemd/system/alertmanager.service 
    [Unit]
    Description=alertmanager
    [Service]
    ExecStart=/opt/monitor/alertmanager/alertmanager --config.file=/opt/monitor/alertmanager/alertmanager.yml
    ExecReload=/bin/kill -HUP $MAINPID
    KillMode=process
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target
    

    Start the service

    [root@k8smaster alertmanager]# systemctl daemon-reload
    [root@k8smaster alertmanager]# systemctl restart alertmanager
    

    Configure email delivery

    [root@k8smaster alertmanager]# vi alertmanager.yml 
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'lql_h@163.com'
      smtp_auth_username: 'lql_h@163.com'
      smtp_auth_password: 'BBTGIGYUNBZNNQEB'
      smtp_require_tls: false
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'lql'
    receivers:
    - name: 'lql'
      email_configs:
      - to: 'lql_h@163.com'
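
To confirm that the route and email receiver above actually deliver mail, one option is to push a hand-made alert into Alertmanager instead of waiting for a real one. The following is a minimal sketch, assuming Alertmanager listens on localhost:9093 and exposes its v2 API; the alert name `EmailRouteTest`, the `test-host` instance label, and the helper function names are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: Alertmanager runs locally on its default port.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_test_alert(now=None):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        # Hypothetical labels; "alertname" is the label the route groups by.
        "labels": {
            "alertname": "EmailRouteTest",
            "severity": "warning",
            "instance": "test-host",
        },
        "annotations": {"summary": "Manual test alert for the email receiver"},
        "startsAt": now.isoformat(),
        # The alert resolves by itself five minutes after it starts.
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]


def send_test_alert(url=ALERTMANAGER_URL):
    """POST the test alert and return the HTTP status code."""
    body = json.dumps(build_test_alert()).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Call send_test_alert() once Alertmanager is running; with group_wait set
# to 10s the email should go out shortly afterwards.
```

If the request succeeds but no mail arrives, the Alertmanager log is the first place to look for SMTP errors.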
    

    Prometheus configuration file

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - localhost:9093
    
    
    rule_files:
      - "/opt/monitor/prometheus/rules/*.yml"
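
The `rule_files` entry is a glob pattern, so only files whose names end in `.yml` are loaded; a rule file saved as `.yaml` would be ignored. A small sketch of that matching behaviour, using Python's `glob` module as a stand-in for Prometheus's own matching and a temporary directory with hypothetical file names:

```python
import glob
import os
import tempfile


def matching_rule_files(pattern):
    """Return the sorted list of paths that match a rule_files glob pattern."""
    return sorted(glob.glob(pattern))


# Demonstration in a throwaway directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as rules_dir:
    for name in ("general.yml", "node.yml", "extra.yaml"):
        open(os.path.join(rules_dir, name), "w").close()
    pattern = os.path.join(rules_dir, "*.yml")
    matched = [os.path.basename(path) for path in matching_rule_files(pattern)]

# extra.yaml is not matched by *.yml
print(matched)
```

On the real host the same function can be pointed at "/opt/monitor/prometheus/rules/*.yml" to list the files the pattern covers.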
    
    

    Alert rule configuration

    Check whether services are up

    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 5 minutes."
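
The `for: 5m` clause in the rule above means the alert does not fire the moment `up == 0` is first seen. It stays pending until the condition has held continuously for five minutes, and any healthy evaluation in between resets the clock. The sketch below simulates that behaviour; it is an illustration rather than Prometheus code, and the 60-second evaluation interval and the sample series are assumptions.

```python
from typing import List


def alert_states(up_samples: List[int],
                 eval_interval_s: int = 60,
                 for_s: int = 300) -> List[str]:
    """Return the alert state after each rule evaluation.

    up_samples holds the value of `up` seen at each evaluation
    (1 = target healthy, 0 = target down).
    """
    states = []
    active_since = None  # time at which `up == 0` first became true
    for index, up in enumerate(up_samples):
        now = index * eval_interval_s
        if up != 0:
            # Condition no longer holds: the pending timer is reset.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Hypothetical series: healthy, then down for six evaluations, then healthy.
print(alert_states([1, 0, 0, 0, 0, 0, 0, 1]))
```

With these inputs the alert is pending for five evaluations and fires on the sixth, which is the first point where the outage has lasted a full 300 seconds.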
    

    Check CPU, memory, and disk metrics

    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: {{$labels.mountpoint }} partition usage is high"
          description: "{{$labels.instance}}: {{$labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: memory usage is high"
          description: "{{$labels.instance}}: memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: CPU usage is high"
          description: "{{$labels.instance}}: CPU usage is above 80% (current value: {{ $value }})"
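
Each of the three expressions above turns a "free" or "idle" fraction into a usage percentage and compares it with 80. The arithmetic can be checked by hand with a few numbers; the sketch below mirrors the expressions in plain Python. All input values are hypothetical and the function names are only for illustration.

```python
GIB = 1024 ** 3
THRESHOLD = 80  # same threshold as the rules


def filesystem_usage_percent(free_bytes, size_bytes):
    """Mirror of: 100 - (free / size * 100)."""
    return 100 - (free_bytes / size_bytes * 100)


def memory_usage_percent(mem_free, cached, buffers, mem_total):
    """Mirror of: 100 - (MemFree + Cached + Buffers) / MemTotal * 100."""
    return 100 - (mem_free + cached + buffers) / mem_total * 100


def cpu_usage_percent(idle_rates):
    """Mirror of: 100 - (avg of per-core idle rates * 100)."""
    return 100 - (sum(idle_rates) / len(idle_rates) * 100)


# Hypothetical host: 100 GiB disk with 10 GiB free, 16 GiB RAM with
# 1 GiB free + 2 GiB cached + 1 GiB buffers, two cores at 25% and 0% idle.
fs = filesystem_usage_percent(10 * GIB, 100 * GIB)
mem = memory_usage_percent(1 * GIB, 2 * GIB, 1 * GIB, 16 * GIB)
cpu = cpu_usage_percent([0.25, 0.0])

for name, value in (("filesystem", fs), ("memory", mem), ("cpu", cpu)):
    status = "would alert" if value > THRESHOLD else "ok"
    print(f"{name}: {value:.1f}% ({status})")
```

In this example the filesystem and CPU values exceed the threshold while memory does not, because cached and buffer memory count as available in the memory expression.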
    
    

    View the alert in Prometheus after it fires

    1635752971404.png

    Check whether the relevant metrics triggered the alert

    1635820036174.png

    Email sent after the alert fires

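    # One email is sent per alert group (alerts are grouped by alertname)
    # Every series returned by the PromQL query is listed in the email, one entry each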
    
    1635820061504.png
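
The email behaviour follows from `group_by: ['alertname']` in the route: alerts that share an alert name are bundled into one notification, and each matching series appears as its own entry. A rough sketch of that bucketing, with hypothetical label sets and a helper name chosen for illustration (this is not Alertmanager's implementation):

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical firing alerts: two filesystem series and one memory series.
firing = [
    {"alertname": "NodeFilesystemUsage", "instance": "host1:9100", "mountpoint": "/"},
    {"alertname": "NodeFilesystemUsage", "instance": "host1:9100", "mountpoint": "/boot"},
    {"alertname": "NodeMemoryUsage", "instance": "host1:9100"},
]

for key, members in group_alerts(firing).items():
    print(f"email for {key[0]}: {len(members)} alert(s)")
```

Here three firing series produce two emails, one of which lists both filesystem entries.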

    Article link: https://www.haomeiwen.com/subject/hvomcrtx.html