Alertmanager Email Alert Configuration

Author: 风吹路过的云 | Published: 2020-04-08 14:37

    This post walks through a simple Prometheus monitoring example: an email alert for low disk space.
    The prometheus.yml configuration file:

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ['172.1.5.220:9093']
          # - alertmanager:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "node_down.yml"
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          - targets: ['172.1.5.220:9090']
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['172.1.5.220:8080']
      - job_name: 'harbor-250'
        static_configs:
          - targets: ['192.168.8.250:4080']
      - job_name: 'java-demo'
        scrape_interval: 5s
        metrics_path: '/actuator/prometheus'
        static_configs:
          - targets: ['192.168.9.222:8080']
      - job_name: 'node'
        scrape_interval: 8s
        static_configs:
          - targets: ['172.1.5.220:9100', '192.168.9.223:9100', '192.168.8.250:4100']
    

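    node_down.yml is shown below. The HostOutOfDiskSpace rule fires when the available disk space stays below 60% for 5 minutes.
    If you are not sure how to write rules, the awesome-prometheus-alerts rule collection (link at the end of this post) has many ready-made examples; I based mine on it as well.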

    groups:
    - name: node_down
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          user: test
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
    - name: out_of_disk_space
      rules: 
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes{mountpoint="/etc/hostname"}  * 100) / node_filesystem_size_bytes{mountpoint="/etc/hostname"} < 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 60% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
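
    The mountpoint label in the rule above depends on how node_exporter sees the host's filesystems; a value such as /etc/hostname typically shows up when the exporter runs inside a container. As a sketch for an exporter running directly on the host, the variant below watches the root filesystem instead. The group name, the alert name HostRootDiskSpaceLow, and the 20% threshold are illustrative assumptions, not part of the original setup.

    ```yaml
    groups:
    - name: root_disk_space            # hypothetical group name
      rules:
      - alert: HostRootDiskSpaceLow    # hypothetical alert name
        # Percentage of space still available on "/"; fires below 20%.
        expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on / (instance {{ $labels.instance }})"
          description: 'Only {{ printf "%.1f" $value }}% of / is still available.'
    ```

    The group can be appended to node_down.yml, or saved as a separate file that is then listed under rule_files in prometheus.yml.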
    

    alertmanager.yml is shown below. My company uses Alibaba enterprise mail (阿里企业邮箱), so the smarthost is smtp.qiye.aliyun.com:465.

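    global:
      # Port 465 is an implicit-TLS port: Alertmanager opens the connection over TLS directly.
      smtp_smarthost: 'smtp.qiye.aliyun.com:465'
      smtp_from: 'abc@xxx.com'
      smtp_auth_username: 'abc@xxx.com'
      smtp_auth_password: 't43123456'
      # Disables STARTTLS, which is not used on an implicit-TLS port.
      smtp_require_tls: false

    route:
      group_by: ['alertname']   # batch alerts that share an alertname into one notification
      group_wait: 10s           # wait before sending the first notification for a new group
      group_interval: 10s       # wait before notifying about alerts added to an existing group
      repeat_interval: 10m      # resend interval while an alert keeps firing
      receiver: live-monitoring

    receivers:
      - name: 'live-monitoring'
        email_configs:
        - to: '3424354443@qq.com'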
    

    The smtp_smarthost setting matters here. At first I assumed that my company's own domain, smtp.xxx.com:25, would work, but it produced the following errors:

    level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com" context_err="context deadline exceeded"
    level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com"
    

    After searching online, I found that the correct host is smtp.qiye.aliyun.com:465.
    The result looks like this:

    [Screenshots: used disk size; the alert in the firing state; the alert email]
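
    The certificate error above is a hostname mismatch: smtp.xxx.com evidently points at Alibaba's mail servers, whose certificate only covers *.mxhichina.com and mxhichina.com, so TLS verification fails for the company alias. Besides switching to smtp.qiye.aliyun.com:465, Alertmanager's email_configs accept a tls_config block whose server_name overrides the name the certificate is checked against. The following is an untested sketch of that alternative; the server_name value smtp.mxhichina.com is an assumption chosen only because it matches the wildcard in the error message, and the per-receiver smarthost and require_tls overrides are shown for illustration.

    ```yaml
    receivers:
      - name: 'live-monitoring'
        email_configs:
          - to: '3424354443@qq.com'
            # Per-receiver override of the global smtp_smarthost.
            smarthost: 'smtp.xxx.com:25'
            # Upgrade the plain connection with STARTTLS before authenticating.
            require_tls: true
            tls_config:
              # ASSUMPTION: any name covered by the server certificate
              # (*.mxhichina.com in the error log); adjust to your provider.
              server_name: 'smtp.mxhichina.com'
    ```

    With this variant the global smtp_from and smtp_auth_* settings still apply. Using the provider's own hostname on port 465, as in the configuration above, remains the simpler fix.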

    References:
    https://awesome-prometheus-alerts.grep.to/rules
