Deploying Prometheus with Docker for WeChat and Email Alerting

Author: Anson前行 | Published: 2019-05-06 14:47
    Prometheus Components and Architecture

    The Prometheus ecosystem consists of several components, many of which are optional:

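    • Prometheus Server: collects and stores time series data.
    • Client Library: instruments the service to be monitored, generating its metrics and exposing them to the Prometheus server. When the Prometheus server pulls, the library returns the current state of the metrics.
    • Push Gateway: mainly for short-lived jobs. Because such jobs may exit before Prometheus gets a chance to pull from them, they push their metrics to the Pushgateway instead, and the Prometheus server scrapes the Pushgateway. This approach is intended for service-level metrics; for machine-level metrics, use node exporter.
    • Exporters: expose the metrics of existing third-party services to Prometheus.
    • Alertmanager: receives alerts from the Prometheus server, deduplicates and groups them, routes them to the configured receiver, and sends the notification. Common receivers include email, PagerDuty, OpsGenie, and webhook.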
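
    To make the pull model concrete, here is a minimal sketch of what a client library does on the service side: it renders the current metric values in the Prometheus text exposition format and returns them when scraped. The metric name `demo_requests_total` and the handler class are hypothetical, label values are not escaped, and a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler


def render_metrics(samples):
    """Render (name, labels, value) tuples as text exposition lines.

    Simplification: label values are assumed to need no escaping.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current metrics (hypothetical demo values)."""

    samples = [("demo_requests_total", {"path": "/"}, 3), ("demo_up", {}, 1)]

    def do_GET(self):
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

    Serving it with `http.server.HTTPServer(("", 8000), MetricsHandler).serve_forever()` (port 8000 is an arbitrary choice) would give Prometheus an endpoint to scrape.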

    The architecture diagram from the official Prometheus documentation:

    [Figure: Prometheus architecture diagram]

    As the diagram shows, the main modules of Prometheus are the Prometheus server, exporters, Pushgateway, PromQL, Alertmanager, and the graphical interface.

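    The general workflow is:

    1. The Prometheus server periodically pulls metrics from the configured jobs or exporters, from the Pushgateway (which holds metrics pushed by short-lived jobs), or from other Prometheus servers.
    2. The Prometheus server stores the collected metrics locally and evaluates the defined alert.rules, recording new time series or pushing alerts to Alertmanager.
    3. Alertmanager processes the received alerts according to its configuration file and sends out notifications.
    4. The collected data is visualized in the graphical interface.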

    Prometheus website: https://prometheus.io/

    1. Installing and Configuring Prometheus

    192.168.16.251      Prometheus,grafana,alertmanager,Node-exporter
    192.168.16.252      Node-exporter,Jmx-exporter,Cadvisor
    

    Create the Prometheus configuration file prometheus.yml
    in /root/prometheus/conf/ on the host:

    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    alerting:       # address of the Alertmanager component
      alertmanagers:
      - static_configs:
        - targets: [ '192.168.16.251:9093']
    
    rule_files:  # alerting rule files
      - "rules.yml"
    
    scrape_configs:
      - job_name: 'nodehost'   
        static_configs:
          - targets: ['192.168.16.251:9100']
            labels:
              appname: 'Node1'
          - targets: ['192.168.16.252:9100']
            labels:
              appname: 'Node2'
      - job_name: 'tomcat'
        static_configs:
          - targets: ['192.168.16.173:12345']
            labels:
              appname: 'mytest'
      - job_name: 'cadvisor'
        static_configs:
          - targets: [ '192.168.16.251:8080','192.168.16.252:8080','192.168.16.173:8080']
            labels:
              appname: 'cadvisor'
      - job_name: 'prometheus'
        static_configs:
          - targets: [ '192.168.16.251:9090']
            labels:
              appname: 'prometheus'
    

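    Above, the address of each metrics endpoint is specified statically. As the number of applications grows, adding targets by hand stops being practical. Prometheus supports service discovery and several other mechanisms; see the official documentation at https://prometheus.io/docs/prometheus/latest/configuration/configuration/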
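
    One of the simpler discovery mechanisms is file-based discovery (`file_sd_configs`), where Prometheus reads target lists from JSON or YAML files and picks up changes without a restart. The sketch below, which is an illustration and not part of the original setup, generates such a JSON document from a host list; the function name and the `appname` value are assumptions.

```python
import json


def file_sd_targets(hosts, port, appname):
    """Build a file_sd target group: one list entry with targets and labels."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"appname": appname},
        }
    ]


if __name__ == "__main__":
    groups = file_sd_targets(["192.168.16.251", "192.168.16.252"], 9100, "node")
    # Write the printed JSON to a file referenced by file_sd_configs.
    print(json.dumps(groups, indent=2))
```

    A scrape job would then list the generated file under `file_sd_configs` in place of `static_configs`.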
    Create the Prometheus rules file rules.yml
    in /root/prometheus/conf/ on the host.
    The rules below monitor memory, CPU, disk, and other state of the hosts and containers:

    groups:
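    - name: example # rule group name
      rules:
      - alert: InstanceDown  # alert name
        expr: up == 0   # PromQL expression that triggers the alert
        for: 1m            # the condition must hold for one minute before firing
        labels:       # labels carrying the alert's category and severity
          name: instance
          severity: Critical
        annotations:  # annotations
          summary: " {{ $labels.appname }}" # alert summary: the appname label of the target
          description: "Service is down"   # alert message
          value: "{{ $value }}"  # current value of the expression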
    - name: Host
      rules:
      - alert: HostMemory Usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 >  80
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host memory usage is above 80%."
          value: "{{ $value }}"
      - alert: HostCPU Usage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance,appname) > 0.65
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host CPU usage is above 65%."
          value: "{{ $value }}"
      - alert: HostLoad 
        expr: node_load5 > 4
        for: 1m
        labels:
          name: Load
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }} "
          description: "Host 5-minute load average is above 4."
          value: "{{ $value }}"
      - alert: HostFilesystem Usage
        expr: (1 - node_filesystem_free_bytes / node_filesystem_size_bytes) * 100 > 80
        for: 1m
        labels:
          name: Disk
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host partition [ {{ $labels.mountpoint }} ] usage is above 80%."
          value: "{{ $value }}%"
      - alert: HostDiskio
        expr: irate(node_disk_writes_completed_total{job="nodehost"}[1m]) > 10
        for: 1m
        labels:
          name: Diskio
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host disk [{{ $labels.device }}] write IOPS over the last minute is high."
          value: "{{ $value }}iops"
      - alert: Network_receive
        expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3
        for: 1m
        labels:
          name: Network_receive
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host NIC [{{ $labels.device }}] receive rate is above 3 MiB/s."
          value: "{{ $value }}MiB/s"
      - alert: Network_transmit
        expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3
        for: 1m
        labels:
          name: Network_transmit
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "Host NIC [{{ $labels.device }}] transmit rate is above 3 MiB/s."
          value: "{{ $value }}MiB/s"
    - name: Container
      rules:
      - alert: ContainerCPU Usage
        expr: (sum by(name,instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.name }} "
          description: "Container CPU usage is above 60%."
          value: "{{ $value }}%"
      - alert: ContainerMem Usage
        # expr: (container_memory_usage_bytes - container_memory_cache)  / container_spec_memory_limit_bytes   * 100 > 10
        expr:  container_memory_usage_bytes{name=~".+"}  / 1048576 > 1024
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.name }} "
          description: "Container memory usage is above 1 GiB."
          value: "{{ $value }}MiB"
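
    The arithmetic behind two of these rules is easy to check by hand. The sketch below reproduces the host memory expression and the unit conversion used by the network rules with made-up sample numbers; the function names are illustrative only. Dividing a byte rate by 1048576 yields MiB per second, so a threshold of 3 corresponds to roughly 25 Mbit/s on the wire.

```python
MIB = 1024 * 1024


def memory_usage_percent(total, free, buffers, cached):
    """Same formula as the HostMemory rule: used = total - (free + buffers + cached)."""
    return (total - (free + buffers + cached)) / total * 100


def bytes_per_sec_to_mib(rate):
    """Convert a byte rate to MiB/s, as the network rules do with / 1048576."""
    return rate / MIB


def bytes_per_sec_to_mbit(rate):
    """Convert a byte rate to megabits per second (decimal mega)."""
    return rate * 8 / 1_000_000


if __name__ == "__main__":
    print(memory_usage_percent(total=8000, free=1000, buffers=200, cached=800))
    print(bytes_per_sec_to_mib(3 * MIB), bytes_per_sec_to_mbit(3 * MIB))
```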
    
    

    Deploying Prometheus

    docker run -d -p 9090:9090 --name=prometheus \
     -v  /root/prometheus/conf/:/etc/prometheus/  \
    prom/prometheus 
    

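    The command above uses the official image. Because it is started without --web.enable-lifecycle, hot reloading of the configuration is not available, and the time zone is eight hours off from local time. Both can be changed by modifying the Dockerfile that the project provides.
    Download the source package and build a custom Prometheus image:
    https://github.com/prometheus/prometheus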

    FROM   centos:7
    LABEL maintainer="The Prometheus Authors <prometheus-developers@googlegroups.com>, Custom by <leichen.china@gmail.com>"
    COPY prometheus                             /bin/prometheus
    COPY promtool                               /bin/promtool
    COPY console_libraries/                     /usr/share/prometheus/console_libraries/
    COPY consoles/                              /usr/share/prometheus/consoles/
    
    WORKDIR    /prometheus
    RUN ln -snf /usr/share/zoneinfo/Asia/Shanghai  /etc/localtime
    ENTRYPOINT [ "/bin/prometheus" ]
    CMD        [ "--config.file=/etc/prometheus/prometheus.yml", \
                 "--storage.tsdb.path=/prometheus", \
                 "--web.console.libraries=/usr/share/prometheus/console_libraries", \
                 "--web.enable-lifecycle", \
                 "--web.console.templates=/usr/share/prometheus/consoles" ]
    

    Build the image and run the container

    docker build  -t prometheus:latest .
    docker run -d -p 9090:9090 --name prometheus   -v  /root/prometheus/conf/:/etc/prometheus/    prometheus:latest
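
    With --web.enable-lifecycle set, Prometheus reloads its configuration and rule files when it receives an HTTP POST on /-/reload. A minimal sketch of triggering that from Python follows; the helper names are hypothetical and the address is the one used in this article.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the lifecycle reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(reload_prometheus("http://192.168.16.251:9090"))
```

    After editing prometheus.yml or rules.yml in /root/prometheus/conf/, calling `reload_prometheus` applies the change without restarting the container.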
    

    Open port 9090 of the Prometheus host to view the monitoring data.
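
    The same data is available programmatically through the HTTP API at /api/v1/query. As a sketch, the helpers below build an instant-query URL and extract the instances whose `up` value is 0 from a response; the sample payload is invented for illustration and only follows the shape of a real instant-vector response.

```python
from urllib.parse import urlencode


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def down_instances(payload):
    """Return the instance label of every sample whose value is 0."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    return [
        sample["metric"].get("instance")
        for sample in payload["data"]["result"]
        if float(sample["value"][1]) == 0
    ]


# Invented sample response for the query `up`.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "192.168.16.251:9100"}, "value": [1557125220, "1"]},
            {"metric": {"instance": "192.168.16.252:9100"}, "value": [1557125220, "0"]},
        ],
    },
}
```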



    2. Deploying Node-exporter

    docker run -d -p 9100:9100   -v "/:/host:ro,rslave" quay.io/prometheus/node-exporter --path.rootfs /host
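
    Node-exporter serves its metrics as plain text on /metrics (port 9100 here). To illustrate what Prometheus reads on each scrape, the sketch below parses a few lines of that text format into a dictionary. It is a simplification that skips comment lines and assumes no trailing timestamps; the sample text is invented.

```python
def parse_samples(text):
    """Map each series (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Invented excerpt in the shape of node-exporter output.
SAMPLE_TEXT = """\
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.42
node_memory_MemTotal_bytes 8.201232384e+09
"""
```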
    

    3. Deploying cAdvisor

     docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --detach=true --name=cadvisor --net=host google/cadvisor:latest
    

    Open port 8080 of the cAdvisor host to see the container monitoring metrics.



    4. Deploying Jmx-exporter

    Download the jar: https://github.com/prometheus/jmx_exporter (jmx_prometheus_javaagent-0.11.0.jar)
    Example configuration files: https://github.com/prometheus/jmx_exporter/tree/master/example_configs
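    Add the agent to the middleware startup parameters. The port after `=` is where the exporter listens, so it must match the `tomcat` scrape target in prometheus.yml (12345):
    CATALINA_OPTS="-javaagent:/app/tomcat-8.5.23/lib/jmx_prometheus_javaagent-0.11.0.jar=12345:/app/tomcat-8.5.23/conf/config.yaml"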
    

    For details, see http://www.unmin.club

    5. Installing and Configuring Grafana

    docker run -d -i -p 3000:3000 -e "GF_SERVER_ROOT_URL=http://grafana.server.name" -e "GF_SECURITY_ADMIN_PASSWORD=secret" --net=host grafana/grafana
    

    Open 192.168.16.251:3000 in a browser.
    User: admin, password: secret
    First, add the Prometheus data source.



    Then import dashboard template 8919 to display the Node-exporter data.

    Dashboard templates for container and JMX monitoring can be found at https://grafana.com/dashboards.

    6. Configuring Alerting with Alertmanager

    Create the notification configuration file alertmanager.yml:

    global:
      resolve_timeout: 2m
      smtp_smarthost: smtp.163.com:25
      smtp_from: 12345678@163.com
      smtp_auth_username: 12345678@163.com
      smtp_auth_password: '123456'  # authorization code issued by the mail provider
    
    templates:     # message templates
      - '/etc/alertmanager/template/wechat.tmpl'
    route:
      group_by: ['alertname_wechat']
      group_wait: 30s
      group_interval: 60s
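      receiver: 'wechat'    # default receiver: WeChat
      repeat_interval: 1h
      routes:  # child routes, delivered by email
      - receiver: email
        match_re:
          severity: email  # regex on the severity label; rules.yml sets Warning and Critical, so adjust as needed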
    receivers:
    - name: 'email'
      email_configs:
      - to: '11111122@qq.com'
        send_resolved: true  # also notify when the alert is resolved
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'wwd402ce40b1120f24' # WeChat Work corp ID
        to_party: '2'  # ID of the department (party) to notify
        agent_id: '1000002'
        api_secret: '9nmYa4pWq63sQ123kToCbh_oNc' # secret generated for the application
        send_resolved: true
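
    The routing tree decides which receiver gets an alert: child routes are checked in order, and an alert that matches none of them falls back to the top-level receiver. The sketch below models only that part, assuming a single level of child routes that match on the `severity` label with `match_re`; it ignores `match`, `continue`, and nested routes. Alertmanager anchors these regular expressions, which `re.fullmatch` imitates.

```python
import re


def pick_receiver(labels, route):
    """Return the receiver of the first matching child route, else the default."""
    for child in route.get("routes", []):
        matchers = child.get("match_re", {})
        # A missing label is treated as an empty string.
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items()):
            return child["receiver"]
    return route["receiver"]


# Simplified model of the route section above.
ROUTE = {
    "receiver": "wechat",
    "routes": [{"receiver": "email", "match_re": {"severity": "email"}}],
}

if __name__ == "__main__":
    print(pick_receiver({"alertname": "HostLoad", "severity": "Warning"}, ROUTE))
```

    With the rules in this article, which set severity to Warning or Critical, every alert ends up at the WeChat receiver unless the child route's pattern is changed.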
    

    Write the WeChat notification template:

    {{ define "wechat.default.message" }}
    {{ range $i, $alert :=.Alerts }}
    ========Monitoring Alert==========
    Status: {{   .Status }}
    Severity: {{ $alert.Labels.severity }}
    Alert name: {{ $alert.Labels.alertname }}
    Application: {{ $alert.Annotations.summary }}
    Host: {{ $alert.Labels.instance }}
    Details: {{ $alert.Annotations.description }}
    Triggering value: {{ $alert.Annotations.value }}
    Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
    ========end=============
    {{ end }}
    {{ end }}
    

    Deploying Alertmanager

    docker run -d -p 9093:9093 --name alertmanager  -v /root/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /root/alertmanager/template:/etc/alertmanager/template docker.io/prom/alertmanager:latest
    

    Open port 9093 of the Alertmanager host to see the current alert status.
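
    To check the notification path without waiting for a real rule to fire, a test alert can be posted directly to Alertmanager. The sketch below assumes an Alertmanager version that exposes the v2 API (/api/v2/alerts); the alert name and helper functions are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname, severity, summary, now=None):
    """Build the JSON body: a list with one alert that starts now."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {"alertname": alertname, "severity": severity},
            "annotations": {"summary": summary},
            "startsAt": now.isoformat(),
        }
    ]


def build_post(base_url, alerts):
    """Build the POST request that submits the alerts."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_post(
        "http://192.168.16.251:9093",
        build_test_alert("ManualTest", "Warning", "manual test alert"),
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        print(response.status)
```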


