美文网首页
Prometheus + Grafana (2) mysql、r

Prometheus + Grafana (2) mysql、r

作者: 骑着蜗牛去娶你_f136 | 来源:发表于2020-11-05 17:09 被阅读0次

    接着上一节 《Prometheus + Grafana (1) 监控 》,我们继续探讨 Prometheus + Grafana 的复杂应用

    实现目标

    这节我们的目标是搭建一个多维度监控微服务的可视化平台,包括Docker容器监控、MySQL监控、Redis监控和微服务JVM监控等,并且在必要的情况下可以发送预警邮件。

    主要用到的组件有Prometheus、Grafana、alertmanager、node_exporter、mysql_exporter、redis_exporter、cadvisor。各自作用如下所示:

    1. Prometheus:获取、存储监控数据,供第三方查询;
    2. Grafana:提供Web页面,从Prometheus获取监控数据可视化展示;
    3. alertmanager:定义预警规则,发送预警信息;
    4. node_exporter:收集微服务端点监控数据(与Prometheus一套);
    5. mysql_exporter:收集MySQL数据库监控数据;
    6. redis_exporter:收集Redis监控数据;
    7. cadvisor:收集Docker容器监控数据。

    使用docker安装 Grafana、Prometheus及监控服务

    上一节我们是直接使用的Windows下的安装软件安装Grafana和Prometheus,但是在我们的日常生产=环境中多是用的Linux,所以我们选择了方便的docker进行安装部署。

    • 在自己的挂载目录下创建 prometheus.yml
    #创建Prometheus挂载目录
    mkdir -p /dimples/volumes/prometheus
    
    #在该目录下创建Prometheus配置文件
    vim /dimples/volumes/prometheus/prometheus.yml
    
    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          # - alertmanager:9093
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:9090']
    
    • 在自己的挂载目录下创建 alertmanager.yml
    global:
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: '1126834403@qq.com'
      smtp_auth_username: '1126834403@qq.com'
      # qq邮箱获取的授权码
      smtp_auth_password: 'xxxxxxxxxxxxxxxxx'
      smtp_require_tls: false
    
    #templates:
    #  - '/alertmanager/template/*.tmpl'
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 5m
      receiver: 'default-receiver'
    
    receivers:
      - name: 'default-receiver'
        email_configs:
          - to: '2119713895@qq.com'
            send_resolved: true
    
    • 创建创建 docker-compose.yml 文件
    version: '3'
    
    services:
      prometheus:
        image: prom/prometheus
        container_name: prometheus
        volumes:
          - /dimples/volumes/prometheus/:/etc/prometheus/
        ports:
          - 9090:9090
        restart: on-failure
        command: 
          - '--web.enable-lifecycle '
      grafana:
        image: grafana/grafana
        container_name: grafana
        ports:
          - 3000:3000
      node_exporter:
        image: prom/node-exporter
        container_name: node_exporter
        ports:
          - 9100:9100
      redis_exporter:
        image: oliver006/redis_exporter
        container_name: redis_exporter
        command:
          - "--redis.addr=redis://127.0.0.1:6379"
          - "--redis.password 'ZHONG9602.class'"    # 认证密码,如果没有密码,该参数不需要
        ports:
          - 9101:9121
        restart: on-failure
      mysql_exporter:
        image: prom/mysqld-exporter
        container_name: mysql_exporter
        environment:
          - DATA_SOURCE_NAME=root:123456@(127.0.0.1:3306)/
        ports:
          - 9102:9104
      cadvisor:
        image: google/cadvisor
        container_name: cadvisor
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:rw
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro
        ports:
          - 9103:8080
      alertmanager:
        image: prom/alertmanager
        container_name: alertmanager
        volumes:
          - /dimples/volumes/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
        ports:
          - 9104:9093
    

    使用 docker-compose up -d 启动服务

    image
    # 不使用docker-compose安装
    docker run -d --name prometheus -p 9090:9090 -v /dimples/volumes/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
    
    docker run -d --name redis_exporter -p 9101:9121 oliver006/redis_exporter --redis.addr redis://127.0.0.1:6379 --redis.password 'ZHONG9602.class'
    
    • 测试是否监控到数据

    http://127.0.0.1:9090/alerts

    image

    如上图所示,我们刚刚定义的两个警告规则已经成功加载

    接着访问 http://127.0.0.1:9090/targets 观察在Prometheus配置文件里定义的各个job的状态:

    image

    可以看的都是监控的UP状态。

    还可以点击上面这个页面的各个 Endpoint 的链接,如果页面显示出了收集的数据,则说明各个Endpoint已经成功采集到了数据,以mysql_exporter为例子,访问
    http://127.0.0.1:9102/metrics

    image

    访问http://127.0.0.1:9104/#/status看看我们在alertmanager.yml配置的规则是否已经生效:

    image

    配置Java程序监控

    在上面的配置中我们简单的将Prometheus采集的对于自身的数据通过Grafana进行了展示,而我们的核心是通过Prometheus去采集Java应用的数据,这就需要针对前面提到的通过Prometheus的pull模式定时去拉取SpringBoot通过Actuator暴露的Micrometer采集的监控指标

    • 首先需要的做的是完成Java应用的Micrometer集成,访问actuator/prometheus或者/prometheus能够正常的返回Micrometer采集的数据指标(这一步操作在上节中已经很详细的介绍了,此处不再赘述)
    • 进入部署Prometheus的文件目录,打prometheus.yml进行拉取节点的配置,在配置文件的scrape_configs节点添加针对java的配置

    修改 prometheus.yml 配置所有监控服务

    在上面启动的 prometheus,我们没有配置任何的监控,所以我们要修改 prometheus.yml 文件,使其监控我们想监控的数据源,具体的修改内容如下图所示

    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['127.0.0.1:9090']
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['127.0.0.1:9100']
            labels:
              instance: 'node_exporter'
      - job_name: 'redis_exporter'
        static_configs:
          - targets: ['127.0.0.1:9101']
            labels:
              instance: 'redis_exporter'
      - job_name: 'mysql_exporter'
        static_configs:
          - targets: ['127.0.0.1:9102']
            labels:
              instance: 'mysql_exporter'
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['127.0.0.1:9103']
            labels:
              instance: 'cadvisor'
    
      - job_name: 'server-demo-actuator'
        metrics_path: '/actuator/prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['127.0.0.1:8001']
            labels:
              instance: 'server-demo'
    rule_files:
      - 'memory_over.yml'
      - 'server_down.yml'
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["127.0.0.1:9104"]
    

    PS: 每个服务的targets都是一个数组,可以收集多个服务器下的exporter提供的监控数据。

    接着创建上面提到的两个监控规则 memory_over.yml 和 server_down.yml

    # 创建 memory_over.yml
    vim /dimples/volumes/prometheus/memory_over.yml
    

    内容如下:

    groups:
      - name: server_down
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 20s
            labels:
              user: Dimples
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 s."
    

    当某个节点的内存使用率大于80%,并且持续时间大于20秒后,触发监控预警。

    接着创建 server_down.yml:

    # server_down.yml
    vim /dimples/volumes/prometheus/server_down.yml
    

    内容如下:

    groups:
      - name: server_down
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 20s
            labels:
              user: Dimples
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 s."
    

    当某个节点宕机(up==0表示宕机,1表示正常运行)超过20秒后,则触发监控。

    在 Grafana 中使用

    使用浏览器访问 http://127.0.0.1:9090,用户名密码为admin/admin,首次登录需要修改密码。

    第一步:首先需要添加数据源,上一节中已经详细介绍过了,此处不再赘述,结果如图:

    image

    添加数据源成功后,我们就可以添加监控面板了,同样的,我们可以去Grafana官方市场选择别人配置好的模板:https://grafana.com/grafana/dashboards

    此处我收集了几个好用的监控模板,已经上传到微云网盘,只需要下载然后导入即可( 链接:https://share.weiyun.com/XDzICKtf

    下面以 MySql 监控为例,演示导入模板:

    image

    点击 Upload JSON file 后,选择对应的文件,成功后会自动弹出一下界面,然后点击Import

    image image image image image

    额外补充

    alertmanager 丰富的预警配置

    groups:
    - name: example #定义规则组
      rules:
      - alert: InstanceDown  #定义报警名称
        expr: up == 0   #Promql语句,触发规则
        for: 1m            # 一分钟
        labels:       #标签定义报警的级别和主机
          name: instance
          severity: Critical
        annotations:  #注解
          summary: " {{ $labels.appname }}" #报警摘要,取报警信息的appname名称
          description: " 服务停止运行 "   #报警信息
          value: "{{ $value }}%"  # 当前报警状态值
    - name: Host
      rules:
      - alert: HostMemory Usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 >  80
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "宿主机内存使用率超过80%."
          value: "{{ $value }}"
      - alert: HostCPU Usage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance,appname) > 0.65
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: "宿主机CPU使用率超过65%."
          value: "{{ $value }}"
      - alert: HostLoad 
        expr: node_load5 > 4
        for: 1m
        labels:
          name: Load
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }} "
          description: " 主机负载5分钟超过4."
          value: "{{ $value }}"
      - alert: HostFilesystem Usage
        expr: 1-(node_filesystem_free_bytes / node_filesystem_size_bytes) >  0.8
        for: 1m
        labels:
          name: Disk
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: " 宿主机 [ {{ $labels.mountpoint }} ]分区使用超过80%."
          value: "{{ $value }}%"
      - alert: HostDiskio
        expr: irate(node_disk_writes_completed_total{job=~"Host"}[1m]) > 10
        for: 1m
        labels:
          name: Diskio
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: " 宿主机 [{{ $labels.device }}]磁盘1分钟平均写入IO负载较高."
          value: "{{ $value }}iops"
      - alert: Network_receive
        expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3 
        for: 1m
        labels:
          name: Network_receive
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: " 宿主机 [{{ $labels.device }}] 网卡5分钟平均接收流量超过3Mbps."
          value: "{{ $value }}3Mbps"
      - alert: Network_transmit
        expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576  > 3
        for: 1m
        labels:
          name: Network_transmit
          severity: Warning
        annotations:
          summary: " {{ $labels.appname }} "
          description: " 宿主机 [{{ $labels.device }}] 网卡5分钟内平均发送流量超过3Mbps."
          value: "{{ $value }}3Mbps"
    - name: Container
      rules:
      - alert: ContainerCPU Usage
        expr: (sum by(name,instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.name }} "
          description: " 容器CPU使用超过60%."
          value: "{{ $value }}%"
      - alert: ContainerMem Usage
    #    expr: (container_memory_usage_bytes - container_memory_cache)  / container_spec_memory_limit_bytes   * 100 > 10  
        expr:  container_memory_usage_bytes{name=~".+"}  / 1048576 > 1024
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.name }} "
          description: " 容器内存使用超过1GB."
          value: "{{ $value }}G"
    

    预警除了使用邮件外,也可以使用企业微信接收,可以参考:https://songjiayang.gitbooks.io/prometheus/content/alertmanager/wechat.html

    相关文章

      网友评论

          本文标题:Prometheus + Grafana (2) mysql、r

          本文链接:https://www.haomeiwen.com/subject/mxravktx.html